JobsetEvals: record evaluation errors #847

Merged
merged 6 commits into NixOS:master from jobsetevals-evaluation-errors on Jan 22, 2021

grahamc
Member

@grahamc grahamc commented Jan 21, 2021

Today, Hydra records evaluation errors in the Jobsets table only. When a new evaluation happens, the old log is lost. This can be pretty annoying if you're trying to debug an issue and a new evaluation finishes while you're hunting.

This PR adds evaluation errors to the jobsetevals table and writes the evaluation log to it in addition to storing the latest evaluation error to jobsets.

The included migration copies the jobset's evaluation error output to the latest jobsetevals record for that jobset. It shouldn't take long to execute this migration: it takes a minute or so to run against hydra's production dataset, on a significantly underpowered server.
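
The backfill step described above can be sketched roughly as follows. This is a minimal illustration using Python's `sqlite3` with a simplified, hypothetical schema (`jobset_id`, `errormsg`, and `errortime` are assumed column names, and the highest `id` is assumed to be the latest evaluation); it is not the migration shipped in this PR, which targets PostgreSQL.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory stand-in for the two tables (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobsets (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            errormsg TEXT,
            errortime INTEGER
        );
        CREATE TABLE jobsetevals (
            id INTEGER PRIMARY KEY,
            jobset_id INTEGER NOT NULL REFERENCES jobsets(id),
            errormsg TEXT,      -- assumed name for the newly added column
            errortime INTEGER   -- assumed name for the newly added column
        );
        """
    )
    conn.executemany(
        "INSERT INTO jobsets VALUES (?, ?, ?, ?)",
        [(1, "demo-a", "error: attribute missing", 1000), (2, "demo-b", None, None)],
    )
    conn.executemany(
        "INSERT INTO jobsetevals (id, jobset_id) VALUES (?, ?)",
        [(10, 1), (11, 1), (20, 2)],
    )
    return conn


def backfill_latest_eval_errors(conn):
    """Copy each jobset's stored error onto its most recent evaluation row only."""
    conn.execute(
        """
        UPDATE jobsetevals
           SET errormsg = (SELECT j.errormsg FROM jobsets j
                            WHERE j.id = jobsetevals.jobset_id),
               errortime = (SELECT j.errortime FROM jobsets j
                             WHERE j.id = jobsetevals.jobset_id)
         WHERE id IN (SELECT MAX(id) FROM jobsetevals GROUP BY jobset_id)
        """
    )
    conn.commit()


if __name__ == "__main__":
    db = make_demo_db()
    backfill_latest_eval_errors(db)
    for row in db.execute("SELECT id, jobset_id, errormsg FROM jobsetevals ORDER BY id"):
        print(row)
```

Older evaluation rows stay empty in this sketch, since the previous logs were never kept and cannot be recovered.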

It looks like this: (screenshot not preserved)


@edolstra
Member

Not sure about this. The reason we don't keep old evaluation logs is that they're huge (especially for the Nixpkgs/NixOS jobsets) so this would clutter the database a lot.

@grahamc
Member Author

grahamc commented Jan 21, 2021

What if we had a process by which only the most recent ~5 per jobset kept logs? This could be implemented as part of the evaluator (delete the jobset's old logs) for example. This would keep the clutter down, keep a bit of history, and possibly make the "how many" variable.
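
A rough sketch of what such a retention pass might look like, assuming a simplified, hypothetical `jobsetevals` table with `jobset_id` and `errormsg` columns and assuming that a higher `id` means a newer evaluation. The `keep` parameter is the "how many" knob; the function name is made up for illustration.

```python
import sqlite3


def make_demo_db():
    """In-memory demo data: jobset 1 has seven evals with logs, jobset 2 has two."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobsetevals ("
        "id INTEGER PRIMARY KEY, jobset_id INTEGER NOT NULL, errormsg TEXT)"
    )
    rows = [(i, 1, f"log {i}") for i in range(1, 8)]
    rows += [(8, 2, "log 8"), (9, 2, "log 9")]
    conn.executemany("INSERT INTO jobsetevals VALUES (?, ?, ?)", rows)
    return conn


def prune_old_eval_logs(conn, keep=5):
    """Clear stored logs on all but the `keep` newest evaluations of each jobset.

    Returns the ids of the evaluations whose logs were cleared.
    """
    rows = conn.execute(
        "SELECT id, jobset_id FROM jobsetevals ORDER BY jobset_id, id DESC"
    ).fetchall()
    seen = {}   # jobset_id -> number of evaluations visited so far (newest first)
    stale = []  # ids that fall outside the newest `keep` for their jobset
    for eval_id, jobset_id in rows:
        seen[jobset_id] = seen.get(jobset_id, 0) + 1
        if seen[jobset_id] > keep:
            stale.append((eval_id,))
    conn.executemany("UPDATE jobsetevals SET errormsg = NULL WHERE id = ?", stale)
    conn.commit()
    return [eval_id for (eval_id,) in stale]


if __name__ == "__main__":
    db = make_demo_db()
    print("cleared logs for evals:", prune_old_eval_logs(db, keep=5))
```

The evaluation rows themselves are kept in this sketch; only the log text is dropped, so the history of which evaluations existed is unaffected.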

@samueldr
Member

@edolstra The alternative (current situation) is that we cannot trace when an evaluation error first started happening. It would help reduce the bounds to bisect from.

Assuming this failed for four evals:

  • last good eval (aaaaaaaa...bbbbbbbb)
  • first failure (bbbbbbbb...cccccccc)
  • (repeat) (cccccccc...dddddddd)
  • (repeat) (dddddddd...eeeeeeee)
  • (repeat) (eeeeeeee...ffffffff)

With the current situation, we have to bisect from bbbbbbbb to ffffffff.

With these changes (AFAIUI) we can assume at least one failure is found between bbbbbbbb and cccccccc.
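
To make the narrowing concrete, here is a small sketch with a hypothetical `Eval` record holding the input revision range and a failure flag (neither name comes from Hydra). With per-evaluation errors recorded, the earliest evaluation in the current failing streak gives the range to bisect; without them, only the whole span is known.

```python
from typing import List, NamedTuple, Optional, Tuple


class Eval(NamedTuple):
    start_rev: str  # input revision of the previous evaluation
    end_rev: str    # input revision this evaluation was run against
    failed: bool    # whether an evaluation error was recorded


def narrow_bisect_range(evals: List[Eval]) -> Optional[Tuple[str, str]]:
    """Range covered by the first failure of the trailing failing streak."""
    first_failed = None
    for ev in reversed(evals):  # walk back from the newest evaluation
        if not ev.failed:
            break
        first_failed = ev
    if first_failed is None:
        return None  # the newest evaluation is healthy, nothing to bisect
    return (first_failed.start_rev, first_failed.end_rev)


def wide_bisect_range(evals: List[Eval]) -> Optional[Tuple[str, str]]:
    """Range to bisect when only the latest error is known (no history)."""
    narrow = narrow_bisect_range(evals)
    if narrow is None:
        return None
    return (narrow[0], evals[-1].end_rev)


if __name__ == "__main__":
    history = [
        Eval("aaaaaaaa", "bbbbbbbb", False),
        Eval("bbbbbbbb", "cccccccc", True),
        Eval("cccccccc", "dddddddd", True),
        Eval("dddddddd", "eeeeeeee", True),
        Eval("eeeeeeee", "ffffffff", True),
    ]
    print("without history:", wide_bisect_range(history))
    print("with history:   ", narrow_bisect_range(history))
```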


And then, this is only for when evaluation is assumed to be reproducibly healthy.

Imagine we face another issue like the one from before the new evaluator, where sometimes eval failed spuriously. Currently we lose the data about failed evals, which can make understanding, or even proving the issue harder.


Though, keeping old historical data I understand is of much less value. Would it be less of an issue if they weren't kept in the database, but as files?

@AmineChikhaoui
Member

What about moving away from storing eval logs in the database and instead storing them the way build logs are, either on the filesystem (/var/lib/hydra/eval-logs/) or in S3?

@edolstra
Member

> Would it be less of an issue if they weren't kept in the database, but as files?

No, since it would be on the same disk. Uploading to S3 would probably be a bit too much effort.

> What if we had a process by which only the most recent ~5 per jobset kept logs?

Yeah that would be good. No need to add it to this PR, we can just run a systemd timer on hydra.nixos.org to delete old logs for the nixpkgs/nixos jobsets.

Member

@samueldr samueldr left a comment

I'm not comfortable with the SQL parts and the migration. So don't assume I approve of it outright, but it does look okay.


There is that one </td> that I think is wrong. Though I'm spinning up a dev instance with this PR and causing errors to see.

Comment on lines +2 to +4
system("initdb -D postgres --locale C.UTF-8 ") == 0 or die;
system("pg_ctl -D postgres -o \"-F -p 6433 -h '' -k /tmp \" -w start") == 0 or die;
system("createdb -l C.UTF-8 -p 6433 hydra-test-suite") == 0 or die;
Member

I assume the live DB is set up the same, right?

Anything else locale-wise that could differ and bite us?

Member Author

@grahamc grahamc Jan 21, 2021

h.n.o is actually using en_US.UTF-8, but that felt a bit too US-centric. The major issue this is addressing is that we're writing wide characters to the DB, so I think any UTF-8 locale should do.

src/root/common.tt (outdated, resolved)
@grahamc grahamc force-pushed the jobsetevals-evaluation-errors branch from 6b62d8e to bd99052 Compare January 21, 2021 22:08
Member

@samueldr samueldr left a comment

FWIW, looks fine here.

@edolstra edolstra merged commit 53c2fc2 into NixOS:master Jan 22, 2021
@grahamc grahamc deleted the jobsetevals-evaluation-errors branch January 22, 2021 14:09
@grahamc
Member Author

grahamc commented Jan 22, 2021

🥳 thanks!

@grahamc
Member Author

grahamc commented Feb 1, 2021

This created a fairly significant amount of extra data transfer, as DBIx automatically fetches the column's value. It appears that disabling this column by default is not trivial, and that a simpler option would be having a separate table for evaluation errors. I'm going to work on this and try to get it in for review. In the meantime I've deleted all the evaluation messages.
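
One way to picture the separate-table idea, as a sketch only: the table and column names below (`evaluationerrors`, `eval_id`) are hypothetical and this is not necessarily the design that was later implemented. The point is that listing evaluations never touches the large log text, which is read only when a specific error is requested.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobsetevals (
    id INTEGER PRIMARY KEY,
    jobset_id INTEGER NOT NULL,
    timestamp INTEGER NOT NULL
);
-- Hypothetical side table: the large text lives here, not on jobsetevals.
CREATE TABLE evaluationerrors (
    eval_id INTEGER PRIMARY KEY REFERENCES jobsetevals(id),
    errormsg TEXT NOT NULL,
    errortime INTEGER NOT NULL
);
"""


def make_demo_db():
    """In-memory demo: two evaluations, only the second one has an error."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO jobsetevals VALUES (?, ?, ?)", [(1, 1, 1000), (2, 1, 2000)]
    )
    conn.execute(
        "INSERT INTO evaluationerrors VALUES (?, ?, ?)",
        (2, "error: undefined variable", 2000),
    )
    return conn


def list_evals(conn, jobset_id):
    """Fetch whole evaluation rows; no log text is transferred."""
    return conn.execute(
        "SELECT * FROM jobsetevals WHERE jobset_id = ? ORDER BY id DESC",
        (jobset_id,),
    ).fetchall()


def eval_error(conn, eval_id):
    """Fetch the log for one evaluation on demand, or None if it had no error."""
    row = conn.execute(
        "SELECT errormsg FROM evaluationerrors WHERE eval_id = ?", (eval_id,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    db = make_demo_db()
    print(list_evals(db, 1))
    print(eval_error(db, 2))
```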


5 participants