New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JobsetEvals: record evaluation errors #847
Conversation
Not sure about this. The reason we don't keep old evaluation logs is that they're huge (especially for the Nixpkgs/NixOS jobsets) so this would clutter the database a lot. |
What if we had a process by which only the most recent ~5 per jobset kept logs? This could be implemented as part of the evaluator (delete the jobset's old logs) for example. This would keep the clutter down, keep a bit of history, and possibly make the "how many" variable. |
@edolstra The alternative (current situation) is that we cannot trace when an evaluation error first started happening. It would help reduce the bounds to bisect from. Assuming this failed for four evals:
With the current situation, we have to bisect from With these changes (AFAIUI) we can assume at least one failure is found between And then, this is only for when evaluation is assumed to be reproducibly healthy. Imagine we face another issue like the one from before the new evaluator, where sometimes eval failed spuriously. Currently we lose the data about failed evals, which can make understanding, or even proving the issue harder. Though, keeping old historical data I understand is of much less value. Would it be less of an issue if they weren't kept in the database, but as files? |
What about moving away from storing eval logs in the database and instead do something like build logs either in the filesystem (/var/lib/hydra/eval-logs/) or S3 ? |
No, since it would be on the same disk. Uploading to S3 would probably be a bit too much effort.
Yeah that would be good. No need to add it to this PR, we can just run a systemd timer on hydra.nixos.org to delete old logs for the nixpkgs/nixos jobsets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not comfortable with the SQL parts and the migration. So don't assume I approve of it outright, but it does look okay.
There is that one </td>
that I think is wrong. Though I'm spinning up a dev instance with this PR and causing errors to see.
system("initdb -D postgres --locale C.UTF-8 ") == 0 or die; | ||
system("pg_ctl -D postgres -o \"-F -p 6433 -h '' -k /tmp \" -w start") == 0 or die; | ||
system("createdb -l C.UTF-8 -p 6433 hydra-test-suite") == 0 or die; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the live DB is setup the same, right?
Anything else locale-wise that could differ and bite us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h.n.o is actually using en_US.UTF-8 but that felt a bit too us-centric. The major issue this is addressing is we're writing wide characters to the DB, so I think any UTF-8 should do.
Otherwise tests may fail with wide character errors.
6b62d8e
to
bd99052
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, looks fine here.
🥳 thanks! |
This created a fairly significant amount of extra data transfer, as DBIx automatically fetches the column's value. It appears that disabling this column by default is not trivial, and that a simpler option would be having a separate table for evaluation errors. I'm going to work on this and try to get it in for review. In the meantime I've deleted all the evaluation messages. |
Today, Hydra records evaluation errors in the Jobsets table only. When a new evaluation happens, the old log is lost. This can be pretty annoying if you're trying to debug an issue and a new evaluation finishes while you're hunting.
This PR adds evaluation errors to the jobsetevals table and writes the evaluation log to it in addition to storing the latest evaluation error to jobsets.
The included migration copies the jobset's evaluation error output to the latest jobsetevals record for that jobset. It shouldn't take long to execute this migration: it takes a minute or so to run against hydra's production dataset, on a significantly underpowered server.
It looks like this: