JobsetEvals: record evaluation errors #847

grahamc · 2021-01-21T18:15:51Z

Today, Hydra records evaluation errors in the Jobsets table only. When a new evaluation happens, the old log is lost. This can be pretty annoying if you're trying to debug an issue and a new evaluation finishes while you're hunting.

This PR adds evaluation error columns to the jobsetevals table and writes the evaluation log there, in addition to storing the latest evaluation error in jobsets.

The included migration copies the jobset's evaluation error output to the latest jobsetevals record for that jobset. It shouldn't take long to execute this migration: it takes a minute or so to run against hydra's production dataset, on a significantly underpowered server.
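For readers who want to see the shape of that copy step, here is a minimal sketch. It is not the migration shipped in this PR: it uses Python's `sqlite3` as a stand-in for PostgreSQL, and the simplified schema, the `jobset_id` link, and the `copy_latest_errors` and `demo` names are assumptions made for illustration.

```python
import sqlite3

# Hypothetical, simplified schema; the real Hydra tables have many more columns.
SCHEMA = """
CREATE TABLE jobsets (id INTEGER PRIMARY KEY, errormsg TEXT, errortime INTEGER);
CREATE TABLE jobsetevals (
    id INTEGER PRIMARY KEY,
    jobset_id INTEGER NOT NULL REFERENCES jobsets(id),
    errormsg TEXT,
    errortime INTEGER
);
"""

def copy_latest_errors(conn):
    """Copy each jobset's stored error onto that jobset's newest evaluation row."""
    conn.execute(
        """
        UPDATE jobsetevals
           SET errormsg  = (SELECT j.errormsg  FROM jobsets j
                             WHERE j.id = jobsetevals.jobset_id),
               errortime = (SELECT j.errortime FROM jobsets j
                             WHERE j.id = jobsetevals.jobset_id)
         WHERE id = (SELECT MAX(e.id) FROM jobsetevals e
                      WHERE e.jobset_id = jobsetevals.jobset_id)
        """
    )
    conn.commit()

def demo():
    """Run the copy against a tiny in-memory database and return the result."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO jobsets VALUES (1, 'attribute missing', 1611252951)")
    conn.execute("INSERT INTO jobsets VALUES (2, NULL, NULL)")
    conn.executemany(
        "INSERT INTO jobsetevals (id, jobset_id) VALUES (?, ?)",
        [(10, 1), (11, 1), (20, 2)],
    )
    copy_latest_errors(conn)
    return conn.execute("SELECT id, errormsg FROM jobsetevals ORDER BY id").fetchall()
```

In this sketch only the newest evaluation of each jobset is touched, so older evaluation rows keep an empty error, which matches the description above.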

It looks like this: [screenshot not preserved]

edolstra · 2021-01-21T20:18:29Z

Not sure about this. The reason we don't keep old evaluation logs is that they're huge (especially for the Nixpkgs/NixOS jobsets) so this would clutter the database a lot.

grahamc · 2021-01-21T20:19:28Z

What if we had a process by which only the most recent ~5 per jobset kept logs? This could be implemented as part of the evaluator (delete the jobset's old logs) for example. This would keep the clutter down, keep a bit of history, and possibly make the "how many" variable.
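A rough sketch of what such a pruning step might look like, assuming the error log lives in `errormsg`/`errortime` columns on jobsetevals and that evaluations link to their jobset through a `jobset_id` column. Python's `sqlite3` stands in for PostgreSQL here, and `prune_eval_logs` and `make_demo_db` are hypothetical names, not part of Hydra.

```python
import sqlite3

def make_demo_db():
    """Build an in-memory stand-in: jobset 1 has 7 evals with logs, jobset 2 has 2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobsetevals ("
        " id INTEGER PRIMARY KEY, jobset_id INTEGER NOT NULL,"
        " errormsg TEXT, errortime INTEGER)"
    )
    rows = [(i, 1, f"error in eval {i}", 1611250000 + i) for i in range(1, 8)]
    rows += [(i, 2, f"error in eval {i}", 1611250000 + i) for i in (8, 9)]
    conn.executemany("INSERT INTO jobsetevals VALUES (?, ?, ?, ?)", rows)
    return conn

def prune_eval_logs(conn, keep=5):
    """Clear logs on all but the `keep` newest evaluations of each jobset.

    Returns the number of evaluation rows whose log was cleared.
    """
    cur = conn.execute(
        """
        UPDATE jobsetevals
           SET errormsg = NULL, errortime = NULL
         WHERE errormsg IS NOT NULL
           AND id NOT IN (SELECT e.id FROM jobsetevals e
                           WHERE e.jobset_id = jobsetevals.jobset_id
                           ORDER BY e.id DESC
                           LIMIT ?)
        """,
        (keep,),
    )
    conn.commit()
    return cur.rowcount
```

The evaluation rows themselves are kept and only the log columns are cleared, and `keep` is a parameter so the "how many" stays adjustable. Running it a second time clears nothing, so it would be safe to call after every evaluation or from a timer.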

samueldr · 2021-01-21T21:24:16Z

@edolstra The alternative (current situation) is that we cannot trace when an evaluation error first started happening. It would help reduce the bounds to bisect from.

Assuming this failed for four evals:

last good eval (aaaaaaaa...bbbbbbbb)
first failure (bbbbbbbb...cccccccc)
(repeat) (cccccccc...dddddddd)
(repeat) (dddddddd...eeeeeeee)
(repeat) (eeeeeeee...ffffffff)

With the current situation, we have to bisect from bbbbbbbb to ffffffff.

With these changes (AFAIUI) we can assume at least one failure is found between bbbbbbbb and cccccccc.
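As a toy model of that argument (not anything Hydra does itself), the sketch below computes the range to bisect from a list of evaluations. The `Eval` type and `bisect_range` function are hypothetical; `failed=None` stands for an evaluation whose outcome was not recorded.

```python
from typing import List, NamedTuple, Optional, Tuple

class Eval(NamedTuple):
    old_rev: str
    new_rev: str
    failed: Optional[bool]  # None: the outcome was not recorded

def bisect_range(evals: List[Eval]) -> Optional[Tuple[Optional[str], str]]:
    """Return (last known-good rev, first known-bad rev), scanning oldest to newest.

    Evaluations with an unknown outcome cannot tighten either bound.
    The first element is None if no good evaluation precedes the failure.
    """
    last_good = None
    for ev in evals:
        if ev.failed is False:
            last_good = ev.new_rev
        elif ev.failed is True:
            return (last_good, ev.new_rev)
    return None  # no recorded failure

# Only the latest error is recorded: the intermediate outcomes are unknown.
latest_only = [
    Eval("aaaaaaaa", "bbbbbbbb", False),
    Eval("bbbbbbbb", "cccccccc", None),
    Eval("cccccccc", "dddddddd", None),
    Eval("dddddddd", "eeeeeeee", None),
    Eval("eeeeeeee", "ffffffff", True),
]

# Every evaluation keeps its own error record.
per_eval = [ev._replace(failed=True) if ev.failed is None else ev for ev in latest_only]
```

With `latest_only` the range comes out as bbbbbbbb to ffffffff, and with `per_eval` it shrinks to bbbbbbbb to cccccccc.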

And then, this is only for when evaluation is assumed to be reproducibly healthy.

Imagine we face another issue like the one from before the new evaluator, where eval sometimes failed spuriously. Currently we lose the data about failed evals, which can make understanding, or even proving, the issue harder.

Though I understand that keeping old historical data is of much less value. Would it be less of an issue if they weren't kept in the database, but as files?

AmineChikhaoui · 2021-01-21T21:29:49Z

What about moving away from storing eval logs in the database and instead storing them like build logs, either on the filesystem (/var/lib/hydra/eval-logs/) or in S3?

edolstra · 2021-01-21T21:47:00Z

> Would it be less of an issue if they weren't kept in the database, but as files?

No, since it would be on the same disk. Uploading to S3 would probably be a bit too much effort.

> What if we had a process by which only the most recent ~5 per jobset kept logs?

Yeah that would be good. No need to add it to this PR, we can just run a systemd timer on hydra.nixos.org to delete old logs for the nixpkgs/nixos jobsets.

samueldr

I'm not comfortable with the SQL parts and the migration. So don't assume I approve of it outright, but it does look okay.

There is that one </td> that I think is wrong. Though I'm spinning up a dev instance with this PR and causing errors to see.

samueldr · 2021-01-21T20:10:33Z

tests/set-up.pl

+system("initdb -D postgres --locale C.UTF-8 ") == 0 or die;
+system("pg_ctl -D postgres -o \"-F -p 6433  -h '' -k /tmp \" -w start") == 0 or die;
+system("createdb -l C.UTF-8 -p 6433 hydra-test-suite") == 0 or die;


I assume the live DB is set up the same, right?

Anything else locale-wise that could differ and bite us?

grahamc

h.n.o is actually using en_US.UTF-8, but that felt a bit too US-centric. The major issue this addresses is that we're writing wide characters to the DB, so I think any UTF-8 locale should do.

src/root/common.tt

Otherwise tests may fail with wide character errors.

samueldr

FWIW, looks fine here.

grahamc · 2021-01-22T14:09:48Z

🥳 thanks!

grahamc · 2021-02-01T21:27:25Z

This created a fairly significant amount of extra data transfer, as DBIx automatically fetches the column's value. It appears that disabling this column by default is not trivial, and that a simpler option would be having a separate table for evaluation errors. I'm going to work on this and try to get it in for review. In the meantime I've deleted all the evaluation messages.

grahamc added 4 commits January 21, 2021 13:10

Schema: add errorMsg, errorTime to JobsetEvals

d9989b7

gitignore: artifacts

fb6b10a

hydra-eval-jobs: write evaluation errorMsg to the jobseteval table

086eed5

Evaluation page: render evaluation errors

805dd6e

andir approved these changes Jan 21, 2021

samueldr suggested changes Jan 21, 2021

grahamc added 2 commits January 21, 2021 17:08

jobset page: render error labels per eval

c64c4aa

tests: create database with the utf-8 locale

bd99052

Otherwise tests may fail with wide character errors.

grahamc force-pushed the jobsetevals-evaluation-errors branch from 6b62d8e to bd99052 January 21, 2021 22:08

samueldr approved these changes Jan 21, 2021

grahamc mentioned this pull request Jan 22, 2021

Normalize nixexpr{input,path} from builds to jobsetevals. #848

Merged

edolstra merged commit 53c2fc2 into NixOS:master Jan 22, 2021

grahamc deleted the jobsetevals-evaluation-errors branch January 22, 2021 14:09

grahamc mentioned this pull request Jan 28, 2021

Hydra Maintenance: Complete. Please report issues here. NixOS/nixpkgs#111017

Closed
