Fix issue #706: do not use locally cached last build ID. #776

kquick · 2020-06-05T04:55:39Z

Relies on database optimizations instead of local lookup
optimizations, but resolves the issue where the local ID was being
updated prematurely and leaving builds permanently (until the next
hydra restart) in the unstarted condition.

Previous to this change, builds were frequently left in a Pending status and required manual starting or a restart of the hydra-queue-runner. After implementing this change, the local hydra instance has run for several months without any builds left Pending.

Relies on database optimizations instead of local lookup optimizations, but resolves the issue where the local ID was being updated prematurely and leaving builds permanently (until the next hydra restart) in the unstarted condition.

edolstra · 2020-06-05T15:24:23Z

src/hydra-queue-runner/queue-monitor.cc

    {
        pqxx::work txn(conn);

        auto res = txn.exec_params
            ("select id, project, jobset, job, drvPath, maxsilent, timeout, timestamp, globalPriority, priority from Builds "
-             "where id > $1 and finished = 0 order by globalPriority desc, id",
-            lastBuildId);
+             "where finished = 0 order by globalPriority desc, id");


Doesn't this cause it to select the same builds over and over (at least until they're finished)?

Thanks for reviewing this, @edolstra.

As I understood the code, this particular section is building the newIDs list that is then processed (by the last for loop in this same function) to call createBuild which will then try createStep for the build's drvPath but the latter will find the step already existing in the buildsteps table and simply return that without adding to newSteps. The result of that will be a call to propagatePriorities and a message, which I observe periodically in the logs:

Jun 05 00:01:40 ___ hydra-queue-runner[14191]: added build 3836338 (top-level step /nix/store/___.drv, 0 new steps)

I suppose the propagatePriorities call and the log message could be skipped if newSteps.empty(). I can make this change if you agree with this tactic.

Ultimately yes, this causes the queue runner to do some extra work re-checking builds that are already queued until they finish, but I think that cost is relatively low compared to leaving stranded pending builds that never get processed. More efficiency could probably be gained by adding another state to to the Builds table (perhaps to buildStatus), but that would be a considerably more impactful change which I tried to avoid in this PR.

On hydra.nixos.org, we can have ~100000 unfinished builds in the database, so that means we would be constantly fetching the same 100000 builds from the database (and processing them). That's a lot of database traffic...

OK, that's about 2.5x of our local peak, so I can understand the scale of the your concern, @edolstra . Would you be open to a modification that used a DB field or status update of an existing field to indicate this processing has been completed? Or did you have other suggestions?

edolstra · 2020-06-10T09:10:14Z

The real bug here may be that

if (id > newLastBuildId) newLastBuildId = id;

interacts poorly with order by globalPriority desc, id. I.e. the original assumption was that builds get processed in strictly increasing ID order, but then the priority was added which messes that up.

kquick · 2020-06-11T05:38:25Z

I think there would be a bad interaction between globalPriority-based sorting and a local lastBuildId, but I don't think that's the culprit for my situation: there are only a handful of records in my 4 million row Builds table that don't have a globalPriority of 0.

I added an update patch (6c10510) that sets the startTime of the Build record when it is initially processed. These are all initially created as zero, and there is no other query that references startTime that does not also use stopTime (and usually finished), so this should be a viable indicator that the queue-monitor has processed the Build record that doesn't create problems elsewhere. This approach should cause each Build record to be processed only once, and it should also work just fine relative to the globalPriority ordering.

Please let me know what you think of this approach, @edolstra. Thanks!

kquick · 2020-06-19T05:30:54Z

@edolstra this latest change is running locally with good performance and not "lost" builds, so I think its ready for you to take another look and merge if it looks reasonable to you. Thanks!

kquick mentioned this pull request Jun 5, 2020

No step created for build #706

Open

edolstra reviewed Jun 5, 2020

View reviewed changes

Use startTime to prevent processing builds multiple times.

6c10510

Clarify Build.startTime comment.

192b75b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue #706: do not use locally cached last build ID. #776

Fix issue #706: do not use locally cached last build ID. #776

kquick commented Jun 5, 2020

edolstra Jun 5, 2020

kquick Jun 5, 2020

edolstra Jun 10, 2020

kquick Jun 10, 2020

edolstra commented Jun 10, 2020

kquick commented Jun 11, 2020

kquick commented Jun 19, 2020

Fix issue #706: do not use locally cached last build ID. #776

Are you sure you want to change the base?

Fix issue #706: do not use locally cached last build ID. #776

Conversation

kquick commented Jun 5, 2020

edolstra Jun 5, 2020

Choose a reason for hiding this comment

kquick Jun 5, 2020

Choose a reason for hiding this comment

edolstra Jun 10, 2020

Choose a reason for hiding this comment

kquick Jun 10, 2020

Choose a reason for hiding this comment

edolstra commented Jun 10, 2020

kquick commented Jun 11, 2020

kquick commented Jun 19, 2020