Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue #706: do not use locally cached last build ID. #776

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

kquick
Copy link
Contributor

@kquick kquick commented Jun 5, 2020

Relies on database optimizations instead of local lookup
optimizations, but resolves the issue where the local ID was being
updated prematurely and leaving builds permanently (until the next
hydra restart) in the unstarted condition.

Previous to this change, builds were frequently left in a Pending status and required manual starting or a restart of the hydra-queue-runner. After implementing this change, the local hydra instance has run for several months without any builds left Pending.

Relies on database optimizations instead of local lookup
optimizations, but resolves the issue where the local ID was being
updated prematurely and leaving builds permanently (until the next
hydra restart) in the unstarted condition.
{
pqxx::work txn(conn);

auto res = txn.exec_params
("select id, project, jobset, job, drvPath, maxsilent, timeout, timestamp, globalPriority, priority from Builds "
"where id > $1 and finished = 0 order by globalPriority desc, id",
lastBuildId);
"where finished = 0 order by globalPriority desc, id");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this cause it to select the same builds over and over (at least until they're finished)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing this, @edolstra.

As I understood the code, this particular section is building the newIDs list that is then processed (by the last for loop in this same function) to call createBuild which will then try createStep for the build's drvPath but the latter will find the step already existing in the buildsteps table and simply return that without adding to newSteps. The result of that will be a call to propagatePriorities and a message, which I observe periodically in the logs:

Jun 05 00:01:40 ___ hydra-queue-runner[14191]: added build 3836338 (top-level step /nix/store/___.drv, 0 new steps)

I suppose the propagatePriorities call and the log message could be skipped if newSteps.empty(). I can make this change if you agree with this tactic.

Ultimately yes, this causes the queue runner to do some extra work re-checking builds that are already queued until they finish, but I think that cost is relatively low compared to leaving stranded pending builds that never get processed. More efficiency could probably be gained by adding another state to to the Builds table (perhaps to buildStatus), but that would be a considerably more impactful change which I tried to avoid in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On hydra.nixos.org, we can have ~100000 unfinished builds in the database, so that means we would be constantly fetching the same 100000 builds from the database (and processing them). That's a lot of database traffic...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that's about 2.5x of our local peak, so I can understand the scale of the your concern, @edolstra . Would you be open to a modification that used a DB field or status update of an existing field to indicate this processing has been completed? Or did you have other suggestions?

@edolstra
Copy link
Member

The real bug here may be that

if (id > newLastBuildId) newLastBuildId = id;

interacts poorly with order by globalPriority desc, id. I.e. the original assumption was that builds get processed in strictly increasing ID order, but then the priority was added which messes that up.

@kquick
Copy link
Contributor Author

kquick commented Jun 11, 2020

I think there would be a bad interaction between globalPriority-based sorting and a local lastBuildId, but I don't think that's the culprit for my situation: there are only a handful of records in my 4 million row Builds table that don't have a globalPriority of 0.

I added an update patch (6c10510) that sets the startTime of the Build record when it is initially processed. These are all initially created as zero, and there is no other query that references startTime that does not also use stopTime (and usually finished), so this should be a viable indicator that the queue-monitor has processed the Build record that doesn't create problems elsewhere. This approach should cause each Build record to be processed only once, and it should also work just fine relative to the globalPriority ordering.

Please let me know what you think of this approach, @edolstra. Thanks!

@kquick
Copy link
Contributor Author

kquick commented Jun 19, 2020

@edolstra this latest change is running locally with good performance and not "lost" builds, so I think its ready for you to take another look and merge if it looks reasonable to you. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants