New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue #706: do not use locally cached last build ID. #776
base: master
Are you sure you want to change the base?
Conversation
Relies on database optimizations instead of local lookup optimizations, but resolves the issue where the local ID was being updated prematurely and leaving builds permanently (until the next hydra restart) in the unstarted condition.
{ | ||
pqxx::work txn(conn); | ||
|
||
auto res = txn.exec_params | ||
("select id, project, jobset, job, drvPath, maxsilent, timeout, timestamp, globalPriority, priority from Builds " | ||
"where id > $1 and finished = 0 order by globalPriority desc, id", | ||
lastBuildId); | ||
"where finished = 0 order by globalPriority desc, id"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this cause it to select the same builds over and over (at least until they're finished)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing this, @edolstra.
As I understood the code, this particular section is building the newIDs
list that is then processed (by the last for
loop in this same function) to call createBuild
which will then try createStep
for the build's drvPath
but the latter will find the step already existing in the buildsteps
table and simply return that without adding to newSteps
. The result of that will be a call to propagatePriorities
and a message, which I observe periodically in the logs:
Jun 05 00:01:40 ___ hydra-queue-runner[14191]: added build 3836338 (top-level step /nix/store/___.drv, 0 new steps)
I suppose the propagatePriorities
call and the log message could be skipped if newSteps.empty()
. I can make this change if you agree with this tactic.
Ultimately yes, this causes the queue runner to do some extra work re-checking builds that are already queued until they finish, but I think that cost is relatively low compared to leaving stranded pending builds that never get processed. More efficiency could probably be gained by adding another state to to the Builds table (perhaps to buildStatus
), but that would be a considerably more impactful change which I tried to avoid in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On hydra.nixos.org, we can have ~100000 unfinished builds in the database, so that means we would be constantly fetching the same 100000 builds from the database (and processing them). That's a lot of database traffic...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's about 2.5x of our local peak, so I can understand the scale of the your concern, @edolstra . Would you be open to a modification that used a DB field or status update of an existing field to indicate this processing has been completed? Or did you have other suggestions?
The real bug here may be that
interacts poorly with |
I think there would be a bad interaction between I added an update patch (6c10510) that sets the Please let me know what you think of this approach, @edolstra. Thanks! |
@edolstra this latest change is running locally with good performance and not "lost" builds, so I think its ready for you to take another look and merge if it looks reasonable to you. Thanks! |
Relies on database optimizations instead of local lookup
optimizations, but resolves the issue where the local ID was being
updated prematurely and leaving builds permanently (until the next
hydra restart) in the unstarted condition.
Previous to this change, builds were frequently left in a Pending status and required manual starting or a restart of the hydra-queue-runner. After implementing this change, the local hydra instance has run for several months without any builds left Pending.