Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for build users when none are available #3564

Merged
merged 4 commits into from May 18, 2020

Conversation

balsoft
Copy link
Member

@balsoft balsoft commented May 5, 2020

Solves #1216

Descriptions from there:

It would be nice to have the option of waiting for a build user to become free instead of failing to build.
I'm talking about this error: error: all build users are currently in use; consider creating additional users and adding them to the ‘nixbld’ group
Is there ever a case where failing is the preferred option to waiting? Hopefully waiting can become the default behaviour.

Additional implementation details: when build users are available, the behavior is exactly the same as before (choose the first available user, take a lock on it, use it). However, when no build users are available at the moment, repeat the search every two seconds. Whenever a build user becomes available, choose it and continue as usual. Previous behavior was to error out.

Tested on:

  • NixOS
  • Non-NixOS Linux
  • Darwin
  • other platforms.

Note:
I have tried implementing a more "proper" solution with libfswatch (to only check the userpool when it changes), but for some reason the callback didn't trigger when lock files were released. If you believe that my work on implementing this with libfswatch is of interest, I can publish it.

@domenkozar
Copy link
Member

Thank you a thousand times!

@LnL7
Copy link
Member

LnL7 commented May 5, 2020

I think this also resolves the fact that we make 32 build users in the installer, this was mostly the reason for it.

@balsoft
Copy link
Member Author

balsoft commented May 5, 2020

I think this also resolves the fact that we make 32 build users in the installer, this was mostly the reason for it.

This is my next nixpkgs PR, yes ;)

src/libstore/build.cc Outdated Show resolved Hide resolved
@edolstra
Copy link
Member

edolstra commented May 6, 2020

This is not a good solution because it puts the entire build loop to sleep (since it's single-threaded). So no other builds/substitutions will make any progress until another process releases a build user. In fact, it seems that this will deadlock if this process itself has acquired all build users (since then it can't make progress to release any build users).

For comparison, if we can't acquire a lock on an output path, the DerivationGoal is put to sleep so that other goals can make progress:

    if (!outputLocks.lockPaths(lockFiles, "", false)) {
        worker.waitForAWhile(shared_from_this());
        return;
    }

@balsoft
Copy link
Member Author

balsoft commented May 6, 2020

@edolstra Thanks for the suggestion, I'll implement the same logic.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

Please, re-evaluate if possible.

I have changed the behaviour to match that of output locks. However, now #2069 is even more of an issue.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

My testing methodology, for anyone interested:

  1. Run nix-daemon from this version of nix
  2. Create a file
    test-nix.nix
with import <nixpkgs> {};
let newPackage = n: hello.overrideAttrs (_: { name = "hello-${toString builtins.currentTime}-${toString n}"; });
in buildEnv { name = "env"; ignoreCollisions = true; paths = builtins.genList newPackage 100; }
  1. nix build -f test-nix.nix --max-jobs $BUILDUSERS+3 (to test for deadlock mentioned above)
  2. nix build -f test-nix.nix as many times as needed to run out of build users.

If the last two commands succeed and use up all available build users while not crashing and not using users two times over, my patch is correct.

I'd be very grateful if someone tested this and reported the results on non-NixOS linux and/or Mac.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

@domenkozar I had a temptation to do so here, but I think it merits a separate PR. I'll create one now.

Actually, I think it's slightly more complex than I anticipated. We need to monitor for changes of the reason for waiting.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

I'll come back to #2069 when I have time.

@balsoft
Copy link
Member Author

balsoft commented May 8, 2020

@domenkozar fixed (architecturally) in #3577, will rebase that if/when this PR is merged.

@domenkozar domenkozar requested a review from edolstra May 11, 2020 16:15
@@ -530,6 +532,12 @@ class UserLock
UserLock::UserLock()
{
assert(settings.buildUsersGroup != "");
createDirs(settings.nixStateDir + "/userpool");
/* Mark that user is not enabled by default */
uid = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 seems a risky choice as a sentinel value for UIDs given that 0 == root. So if there's ever any confusion, we might end up doing an operation as/to root. Might be better to use -1 or std::optional<uid_t>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, to ensure that UserLock always represents a valid user, have a builder function that returns std::unique_ptr<UserLock> so it can represent failure to acquire a user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought about adding a function that can represent a failure, and it does appear to be a better solution from some perspectives, but I didn't want to make the diff too big (since this is my first PR). Now that it's inevitably going to be massive, I'll wrap the result in optional and make it represent failure.

@@ -1391,6 +1394,30 @@ void DerivationGoal::tryToBuild()
{
trace("trying to build");

/* If `build-users-group' is not empty, then we have to build as
one of the members of that group. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, it will try to acquire a build user even when we're doing a remote build, which is unnecessary. Maybe it's better to add a new state to the state machine, e.g. startBuild, with a transition from tryToBuild to startBuild if we're doing a local build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, this sounds very much like something I didn't even think about.

I'll try to wrap my head around this and implement it next weekend.

Thank you again!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a state and (hopefully) fixed this.

@balsoft
Copy link
Member Author

balsoft commented May 14, 2020

@edolstra I've (hopefully) fixed both of issues you've found, please look at it again when you have time.

The CI seems to be slightly broken (it can't find a builder for macOS).

I've tested this change on my NixOS desktop as explained above.

@edolstra edolstra merged commit 0ed946a into NixOS:master May 18, 2020
@edolstra
Copy link
Member

Thanks!

@osa1
Copy link

osa1 commented Jun 23, 2021

Sorry for reviving and old thread and n00b question, but, what does it mean if I'm still getting "all build users are currently in use" errors today? Does that mean my nix setup is too old and does not have this patch? Or could it be a configuration issue?

@balsoft
Copy link
Member Author

balsoft commented Jun 23, 2021

This patch never became part of the 2.3 (stable) branch, so it's only available on master/2.4pre right now. @edolstra is there any chance to backport this to 2.3?

@osa1
Copy link

osa1 commented Jun 23, 2021

Thanks for the quick response. Do we have a public URL for 2.4pre that I can use? I'm using https://github.com/cachix/install-nix-action which allows specifying install URL (default is https://nixos.org/nix/install).

@balsoft
Copy link
Member Author

balsoft commented Jun 23, 2021

Please see https://github.com/cachix/install-nix-action#usage-with-flakes (except you don't need all the experimental-features bits, just copy install_url and install_options)

osa1 added a commit to dfinity/motoko that referenced this pull request Jun 24, 2021
This version should include NixOS/nix#3564 and
allow us to increase max-jobs. See #2621 for the previous attempt,
reverted in 8365e73.

The install URL is obtained from https://github.com/NixOS/nix/runs/2895674848 (NixOS/nix@323e545)
@balsoft balsoft deleted the wait-for-builders branch June 24, 2021 12:47
osa1 added a commit to dfinity/motoko that referenced this pull request Jun 28, 2021
This version should include NixOS/nix#3564 and
allow us to increase max-jobs. See #2621 for the previous attempt,
reverted in 8365e73.

The install URL is obtained from https://github.com/NixOS/nix/runs/2895674848 (NixOS/nix@323e545)
osa1 added a commit to dfinity/motoko that referenced this pull request Jun 28, 2021
This version should include NixOS/nix#3564 and
allow us to increase max-jobs. See #2621 for the previous attempt,
reverted in 8365e73.

The install URL is obtained from https://github.com/NixOS/nix/runs/2895674848 (NixOS/nix@323e545)
osa1 added a commit to dfinity/motoko that referenced this pull request Jun 29, 2021
This version should include NixOS/nix#3564 and
allow us to increase max-jobs. See #2621 for the previous attempt,
reverted in 8365e73.

The install URL is obtained from https://github.com/NixOS/nix/runs/2895674848 (NixOS/nix@323e545)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants