nixos/slurm: set slurmd KillMode and add extraConfigPaths #50862
By default, systemd kills the whole cgroup of a service. For slurmd this means that all running jobs get killed as well whenever the configuration is updated (and activated). To avoid this behaviour we set "KillMode=process" so that only slurmd itself is killed on reload. This is also how slurm configures its systemd service. See: https://bugs.schedmd.com/show_bug.cgi?id=2095#c24 SchedMD/slurm@508f866
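As an illustration of the setting described above, a minimal sketch of how the same unit option can be expressed through the generic NixOS `systemd.services.<name>.serviceConfig` interface. It is written as a standalone module under the assumption that the unit is named `slurmd`; it is not necessarily the exact form used in this PR's diff.

```nix
# Sketch only: assumes the slurmd unit is called "slurmd".
{ ... }:
{
  systemd.services.slurmd.serviceConfig = {
    # Only the main slurmd process is signalled on stop/restart;
    # slurmstepd (and with it the running jobs) stays alive.
    KillMode = "process";
  };
}
```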
@GrahamcOfBorg test slurm
No attempt on aarch64-linux. The following builds were skipped because they don't evaluate on aarch64-linux: tests.slurm
No attempt on x86_64-linux. The following builds were skipped because they don't evaluate on x86_64-linux: tests.slurm
@GrahamcOfBorg test slurm
No attempt on aarch64-linux. The following builds were skipped because they don't evaluate on aarch64-linux: tests.slurm
No attempt on x86_64-linux. The following builds were skipped because they don't evaluate on x86_64-linux: tests.slurm
@GrahamcOfBorg build nixosTests.slurm this is a workaround
No attempt on x86_64-darwin. The following builds were skipped because they don't evaluate on x86_64-darwin: nixosTests.slurm
No attempt on aarch64-linux. The following builds were skipped because they don't evaluate on aarch64-linux: nixosTests.slurm
No attempt on x86_64-linux. The following builds were skipped because they don't evaluate on x86_64-linux: nixosTests.slurm
The evaluation error you see is now legitimate the same test evaluates locally.
b307969 to 25af518
Sorry, looks like I forgot to push the latest version.
@GrahamcOfBorg build nixosTests.slurm
Failure on x86_64-linux. Attempted: nixosTests.slurm
Unexpected error: unexpected build failure on aarch64-linux. Attempted: nixosTests.slurm
No attempt on x86_64-darwin. The following builds were skipped because they don't evaluate on x86_64-darwin: nixosTests.slurm
This seems to be a timing issue now. The test passes just fine locally.
No attempt on x86_64-darwin. The following builds were skipped because they don't evaluate on x86_64-darwin: nixosTests.slurm
Failure on x86_64-linux. Attempted: nixosTests.slurm
@GrahamcOfBorg build nixosTests.slurm
No attempt on x86_64-darwin. The following builds were skipped because they don't evaluate on x86_64-darwin: nixosTests.slurm
Success on aarch64-linux. Attempted: nixosTests.slurm
Failure on x86_64-linux. Attempted: nixosTests.slurm
Success on aarch64-linux. Attempted: nixosTests.slurm
I am not sure how to fix this. The test behaves differently on my local (NixOS) machine.
@GrahamcOfBorg build nixosTests.slurm
some sort of timing issue?
This makes tests more reliable. It seems that waitForUnit(slurmdbd.service) is not sufficient on some systems.
It looks like some weird timing issue. The test runs just fine locally. Let's see if waiting for the TCP port to appear helps.
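The idea of waiting for the TCP port in addition to the unit could look roughly like the following fragment of a NixOS test. This is a hedged sketch, not the code from this PR: the node name `dbd` is hypothetical, and port 6819 is assumed because it is slurmdbd's default `DbdPort`. `waitForUnit` and `waitForOpenPort` are methods of the Perl-based NixOS test driver.

```nix
# Sketch only: node name "dbd" and port 6819 are assumptions.
{
  testScript = ''
    # The unit being active does not guarantee the daemon is accepting connections yet.
    $dbd->waitForUnit("slurmdbd.service");
    # Block until the assumed slurmdbd port is actually open.
    $dbd->waitForOpenPort(6819);
  '';
}
```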
@GrahamcOfBorg build nixosTests.slurm
OK, that worked 🎊. The aarch64 failure is now due to a broken libjpeg-turbo.
If there are no further comments, this PR would be ready to merge.
Motivation for this change
KillMode to process: In the standard configuration all jobs are killed when slurmd is restarted (e.g. when the configuration is updated). This is quite disruptive in a production environment. Setting KillMode=process kills only slurmd but not slurmstepd, and thus not the running job. This is also the standard behavior of the systemd files delivered with slurm.
extraConfigPaths: This allows adding custom config files to the slurm configuration. The config files of all plugins need to be in the same directory as slurm.conf.
Things done
sandbox in nix.conf on non-NixOS)
nixos/tests/slurm.nix succeeds.
nix-shell -p nox --run "nox-review wip"
./result/bin/)
nix path-info -S before and after)
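To illustrate the extraConfigPaths motivation above, a minimal sketch of how such an option might be used. It assumes the option lives at `services.slurm.extraConfigPaths` and accepts a list of store paths whose contents end up next to slurm.conf; the file name `plugin.conf` and its contents are hypothetical placeholders.

```nix
# Sketch only: option path and type are assumptions, plugin.conf is hypothetical.
{ pkgs, ... }:
let
  # A store directory containing a single plugin config file.
  pluginConf = pkgs.writeTextDir "plugin.conf" ''
    # plugin settings would go here
  '';
in
{
  services.slurm.extraConfigPaths = [ pluginConf ];
}
```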