nixos/slurm: add slurmdbd, run daemons as user #49348

Merged
merged 5 commits into from Oct 30, 2018

@markuskowa
Member

markuskowa commented Oct 28, 2018

Motivation for this change

One more step to bring Slurm to its full potential on NixOS. The major changes in this PR are running the slurm controller as a normal user (instead of root) and extending the module with the database daemon.

The options nodeName and partitionName are now lists of strings instead of a single string, reflecting that there can be more than one entry for each of those options.

slurmctld used to dump all its files directly into /var/spool. This has been fixed by setting the default StateSaveLocation to /var/spool/slurmctld.
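
A rough sketch of how the changed options might look in a configuration. The node and partition strings are illustrative values, and controlMachine is assumed from the existing module rather than introduced by this PR:

```nix
{
  services.slurm = {
    # Assumed hostname of the machine running slurmctld.
    controlMachine = "control";

    # Now lists: each entry is expected to end up as its own
    # NodeName= / PartitionName= line in slurm.conf.
    nodeName = [
      "node[1-3] CPUs=1 State=UNKNOWN"
    ];
    partitionName = [
      "debug Nodes=node[1-3] Default=YES MaxTime=INFINITE State=UP"
    ];

    # New option. The value below is the new default and is
    # spelled out only for illustration.
    stateSaveLocation = "/var/spool/slurmctld";
  };
}
```

Existing configurations that set nodeName or partitionName to a plain string would need to wrap the value in a list.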

CC @veprbl @lsix

Things done
  • Run slurmctld and slurmdbd as user 'slurm' by default (new option services.slurm.user).
  • Added service for slurmdbd (new options services.slurm.dbdserver.[enable,dbdHost,extraConfig])
  • Add test for slurmdbd to nixos/tests/slurm.nix
  • Fix StateSaveLocation (new option services.slurm.stateSaveLocation)
  • Make nodeName and partitionName lists of strings
  • Document all incompatible changes in the release notes
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
    nixos/tests/slurm.nix succeeds
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Fits CONTRIBUTING.md.
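
For orientation, a minimal sketch of the new user and dbdserver options listed above. The hostname "dbd" is a placeholder, and the database backend that slurmdbd needs is not shown:

```nix
{
  services.slurm = {
    # User that slurmctld and slurmdbd run as; "slurm" is the new default.
    user = "slurm";

    dbdserver = {
      enable = true;
      # Host running slurmdbd; "dbd" is a placeholder name.
      dbdHost = "dbd";
      # dbdserver.extraConfig would take additional slurmdbd.conf lines;
      # none are added in this sketch.
    };
  };
}
```

On the controller side, slurm.conf would additionally have to point at the database daemon, for example via AccountingStorageType=accounting_storage/slurmdbd and AccountingStorageHost in services.slurm.extraConfig.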

@markuskowa
Member Author

@GrahamcOfBorg test slurm

@GrahamcOfBorg

Success on x86_64-linux (full log)

Attempted: tests.slurm

Partial log

cleaning up
killing node1 (pid 597)
killing node2 (pid 609)
killing submit (pid 621)
killing node3 (pid 632)
killing control (pid 644)
killing dbd (pid 657)
vde_switch: EOF on stdin, cleaning up and exiting
vde_switch: Could not remove ctl dir '/build/vde1.ctl': Directory not empty
/nix/store/dbnh2n7gy2v0wl2mjmvr2yq9vwkaqrdi-vm-test-run-slurm

@GrahamcOfBorg

Success on aarch64-linux (full log)

Attempted: tests.slurm

Partial log

cleaning up
killing submit (pid 631)
killing node1 (pid 643)
killing node2 (pid 657)
killing control (pid 669)
killing dbd (pid 683)
killing node3 (pid 695)
vde_switch: EOF on stdin, cleaning up and exiting
vde_switch: Could not remove ctl dir '/build/vde1.ctl': Directory not empty
/nix/store/x9g9jkkibxfwzkwrairwjfp5v33638rj-vm-test-run-slurm

@eliasp
Member

eliasp commented Oct 28, 2018

Did you consider making use of systemd's DynamicUser=yes capability?
This would mean:

  • no more maintaining a static user through nixos/modules/misc/ids.nix
  • no more "manually" (as in preStart) creating/owning required directories, but having systemd manage them for you

By using DynamicUser=, systemd augments the service's namespace with the required user/group(s) and takes care of making sure the permissions on the defined directories are right; additional mode requirements can be specified as well.
Furthermore, using DynamicUser= also implies ProtectSystem=strict which improves the general security.
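
A minimal sketch of what this could look like for a generic service. The service name example-daemon and the placeholder command are hypothetical; this is not the slurm module:

```nix
{ pkgs, ... }:
{
  # Hypothetical service, for illustration only.
  systemd.services.example-daemon = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      # Placeholder command standing in for a real daemon.
      ExecStart = "${pkgs.coreutils}/bin/sleep infinity";
      # systemd allocates a transient user/group when the unit starts.
      DynamicUser = true;
      # systemd creates the state directory under /var/lib and hands
      # it to the dynamic user, so no preStart chown is needed.
      StateDirectory = "example-daemon";
      StateDirectoryMode = "0750";
    };
  };
}
```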

@markuskowa
Member Author

I am not sure if DynamicUser is an option here. All machines in the cluster communicate via munge authentication. That requires the slurm user to have the same UID on all machines.
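
Roughly, the static approach pins the IDs like this. This is a sketch that assumes the slurm entries this PR adds to ids.nix; the description string is made up:

```nix
{ config, ... }:
{
  users.users.slurm = {
    isSystemUser = true;
    # Fixed UID from nixos/modules/misc/ids.nix, identical on every node.
    uid = config.ids.uids.slurm;
    group = "slurm";
    description = "Slurm daemon user";  # illustrative text
  };

  # Fixed GID, likewise taken from ids.nix.
  users.groups.slurm.gid = config.ids.gids.slurm;
}
```

Because every machine evaluates the same ids.nix, the slurm user gets the same numeric UID across the cluster, which a dynamically allocated user cannot guarantee.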

@eliasp
Member

eliasp commented Oct 28, 2018

Uh, that's quite an ugly design decision on munge's part, but you're right: this might cause some trouble then…

* run as user 'slurm' per default instead of root
* add user/group slurm to ids.nix
* fix default location for the state dir of slurmctld:
  (/var/spool -> /var/spool/slurmctld)
* Update release notes with the above changes
* New options "services.slurm.dbdserver.[enable,config]"
* Add slurmdbd to test slurm.nix
Make the nodeName and partitionName options lists.
There can be more than one partition or set of nodes.

Add changes to release notes
@markuskowa
Member Author

@GrahamcOfBorg test slurm

@GrahamcOfBorg

Success on x86_64-linux (full log)

Attempted: tests.slurm

Partial log

cleaning up
killing node2 (pid 597)
killing control (pid 608)
killing node1 (pid 621)
killing submit (pid 632)
killing node3 (pid 644)
killing dbd (pid 657)
vde_switch: EOF on stdin, cleaning up and exiting
vde_switch: Could not remove ctl dir '/build/vde1.ctl': Directory not empty
/nix/store/awbasqn491mg72y9nwrb230vkycwqwk8-vm-test-run-slurm

@GrahamcOfBorg

Success on aarch64-linux (full log)

Attempted: tests.slurm

Partial log

cleaning up
killing node3 (pid 631)
killing submit (pid 643)
killing dbd (pid 657)
killing node2 (pid 670)
killing node1 (pid 683)
killing control (pid 696)
vde_switch: EOF on stdin, cleaning up and exiting
vde_switch: Could not remove ctl dir '/build/vde1.ctl': Directory not empty
/nix/store/aav6afgw5vz5wqgv58df7i4mr0gdjpi9-vm-test-run-slurm

@xeji xeji merged commit 6efd811 into NixOS:master Oct 30, 2018
@markuskowa markuskowa deleted the mod-slurm-upgrade branch October 31, 2018 07:21