nomad: add optional nvml support #107030

Merged: 1 commit merged into NixOS:master on Jan 3, 2021


cpcloud
Contributor

@cpcloud cpcloud commented Dec 16, 2020

Motivation for this change

This change adds optional NVML support so that Nomad can detect GPU device resources and report them to its scheduler.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

Using this test configuration:

❯ bat -p client.hcl
data_dir = "/var/lib/nomad"

plugin "nvidia-gpu" {
  config {
    enabled = true
  }
}

client {
  enabled = true
}

server {
  enabled = true
  bootstrap_expect = 1
}

then running:

❯ sudo ./result/bin/nomad agent -dev -config=./client.hcl

we see that indeed the plugin is discovered:

==> Nomad agent started! Log data will stream in below:

    2020-12-16T09:19:50.330-0500 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/lib/nomad/plugins
    2020-12-16T09:19:50.331-0500 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/var/lib/nomad/plugins
    2020-12-16T09:19:50.331-0500 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/var/lib/nomad/plugins
    2020-12-16T09:19:50.331-0500 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2020-12-16T09:19:50.331-0500 [INFO]  agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
    2020-12-16T09:19:50.331-0500 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2020-12-16T09:19:50.331-0500 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2020-12-16T09:19:50.331-0500 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2020-12-16T09:19:50.331-0500 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2020-12-16T09:19:50.331-0500 [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0

Running the examples here locally:

Looking at the basics:

❯ ./result/bin/nomad node status f83f7c5d
...

Device Resource Utilization
nvidia/gpu/GeForce RTX 2080 SUPER[GPU-16efc9cd-6886-5c12-52f8-2bef887747d9]  0 / 7979 MiB

...

More detailed info:

❯ ./result/bin/nomad node status -stats f83f7c5d
...

Device Stats
Device              = nvidia/gpu/GeForce RTX 2080 SUPER[GPU-16efc9cd-6886-5c12-52f8-2bef887747d9]
BAR1 buffer state   = 2 / 256 MiB
Decoder utilization = 0 %
ECC L1 errors       = N/A
ECC L2 errors       = N/A
ECC memory errors   = N/A
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 7979 MiB
Memory utilization  = 0 %
Power usage         = 23 / 250 W
Temperature         = 38 C

...

@endocrimes
Member

At first glance this LGTM, thanks! I never got around to adding this to the Nix package because I don't have an NVIDIA GPU in my workstation.

@cpcloud
Contributor Author

cpcloud commented Dec 16, 2020

@endocrimes Thanks for the review! What needs to be done to get this merged?

@cpcloud
Contributor Author

cpcloud commented Dec 23, 2020

@SuperSandro2000 Can this be merged?

@SuperSandro2000
Member

Result of nixpkgs-review pr 107030 run on x86_64-linux

2 packages built:
  • nomad (nomad_0_12)
  • nomad_0_11

@SuperSandro2000
Member

@SuperSandro2000 Can this be merged?

I am not sure whether this is the right approach or whether it breaks things on AMD systems.

@cpcloud
Contributor Author

cpcloud commented Dec 26, 2020

@SuperSandro2000 Can this be merged?

I am not sure whether this is the right approach or whether it breaks things on AMD systems.

Can you elaborate a bit? The test results I show above come from a machine with an AMD CPU.

@SuperSandro2000
Member

Can you elaborate a bit?

I am not confident that this is right and would like someone more familiar with this to take a look.

@cpcloud
Contributor Author

cpcloud commented Dec 29, 2020

@SuperSandro2000 I assumed that @endocrimes's review would be sufficient. I believe they are quite familiar with Nomad.

@SuperSandro2000
Member

This is a semi-automatically executed nixpkgs-review run, which does not build all packages (e.g. lumo, tensorflow or pytorch).
If you find bugs or have suggestions for further things to search or run, please reach out to SuperSandro2000 on IRC.

Result of nixpkgs-review pr 107030 run on x86_64-linux

2 packages built:
  • nomad (nomad_0_12)
  • nomad_0_11

@SuperSandro2000 SuperSandro2000 merged commit 20489e3 into NixOS:master Jan 3, 2021
@wentasah
Contributor

wentasah commented Jan 3, 2021

Hi, I'm running nixpkgs-review on my local commits for a different package that I want to send as a PR and it seems that this pull request breaks things:

error: --- TypeError ---------------------------------------------------------------------------------------------------------------------------------- nix-env
at: (69:16) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/lib/customisation.nix

    68|     let
    69|       result = f origArgs;
      |                ^
    70| 

anonymous function at /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/pkgs/applications/networking/cluster/nomad/generic.nix:1:1 called without required argument 'nvidiaGpuSupport'
------------------------------------------------------------------------- show-trace --------------------------------------------------------------------------
trace: while evaluating 'makeOverridable'
at: (67:24) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/lib/customisation.nix

    66|   */
    67|   makeOverridable = f: origArgs:
      |                        ^
    68|     let

trace: from call site
at: (121:8) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/lib/customisation.nix

   120|       auto = builtins.intersectAttrs (lib.functionArgs f) autoArgs;
   121|     in makeOverridable f (auto // args);
      |        ^
   122| 

trace: while evaluating 'callPackageWith'
at: (117:35) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/lib/customisation.nix

   116|   */
   117|   callPackageWith = autoArgs: fn: args:
      |                                   ^
   118|     let

trace: from call site
at: (3:1) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/pkgs/applications/networking/cluster/nomad/1.0.nix

     2| 
     3| callPackage ./generic.nix {
      | ^
     4|   inherit buildGoPackage;

trace: while evaluating anonymous lambda
at: (1:1) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/pkgs/applications/networking/cluster/nomad/1.0.nix

     1| { callPackage, buildGoPackage }:
      | ^
     2| 

trace: from call site
at: (69:16) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/lib/customisation.nix

    68|     let
    69|       result = f origArgs;
      |                ^
    70| 

trace: while evaluating 'makeOverridable'
at: (67:24) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/lib/customisation.nix

    66|   */
    67|   makeOverridable = f: origArgs:
      |                        ^
    68|     let

trace: from call site
at: (121:8) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/lib/customisation.nix

   120|       auto = builtins.intersectAttrs (lib.functionArgs f) autoArgs;
   121|     in makeOverridable f (auto // args);
      |        ^
   122| 

trace: while evaluating 'callPackageWith'
at: (117:35) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/lib/customisation.nix

   116|   */
   117|   callPackageWith = autoArgs: fn: args:
      |                                   ^
   118|     let

trace: from call site
at: (6140:15) in file: /home/wsh/.cache/nixpkgs-review/rev-27bc58d295719966402a820fdd8f2b2a54bb5b90/nixpkgs/pkgs/top-level/all-packages.nix

  6139|   };
  6140|   nomad_1_0 = callPackage ../applications/networking/cluster/nomad/1.0.nix {
      |               ^
  6141|     buildGoPackage = buildGo115Package;

FRidh added a commit that referenced this pull request Jan 3, 2021
This broke eval.

#107030 (comment)

This reverts commit 20489e3, reversing
changes made to 590feee.
@FRidh
Member

FRidh commented Jan 3, 2021

reverted in 6c9b507.

@cpcloud cpcloud mentioned this pull request Jan 3, 2021