
nvidia-docker module/package #51733

Merged: 2 commits, Feb 27, 2019

Conversation

@averelld (Contributor) commented Dec 8, 2018

Motivation for this change

See #27999 and NixOS/patchelf/issues/44 for some reasons why this is so messy. The ldconfig patch should go in standard glibc, but it's probably better to do that separately later.

Quick test: nvidia-docker run nvidia/cuda:10.0-runtime nvidia-smi

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@averelld (Contributor Author) commented Feb 8, 2019

@Mic92 Can we get this closer to merging somehow? I don't think any of the linked issues are likely to show activity soon, and the feature is gated behind an extra switch, so the risk of breaking stuff is low.

@averelld (Contributor Author)

@Mic92 Comments addressed, and the branch has been rebased on top of master.

Review comments (now outdated and resolved) were left on:
  • pkgs/applications/virtualization/nvidia-docker/default.nix
  • pkgs/applications/virtualization/nvidia-docker/libnvc.nix
  • nixos/modules/virtualisation/docker.nix
@averelld force-pushed the nvidia-docker branch 2 times, most recently from 08878ea to f8c300b on February 15, 2019 18:53
@averelld (Contributor Author) commented Feb 15, 2019

I realized that there are actually two problems left:

  • This only works if hardware.opengl.driSupport32Bit is true
  • The linuxPackages.nvidia_x11 bin reference and the copying of its binaries only work if the default kernel is in use, and not, for example, with linuxPackages_latest.

How can I fix that? In particular, accessing the actual kernelPackages from a package is probably not possible. So should I patch the binaries in a module and put them in /run or something?

@averelld (Contributor Author)

So I modified the original nvidia_x11 packages to provide usable binaries which are linked into /run, similarly to the opengl libraries. It doesn't look very elegant, but at least this way there's less chance of version incompatibilities.
I also had to fix the patch, because /run/opengl-driver/lib can be a link or a directory depending on configuration.

@averelld (Contributor Author)

@infinisil This could be re-reviewed, or I could rebase and squash the fixups on top of the current master.

@averelld (Contributor Author)

@FRidh You can ignore this, I accidentally pushed a commit in the wrong branch.

@infinisil (Member) left a comment

I have tested with the command you gave and it's working well, thanks! If you want it in the stable release, you can make a backport to 19.03.

@infinisil (Member)

Oh, actually, before I merge you should split the commit into two: one for adding the package, and one for adding the NixOS option. Also format your commit messages according to the Contribution Guidelines, which are also linked from the PR template.

Averell Dalton added 2 commits February 27, 2019 09:56
nvidia_x11 and persistenced were modified to provide binaries which can be
mounted inside a docker container and executed there.

Most ldconfig-based discovery of bundled nvidia libraries is patched out.

ldconfig itself is patched to be able to deal with patchelf'ed libraries;
see https://sourceware.org/bugzilla/show_bug.cgi?id=23964
@averelld (Contributor Author)

Awesome. I'll make a backport PR once this is merged.

@CMCDragonkai (Member)

If I want to run a container requiring a GPU on NixOS with this MR, do I still have to pass these parameters to the container?

  --env LD_LIBRARY_PATH='/lib/cuda' \
  --volume='/nix/store/gc779g2i2046prsxby8k1v92b0nfs2ix-nvidia-x11-390.67-4.17.6/lib:/lib/cuda:ro' \

I currently do this to ensure that the libcuda.so is available inside the container.

@averelld (Contributor Author)

Do you have an example image and command to test this?
I don't usually do that, but I also don't think libcuda is among the libs mounted into the container on other distributions either. If it turns out to be necessary, we might adjust the runner accordingly.

@CMCDragonkai (Member)

Not with me, but when I created images using tensorflowWithCuda on 18.09, I had to run docker run with those parameters to make it work; otherwise TF would complain about mismatched library versions. Basically, the host's libcuda.so had to be mounted into the container. However, I thought using the new nvidia runtime would mean it automatically figures out how to do this, since I didn't see any examples of those parameters being used in the NVIDIA documentation.

@averelld (Contributor Author)

Ok, I checked the list of auto-mounted libs again: https://github.com/NVIDIA/libnvidia-container/blob/773b1954446b73921ce16919248c764ff62d29ad/src/nvc_info.c#L73
So yes, libcuda works without any special parameters. In general at least the "release" tensorflow images should work out of the box, for example like this:
nvidia-docker run tensorflow/tensorflow:1.13.1-gpu-py3-jupyter python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))" or see #51733 (comment) for alternate syntax.

@CMCDragonkai (Member)

@averelld I'll be trying that right now. Although I want to know: does enableNvidia = true; mean that the default runtime for ALL docker containers (even ones that are not using GPUs) will be nvidia's runtime? Is this recommended? My reading is that nvidia's runtime is actually just a "modified" version of runc, whatever that means.

@CMCDragonkai (Member) commented Mar 29, 2019

I just tried it now:

nvidia-docker run -it --rm \
  -p 0.0.0.0:55555:55555 \
  --expose 55555 \
  --volume='/tmp/image-classifier-zo8qv901/output:/settings:ro' \
  --env INVERTERS_API_HOST='0.0.0.0' \
  --env INVERTERS_API_PORT='55555' \
  --env INVERTERS_API_VERSION=0 \
  --env INVERTERS_API_WEIGHTS='/settings/weights.h5' \
  --env INVERTERS_API_CLASSES='/settings/classes.csv' \
  --env INVERTERS_API_STATS='/settings/images_redirected.json:/settings/thresholded.json:/settings/balanced.json:/settings/augmented.json' \
  --env INVERTERS_API_HISTORY='/settings/history.json' \
  --env INVERTERS_API_PARAMETERS='/settings/parameters.json' \
  --env NVIDIA_DRIVER_CAPABILITIES='compute,utility' \
  --env NVIDIA_VISIBLE_DEVICES=all \
  image-classifier-gpu:0gldxpli7c7x4rp1drknb56hsdiggwhx \
  /bin/image-classifier-server -v

Using TensorFlow backend.
2019-03-29 07:01:23.309784: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-29 07:01:23.315024: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2019-03-29 07:01:23.315063: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: 303eb4ca0dd6
2019-03-29 07:01:23.315078: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 303eb4ca0dd6
2019-03-29 07:01:23.315166: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 390.87.0
2019-03-29 07:01:23.315220: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 415.27.0
2019-03-29 07:01:23.315237: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 415.27.0 does not match DSO version 390.87.0 -- cannot find working devices in this configuration
Traceback (most recent call last):
  File "/nix/store/3x44q4zrqqhkb2bd4zvrdijzzgic9azq-image-classifier-0.0.1/bin/.image-classifier-server-wrapped", line 15, in <module>
    from image_classifier_api import app
  File "/nix/store/3x44q4zrqqhkb2bd4zvrdijzzgic9azq-image-classifier-0.0.1/lib/python3.6/site-packages/image_classifier_api/__init__.py", line 21, in <module>
    os.environ.get(env_prefix + 'GPU_FULL', 'false') == 'true')
  File "/nix/store/3x44q4zrqqhkb2bd4zvrdijzzgic9azq-image-classifier-0.0.1/lib/python3.6/site-packages/image_classifier/utils.py", line 33, in setup_gpu
    raise OSError('No GPUs available!')
OSError: No GPUs available!

It seems that there's still a problem, as that DSO error is exactly what I got before when I didn't mount the host's cuda driver into the container.

I had to add these in:

  --volume='/nix/store/5lp4yw24xphjv4mafph7x61skdwcrxc4-nvidia-x11-415.27-4.19.31/lib:/lib/cuda:ro' \
  --env LD_LIBRARY_PATH='/lib/cuda' \

To ensure that the right libcuda.so gets mounted into the container. It looks like the nvidia runtime does figure out how to mount the correct devices, but then doesn't get the right libcuda.so.

@CMCDragonkai (Member)

Just thinking about this... this might be a nix-specific issue, as the image that I'm trying to run was built with the dockerTools.buildImage function, which packages libcuda.so into the container. Perhaps the nvidia runtime should mount the host's libcuda.so and set the correct LD_LIBRARY_PATH to avoid this problem?

@averelld (Contributor Author)

Regarding the runtime: it is only selected when using the nvidia-docker wrapper or when requested explicitly. The enable flag just prepares mountable libraries such as libcuda.
You can check the mounts in your container to see where libcuda ends up. Much of libnvidia-container relies on ldconfig, which doesn't work with nix libraries, but adjusting LD_LIBRARY_PATH might be enough.
In the case of the error above, you can see it found an old libcuda that probably wasn't mounted from the host, so there might just be an older version baked into the image somewhere.
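For reference, a rough sketch of the host configuration this relies on (this assumes the option lives under virtualisation.docker, as this PR's docker.nix changes suggest; treat it as an illustration rather than a definitive setup):

# configuration.nix (sketch)
{ config, pkgs, ... }:
{
  virtualisation.docker.enable = true;

  # Prepares the mountable nvidia libraries/binaries and the nvidia-docker
  # wrapper; containers only use the nvidia runtime when started via
  # `nvidia-docker run` or `docker run --runtime=nvidia`.
  virtualisation.docker.enableNvidia = true;

  # Currently required, per the limitation noted earlier in this thread.
  hardware.opengl.driSupport32Bit = true;
}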

@CMCDragonkai (Member)

Yes, it is definitely the older version baked into the image; that's how dockerTools.buildImage works, it packages up the nix closure into the container. However, you seem to be saying that the nvidia runtime would somehow mount the right libcuda.so. In that case, if I knew where it is mounted inside the container (or how to access it), I wouldn't need to separately mount the correct libcuda.so from the host.

@averelld (Contributor Author)

It does, yes, along with a selection of nvidia libs from /run/opengl-driver{,-32}/lib/ and binaries like nvidia-smi. It could be here: https://github.com/NVIDIA/libnvidia-container/blob/e3a2035da5a44b8a83d9568b91a8a0b542ee15d5/src/common.h#L36 but you can just do mount | grep libcuda in the container to find the location. The runtime will also add symlinks, like so:

root@4e36fb6193d9:/# ll /usr/lib/x86_64-linux-gnu/libcuda*
lrwxrwxrwx 1 root root       12 Mar 29 10:40 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1*
lrwxrwxrwx 1 root root       17 Mar 29 10:40 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.418.43*
-rw-r--r-- 1 root root 15623896 Sep 12  2018 /usr/lib/x86_64-linux-gnu/libcuda.so.410.48
-r-xr-xr-x 1 root root 16033504 Jan  1  1970 /usr/lib/x86_64-linux-gnu/libcuda.so.418.43*

The ones in /run/ are the libs you would mount anyway; the ones in the nix store could be the wrong version or compiled for the wrong kernel.

@CMCDragonkai (Member)

I found it. The NVIDIA docker on NixOS puts libcuda.so into /usr/lib64.

When I used bash to enter my container, the only place that had libraries was /usr/lib64; there was nothing like /usr/lib/x86_64-linux-gnu, no /usr/lib, and no /run either. I think at a minimum the NVIDIA docker integration only produces /usr/lib64. I wonder how your container has /run/opengl-driver... and /usr/lib/x86_64-linux-gnu. Note that my container image is built with dockerTools.buildImage, which builds a container from scratch, not from an OS distro.

@CMCDragonkai (Member)

@averelld The above solution now works for my tensorflow container. However, I wanted to do a little test on a basic, trivial container to see what the nvidia-docker integration does step by step. I ended up hitting an unexpected error.

# first I build a trivial container containing only bash and coreutils
docker load --input "$(
  nix-build \
    --no-out-link \
    -E 'with import <nixpkgs> {}; dockerTools.buildImage { name = "bash"; tag = "latest"; contents = [ bash coreutils ]; }'
)"

# then I run it with --runtime=nvidia (same result with nvidia-docker ...)
docker run --runtime=nvidia -it --rm \
  --env LD_LIBRARY_PATH='/usr/lib64' \
  --env NVIDIA_DRIVER_CAPABILITIES='compute' \
  --env NVIDIA_VISIBLE_DEVICES=all \
  bash:latest \
  /bin/bash

The result unfortunately is this error:

/nix/store/5lc85iln575fb8xwaxlxmp9z8jsd321q-docker-18.09.2/libexec/docker/docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/nix/store/rzzc02akgdkq3fl94vmll0fqyg92195s-nvidia-docker-2.0.3/bin/nvidia-container-cli --load-kmods --ldcache=/tmp/ld.so.cache configure --ldconfig=@/nix/store/s9gx039ip2kycam82hzfy1i68kl906a1-glibc-2.27-bin/bin/ldconfig --device=all --compute --pid=15632 /var/lib/docker/devicemapper/mnt/d47e4b7194eac442d6ff4c232301466ba8bd432ff2e7f7d4580881c12371744f/rootfs]\\\\nnvidia-container-cli: ldcache error: process /nix/store/s9gx039ip2kycam82hzfy1i68kl906a1-glibc-2.27-bin/bin/ldconfig failed with error code: 1\\\\n\\\"\"": unknown.

@averelld (Contributor Author) commented Apr 1, 2019

Regarding the paths above: those paths are image-dependent, there is unfortunately no fixed "NVIDIA docker on NixOS" path, and the /run/opengl... paths are host paths from which the libraries are mounted.

The run command above apparently fails because there is no /tmp, which ldconfig writes to. If you add an appropriate mkdir to the buildImage call, for example via runAsRoot, it works.
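For illustration, a minimal sketch of the same bash/coreutils image with /tmp created via runAsRoot (untested here, so adjust as needed):

# bash-image.nix (sketch): the trivial image from above, plus a world-writable
# /tmp so the prestart hook's ldconfig call has somewhere to write its cache.
with import <nixpkgs> {};
dockerTools.buildImage {
  name = "bash";
  tag = "latest";
  contents = [ bash coreutils ];
  runAsRoot = ''
    #!${runtimeShell}
    mkdir -m 1777 -p /tmp
  '';
}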

@CMCDragonkai (Member)

Requires /tmp, got it.

@tbenst (Contributor) commented Jan 25, 2020

@CMCDragonkai do you have an example of using buildImage so that it works with nvidia-docker on NixOS? I just tried building

buildPytorch.nix:

let
  pkgs = import ./nixpkgs.nix;
  python-env = pkgs.python3.buildEnv.override {
    extraLibs = with pkgs.python3Packages; [
      pytorch
    ];
  };

in
pkgs.dockerTools.buildImage {
  name = "pytorch"; 
  contents = python-env; 
  runAsRoot = '' 
    #!${pkgs.runtimeShell}
    mkdir -p /data
  '';
  config = {
    Cmd = [ "/bin/python" ];
    WorkingDir = "/data";
    Volumes = {
      "/data" = {};
    };
  };
}

but no GPU is available when I run it.
other files: https://gist.github.com/tbenst/a2ceffd8f5b2e5c646b4ca3516e7db60
