
nvidia-docker module/package #51733

Merged: 2 commits, Feb 27, 2019

Conversation

@averelld (Contributor) commented Dec 8, 2018

Motivation for this change

See #27999 and NixOS/patchelf/issues/44 for some reasons why this is so messy. The ldconfig patch should go in standard glibc, but it's probably better to do that separately later.

Quick test: nvidia-docker run nvidia/cuda:10.0-runtime nvidia-smi

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@averelld (Contributor Author) commented Feb 8, 2019

@Mic92 Can we get this closer to merging somehow? I don't think any of the linked issues are likely to show activity soon, and the feature is gated behind an extra switch, so the risk of breaking stuff is low.

@averelld (Contributor Author)

@Mic92 Comments addressed, and the branch has been rebased on top of master.

Review comments (now outdated and resolved) were left on:
  • pkgs/applications/virtualization/nvidia-docker/default.nix
  • pkgs/applications/virtualization/nvidia-docker/libnvc.nix
  • nixos/modules/virtualisation/docker.nix
@averelld force-pushed the nvidia-docker branch 2 times, most recently from 08878ea to f8c300b on February 15, 2019 18:53
@averelld (Contributor Author) commented Feb 15, 2019

I realized that there are actually two problems left:

  • This only works if hardware.opengl.driSupport32Bit is true
  • The linuxPackages.nvidia_x11 bin reference and the copying of its binaries only work if the default kernel is in use, and not, for example, with linuxPackages_latest.

How can I fix that? In particular, accessing the actual kernelPackages from a package is probably not possible. So should I patch the binaries in a module and put them in /run or something?

@averelld (Contributor Author)

So I modified the original nvidia_x11 packages to provide usable binaries which are linked into /run, similarly to the opengl libraries. It doesn't look very elegant, but at least this way there's less chance of version incompatibilities.
I also had to fix the patch, because /run/opengl-driver/lib can be a link or a directory depending on configuration.

@averelld (Contributor Author)

@infinisil This could be re-reviewed, or I could rebase and squash the fixups on top of the current master.

@averelld (Contributor Author)

@FRidh You can ignore this, I accidentally pushed a commit in the wrong branch.

@infinisil (Member) left a comment

I have tested with the command you gave and it's working well, thanks! If you want it in the stable release, you can make a backport to 19.03.

@infinisil (Member)

Oh, actually, before I merge you should split the commit into two: one for adding the package, and one for adding the NixOS option. Also format your commit messages according to the Contribution Guidelines, which are also linked from the PR template.

Averell Dalton added 2 commits February 27, 2019 09:56
nvidia_x11 and persistenced were modified to provide binaries which can be
mounted inside a docker container and executed there.

Most ldconfig-based discovery of bundled nvidia libraries is patched out.

ldconfig itself is patched to be able to deal with patchelf'ed libraries;
see https://sourceware.org/bugzilla/show_bug.cgi?id=23964
@averelld (Contributor Author)

Awesome. I'll make a backport PR once this is merged.

@CMCDragonkai (Member)

If I want to run a container requiring a GPU on NixOS with this MR, do I still have to pass these parameters to the container?

  --env LD_LIBRARY_PATH='/lib/cuda' \
  --volume='/nix/store/gc779g2i2046prsxby8k1v92b0nfs2ix-nvidia-x11-390.67-4.17.6/lib:/lib/cuda:ro' \

I currently do this to ensure that the libcuda.so is available inside the container.

@averelld (Contributor Author)

Do you have an example image and command to test this?
I don't usually do that, but I also don't think libcuda is among the libs mounted into the container on other distributions either. If it turns out to be necessary, we might adjust the runner accordingly.

@CMCDragonkai (Member)

Not with me, but when I created images using tensorflowWithCuda on 18.09, I had to run docker run with those parameters to make it work; otherwise TF would complain about mismatched library versions. Basically, the host's libcuda.so had to be mounted into the container. However, I thought using the new nvidia runtime would mean it automatically figures out how to do this, since I didn't see any examples of those parameters being used in the NVIDIA documentation.

@averelld (Contributor Author)

Ok, I checked the list of auto-mounted libs again: https://github.com/NVIDIA/libnvidia-container/blob/773b1954446b73921ce16919248c764ff62d29ad/src/nvc_info.c#L73
So yes, libcuda works without any special parameters. In general at least the "release" tensorflow images should work out of the box, for example like this:
nvidia-docker run tensorflow/tensorflow:1.13.1-gpu-py3-jupyter python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))" or see #51733 (comment) for alternate syntax.

@CMCDragonkai (Member)

@averelld I'll be trying that right now. Although I want to know: does enableNvidia = true; mean that the default runtime for ALL docker containers (even ones that are not using GPUs) will be nvidia's runtime? Is this recommended? My reading is that nvidia's runtime is actually just a "modified" version of runc, whatever that means.

@CMCDragonkai (Member) commented Mar 29, 2019

I just tried it now:

nvidia-docker run -it --rm \
  -p 0.0.0.0:55555:55555 \
  --expose 55555 \
  --volume='/tmp/image-classifier-zo8qv901/output:/settings:ro' \
  --env INVERTERS_API_HOST='0.0.0.0' \
  --env INVERTERS_API_PORT='55555' \
  --env INVERTERS_API_VERSION=0 \
  --env INVERTERS_API_WEIGHTS='/settings/weights.h5' \
  --env INVERTERS_API_CLASSES='/settings/classes.csv' \
  --env INVERTERS_API_STATS='/settings/images_redirected.json:/settings/thresholded.json:/settings/balanced.json:/settings/augmented.json' \
  --env INVERTERS_API_HISTORY='/settings/history.json' \
  --env INVERTERS_API_PARAMETERS='/settings/parameters.json' \
  --env NVIDIA_DRIVER_CAPABILITIES='compute,utility' \
  --env NVIDIA_VISIBLE_DEVICES=all \
  image-classifier-gpu:0gldxpli7c7x4rp1drknb56hsdiggwhx \
  /bin/image-classifier-server -v

Using TensorFlow backend.
2019-03-29 07:01:23.309784: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-29 07:01:23.315024: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2019-03-29 07:01:23.315063: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: 303eb4ca0dd6
2019-03-29 07:01:23.315078: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 303eb4ca0dd6
2019-03-29 07:01:23.315166: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 390.87.0
2019-03-29 07:01:23.315220: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 415.27.0
2019-03-29 07:01:23.315237: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 415.27.0 does not match DSO version 390.87.0 -- cannot find working devices in this configuration
Traceback (most recent call last):
  File "/nix/store/3x44q4zrqqhkb2bd4zvrdijzzgic9azq-image-classifier-0.0.1/bin/.image-classifier-server-wrapped", line 15, in <module>
    from image_classifier_api import app
  File "/nix/store/3x44q4zrqqhkb2bd4zvrdijzzgic9azq-image-classifier-0.0.1/lib/python3.6/site-packages/image_classifier_api/__init__.py", line 21, in <module>
    os.environ.get(env_prefix + 'GPU_FULL', 'false') == 'true')
  File "/nix/store/3x44q4zrqqhkb2bd4zvrdijzzgic9azq-image-classifier-0.0.1/lib/python3.6/site-packages/image_classifier/utils.py", line 33, in setup_gpu
    raise OSError('No GPUs available!')
OSError: No GPUs available!

It seems that there's still a problem, as that DSO error is exactly what I got before when I didn't mount the host's cuda driver into the container.

I had to add these in:

  --volume='/nix/store/5lp4yw24xphjv4mafph7x61skdwcrxc4-nvidia-x11-415.27-4.19.31/lib:/lib/cuda:ro' \
  --env LD_LIBRARY_PATH='/lib/cuda' \

To ensure that the right libcuda.so gets mounted into the container. It looks like the nvidia runtime does figure out how to mount the correct devices, but then doesn't get the right libcuda.so.

@CMCDragonkai (Member)

Just thinking about this... this might be a nix-specific issue, as the image that I'm trying to run was built with the dockerTools.buildImage function, which packages libcuda.so into the container. Perhaps the nvidia runtime should mount the host's libcuda.so and set the correct LD_LIBRARY_PATH to avoid this problem?

@averelld (Contributor Author)

Regarding the runtime: it is only selected when using the nvidia-docker wrapper or when requested explicitly. The enable flag just prepares mountable libraries such as libcuda.
You can check the mounts in your container to see where libcuda ends up. Much of libnvidia-container relies on ldconfig, which doesn't work with nix libraries, but adjusting LD_LIBRARY_PATH might be enough.
In the case of the error above, you can see it found an old libcuda that probably wasn't mounted from the host, so there might just be an older version baked into the image somewhere.
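For reference, a rough sketch of the host configuration this relies on (this assumes the option lives under virtualisation.docker, as this PR's docker.nix changes suggest; treat it as an illustration rather than a definitive setup):

# configuration.nix (sketch)
{ config, pkgs, ... }:
{
  virtualisation.docker.enable = true;

  # Prepares the mountable nvidia libraries/binaries and the nvidia-docker
  # wrapper; containers only use the nvidia runtime when started via
  # `nvidia-docker run` or `docker run --runtime=nvidia`.
  virtualisation.docker.enableNvidia = true;

  # Currently required, per the limitation noted earlier in this thread.
  hardware.opengl.driSupport32Bit = true;
}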

@CMCDragonkai (Member)

Yes, it is definitely the older version baked into the image; that's how dockerTools.buildImage works, it packages up the nix closure into the container. However, you seem to be saying that the nvidia runtime would somehow mount the right libcuda.so. In that case, if I knew where it is mounted inside the container (or how to access it), I wouldn't need to separately mount the correct libcuda.so from the host.

@averelld (Contributor Author)

It does, yes, along with a selection of nvidia libs from /run/opengl-driver{,-32}/lib/ and binaries like nvidia-smi. It could be here: https://github.com/NVIDIA/libnvidia-container/blob/e3a2035da5a44b8a83d9568b91a8a0b542ee15d5/src/common.h#L36 but you can just do mount | grep libcuda in the container to find the location. The runtime will also add symlinks, like so:

root@4e36fb6193d9:/# ll /usr/lib/x86_64-linux-gnu/libcuda*
lrwxrwxrwx 1 root root       12 Mar 29 10:40 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1*
lrwxrwxrwx 1 root root       17 Mar 29 10:40 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.418.43*
-rw-r--r-- 1 root root 15623896 Sep 12  2018 /usr/lib/x86_64-linux-gnu/libcuda.so.410.48
-r-xr-xr-x 1 root root 16033504 Jan  1  1970 /usr/lib/x86_64-linux-gnu/libcuda.so.418.43*

The ones in /run/ are the libs you would mount anyway; the ones in the nix store could be the wrong version or compiled for the wrong kernel.

@CMCDragonkai (Member)

I found it. The NVIDIA docker on NixOS puts libcuda.so into /usr/lib64.

When I used bash to enter my container, the only place that had libraries was /usr/lib64; there was nothing like /usr/lib/x86_64-linux-gnu, no /usr/lib, and no /run either. I think at a minimum the NVIDIA docker integration only produces /usr/lib64. I wonder how your container has /run/opengl-driver... and /usr/lib/x86_64-linux-gnu. Note that my container image is built with dockerTools.buildImage, which builds a container from scratch, not from an OS distro.

@CMCDragonkai (Member)

@averelld The above solution now works for my tensorflow container. However, I wanted to do a little test on a basic, trivial container to see what the nvidia-docker integration does step by step. I ended up hitting an unexpected error.

# first I build a trivial container containing only bash and coreutils
docker load --input "$(
  nix-build \
    --no-out-link \
    -E 'with import <nixpkgs> {}; dockerTools.buildImage { name = "bash"; tag = "latest"; contents = [ bash coreutils ]; }'
)"

# then I run it with --runtime=nvidia (same result with nvidia-docker ...)
docker run --runtime=nvidia -it --rm \
  --env LD_LIBRARY_PATH='/usr/lib64' \
  --env NVIDIA_DRIVER_CAPABILITIES='compute' \
  --env NVIDIA_VISIBLE_DEVICES=all \
  bash:latest \
  /bin/bash

The result unfortunately is this error:

/nix/store/5lc85iln575fb8xwaxlxmp9z8jsd321q-docker-18.09.2/libexec/docker/docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/nix/store/rzzc02akgdkq3fl94vmll0fqyg92195s-nvidia-docker-2.0.3/bin/nvidia-container-cli --load-kmods --ldcache=/tmp/ld.so.cache configure --ldconfig=@/nix/store/s9gx039ip2kycam82hzfy1i68kl906a1-glibc-2.27-bin/bin/ldconfig --device=all --compute --pid=15632 /var/lib/docker/devicemapper/mnt/d47e4b7194eac442d6ff4c232301466ba8bd432ff2e7f7d4580881c12371744f/rootfs]\\\\nnvidia-container-cli: ldcache error: process /nix/store/s9gx039ip2kycam82hzfy1i68kl906a1-glibc-2.27-bin/bin/ldconfig failed with error code: 1\\\\n\\\"\"": unknown.

@averelld (Contributor Author) commented Apr 1, 2019

Regarding the paths above: those paths are image-dependent, there is unfortunately no fixed "NVIDIA docker on NixOS" path, and the /run/opengl... paths are host paths from which the libraries are mounted.

The run command above apparently fails because there is no /tmp, which ldconfig writes to. If you add an appropriate mkdir to the buildImage call, for example via runAsRoot, it works.
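For illustration, a minimal sketch of the same bash/coreutils image with /tmp created via runAsRoot (untested here, so adjust as needed):

# bash-image.nix (sketch): the trivial image from above, plus a world-writable
# /tmp so the prestart hook's ldconfig call has somewhere to write its cache.
with import <nixpkgs> {};
dockerTools.buildImage {
  name = "bash";
  tag = "latest";
  contents = [ bash coreutils ];
  runAsRoot = ''
    #!${runtimeShell}
    mkdir -m 1777 -p /tmp
  '';
}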

@CMCDragonkai (Member)

Requires /tmp, got it.

@tbenst (Contributor) commented Jan 25, 2020

@CMCDragonkai do you have an example of using buildImage so that it works with nvidia-docker on NixOS? I just tried building

buildPytorch.nix:

let
  pkgs = import ./nixpkgs.nix;
  python-env = pkgs.python3.buildEnv.override {
    extraLibs = with pkgs.python3Packages; [
      pytorch
    ];
  };

in
pkgs.dockerTools.buildImage {
  name = "pytorch"; 
  contents = python-env; 
  runAsRoot = '' 
    #!${pkgs.runtimeShell}
    mkdir -p /data
  '';
  config = {
    Cmd = [ "/bin/python" ];
    WorkingDir = "/data";
    Volumes = {
      "/data" = {};
    };
  };
}

but no GPU is available when I run it.
other files: https://gist.github.com/tbenst/a2ceffd8f5b2e5c646b4ca3516e7db60
