
pytorch: 0.2.0 → 0.3.1 with CUDA and cuDNN #38530

Merged
9 commits merged into NixOS:master on May 4, 2018
Conversation

andersk (Contributor) commented Apr 6, 2018

Motivation for this change

This upgrades PyTorch and adds CUDA and cuDNN support. It’s an updated version of #32438 “pytorch-0.3 with cuda and cudnn” and includes those commits.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option build-use-sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Fits CONTRIBUTING.md.

akamaus and others added 4 commits December 20, 2017 00:17
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Fixes this error:

In file included from /nix/store/gv7w3c71jg627cpcff04yi6kwzpzjyap-cudatoolkit-9.1.85.1/include/host_config.h:50:0,
                 from /nix/store/gv7w3c71jg627cpcff04yi6kwzpzjyap-cudatoolkit-9.1.85.1/include/cuda_runtime.h:78,
                 from <command-line>:0:
/nix/store/gv7w3c71jg627cpcff04yi6kwzpzjyap-cudatoolkit-9.1.85.1/include/crt/host_config.h:121:2: error: #error -- unsupported GNU version! gcc versions later than 6 are not supported!
 #error -- unsupported GNU version! gcc versions later than 6 are not supported!
  ^~~~~
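A hedged sketch of the kind of fix this commit likely applies (the wiring here is an assumption, not a quote from the actual commit): CUDA 9.1's nvcc rejects GCC 7 and later, so the derivation can be built with a GCC 6 toolchain instead.

```nix
# Hypothetical sketch: build with an nvcc-compatible GCC.
# `gcc6` and `overrideCC` exist in nixpkgs; applying them to the pytorch
# derivation this way is an assumption about the fix, not the actual commit.
{ pkgs ? import <nixpkgs> {} }:

pkgs.python36Packages.pytorch.override {
  # CUDA 9.1's host_config.h errors out on GCC > 6
  stdenv = pkgs.overrideCC pkgs.stdenv pkgs.gcc6;
}
```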

Signed-off-by: Anders Kaseorg <andersk@mit.edu>
andersk (Contributor, author) commented Apr 6, 2018

We may yet have to disable tests even for non-CUDA builds, because the “distributed tests for the TCP backend with file init_method” seem to fail in the sandbox with errors like this.

Running distributed tests for the TCP backend with file init_method
Process Process-2:
Traceback (most recent call last):
  File "/nix/store/lp7x6ziq5b2zihgad0i8a28xxvz5vnrr-python3-3.6.4/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/nix/store/lp7x6ziq5b2zihgad0i8a28xxvz5vnrr-python3-3.6.4/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "./test_distributed.py", line 551, in _run
    getattr(self, self.id().split(".")[2])()
  File "./test_distributed.py", line 510, in wrapper
    fn(self)
  File "./test_distributed.py", line 462, in test_all_gather
    group, group_id, rank = self._init_global_test()
  File "./test_distributed.py", line 111, in _init_global_test
    group = [i for i in range(0, dist.get_world_size())]
  File "/nix/store/xc4ipwb39f4gwblrkkanb0cpmqxin6zm-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/distributed/__init__.py", line 120, in get_world_size
    assert torch.distributed._initialized
AssertionError
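If the sandbox failure can't be fixed, one option is to patch the distributed tests out before the check phase. This is only a sketch, under the assumption that the tests are driven from test/run_test.sh (as a later build log in this thread suggests); the exact sed expression is a guess.

```nix
# Hypothetical sketch: drop the distributed tests from the test runner.
# The script name comes from the build log; the matching pattern is assumed.
postPatch = ''
  sed -i '/test_distributed/d' test/run_test.sh
'';
```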

andersk (Contributor, author) commented Apr 7, 2018

I patched out the test in question. Also, I found that the stub libcuda.so from cudatoolkit is sufficient to get the remaining tests running in the cudaSupport case, so I turned that back on.

};

pytorchWithoutCuda = self.pytorch.override {
cudaSupport = false;
Contributor:

I tried to build the package on a system without CUDA and found that setting cudaSupport = false is actually not enough. The CUDA library dependency was still being built that way; it disappeared when I set cudatoolkit and cudnn to null explicitly.
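A minimal sketch of what that override would look like (assuming the derivation takes `cudatoolkit` and `cudnn` as inputs, as this comment suggests):

```nix
pytorchWithoutCuda = self.pytorch.override {
  cudaSupport = false;
  # Setting the flag alone still pulled in the CUDA dependency;
  # nulling the inputs explicitly made it disappear.
  cudatoolkit = null;
  cudnn = null;
};
```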

Contributor Author:

I’d forgotten to let-bind cudatoolkit_joined, cudaStub, and cudaStubEnv. Does this work better now?

Contributor:

@andersk looks like I messed with the history. I actually tried my old version and stumbled upon the issue mentioned here: #32438 (comment)

Still, your commit 7a851ad fails for me with this message:

nix-shell -p python36Packages.pytorch --show-trace                                                                                  
error: while evaluating the attribute ‘buildInputs’ of the derivation ‘shell’ at /opt/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:98:11:
while evaluating the attribute ‘patches’ of the derivation ‘python3.6-pytorch-0.3.1’ at /opt/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:98:11:
while evaluating anonymous function at /opt/nixpkgs/pkgs/build-support/fetchpatch/default.nix:8:1, called from /opt/nixpkgs/pkgs/development/python-modules/pytorch/default.nix:41:6:
anonymous function at /opt/nixpkgs/pkgs/build-support/fetchurl/default.nix:38:1 called with unexpected argument ‘extraPrefix’, at /opt/nixpkgs/pkgs/build-support/fetchpatch/default.nix:10:1

Contributor Author:

You need to merge with master or release-18.03 or something.

Contributor:

I merged your branch onto a fresh master at e1dee4e as you suggested. It builds for me now, but fails during testing. I'm running this on an ancient P6200 processor; that's probably the reason.

$ nix-shell -p python36Packages.pytorchWithoutCuda
<...>
installing
/tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source/dist /tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source
Processing ./torch-0.3.1b0-cp36-cp36m-linux_x86_64.whl
Requirement already satisfied: numpy in /nix/store/q61pr9l2l31hqrkan0m4zd0wqv85cmnp-python3.6-numpy-1.14.1/lib/python3.6/site-packages (from torch==0.3.1b0)
Requirement already satisfied: pyyaml in /nix/store/rp8vvvf15fni2dxn1p46hv6ly3k0xca3-python3.6-PyYAML-3.12/lib/python3.6/site-packages (from torch==0.3.1b0)
Installing collected packages: torch
Successfully installed torch-0.3.1b0
/tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source
post-installation fixup
shrinking RPATHs of ELF executables and libraries in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/_thnn/_THNN.cpython-36m-x86_64-linux-gnu.so
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libTHNN.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libshm.so
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libTHS.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/torch_shm_manager
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libATen.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libTH.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/_dl.cpython-36m-x86_64-linux-gnu.so
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
strip is /nix/store/fzcs0fn6bb04m82frhlb78nc03ny3w55-binutils-2.28.1/bin/strip
stripping (with command strip and flags -S) in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib 
patching script interpreter paths in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1
checking for references to /tmp/nix-build-python3.6-pytorch-0.3.1.drv-0 in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1...
running install tests
/tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source/test /tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source
Running JIT tests
s.......s.s.s..s....s.....s......ss.........
----------------------------------------------------------------------
Ran 44 tests in 0.285s

OK (skipped=9)
Running torch tests
...test/run_test.sh: line 27: 29365 Illegal instruction     (core dumped) $PYCMD test_torch.py $@
builder for ‘/nix/store/4bs5qi4jpnslv3aqr8m8ffp19xy1bi9d-python3.6-pytorch-0.3.1.drv’ failed with exit code 132

Contributor:

OK. Builds with and without CUDA on more recent hardware.

Contributor Author:

What does this section of your build log look like on your P6200?

-- Performing Test C_HAS_SSE1_1
-- Performing Test C_HAS_SSE1_1 - Success
-- Performing Test C_HAS_SSE2_1
-- Performing Test C_HAS_SSE2_1 - Success
-- Performing Test C_HAS_SSE3_1
-- Performing Test C_HAS_SSE3_1 - Failed
-- Performing Test C_HAS_SSE3_2
-- Performing Test C_HAS_SSE3_2 - Success
-- Performing Test C_HAS_SSE4_1_1
-- Performing Test C_HAS_SSE4_1_1 - Failed
-- Performing Test C_HAS_SSE4_1_2
-- Performing Test C_HAS_SSE4_1_2 - Success
-- Performing Test C_HAS_SSE4_2_1
-- Performing Test C_HAS_SSE4_2_1 - Failed
-- Performing Test C_HAS_SSE4_2_2
-- Performing Test C_HAS_SSE4_2_2 - Success
-- Performing Test C_HAS_AVX_1
-- Performing Test C_HAS_AVX_1 - Failed
-- Performing Test C_HAS_AVX_2
-- Performing Test C_HAS_AVX_2 - Success
-- Performing Test C_HAS_AVX2_1
-- Performing Test C_HAS_AVX2_1 - Failed
-- Performing Test C_HAS_AVX2_2
-- Performing Test C_HAS_AVX2_2 - Success
-- Performing Test CXX_HAS_SSE1_1
-- Performing Test CXX_HAS_SSE1_1 - Success
-- Performing Test CXX_HAS_SSE2_1
-- Performing Test CXX_HAS_SSE2_1 - Success
-- Performing Test CXX_HAS_SSE3_1
-- Performing Test CXX_HAS_SSE3_1 - Failed
-- Performing Test CXX_HAS_SSE3_2
-- Performing Test CXX_HAS_SSE3_2 - Success
-- Performing Test CXX_HAS_SSE4_1_1
-- Performing Test CXX_HAS_SSE4_1_1 - Failed
-- Performing Test CXX_HAS_SSE4_1_2
-- Performing Test CXX_HAS_SSE4_1_2 - Success
-- Performing Test CXX_HAS_SSE4_2_1
-- Performing Test CXX_HAS_SSE4_2_1 - Failed
-- Performing Test CXX_HAS_SSE4_2_2
-- Performing Test CXX_HAS_SSE4_2_2 - Success
-- Performing Test CXX_HAS_AVX_1
-- Performing Test CXX_HAS_AVX_1 - Failed
-- Performing Test CXX_HAS_AVX_2
-- Performing Test CXX_HAS_AVX_2 - Success
-- Performing Test CXX_HAS_AVX2_1
-- Performing Test CXX_HAS_AVX2_1 - Failed
-- Performing Test CXX_HAS_AVX2_2
-- Performing Test CXX_HAS_AVX2_2 - Success
-- SSE2 Found
-- SSE3 Found
-- AVX Found
-- AVX2 Found

Should we add sse2Support, sse3Support, avxSupport, avx2Support flags for better determinism?
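Something along the lines of the TensorFlow expression could look like this. It is only a sketch: the `NO_AVX`/`NO_AVX2` environment variables are assumptions about what PyTorch's build system honors, not verified switches.

```nix
# Hypothetical sketch of per-ISA flags, following the TensorFlow precedent.
{ lib, avxSupport ? false, avx2Support ? false, ... }:

buildPythonPackage {
  # ...
  # Disable ISA dispatch at configure time so the build is deterministic
  # regardless of the builder's CPU. The exact switches the PyTorch build
  # system honors would need to be checked against setup.py.
  preConfigure = lib.optionalString (!avxSupport) ''
    export NO_AVX=1
  '' + lib.optionalString (!avx2Support) ''
    export NO_AVX2=1
  '';
}
```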

Contributor:

-- Found OpenMP_CXX: -fopenmp  
-- Found OpenMP: TRUE   
-- Compiling with OpenMP support
-- Could not find hardware support for NEON on this machine.
-- No OMAP3 processor on this machine.
-- No OMAP4 processor on this machine.
-- Looking for cpuid.h
-- Looking for cpuid.h - found
-- Performing Test HAVE_GCC_GET_CPUID
-- Performing Test HAVE_GCC_GET_CPUID - Success
-- Performing Test NO_GCC_EBX_FPIC_BUG
-- Performing Test NO_GCC_EBX_FPIC_BUG - Success
-- Performing Test C_HAS_SSE1_1
-- Performing Test C_HAS_SSE1_1 - Success
-- Performing Test C_HAS_SSE2_1
-- Performing Test C_HAS_SSE2_1 - Success
-- Performing Test C_HAS_SSE3_1
-- Performing Test C_HAS_SSE3_1 - Failed
-- Performing Test C_HAS_SSE3_2
-- Performing Test C_HAS_SSE3_2 - Success
-- Performing Test C_HAS_SSE4_1_1
-- Performing Test C_HAS_SSE4_1_1 - Failed
-- Performing Test C_HAS_SSE4_1_2
-- Performing Test C_HAS_SSE4_1_2 - Success
-- Performing Test C_HAS_SSE4_2_1
-- Performing Test C_HAS_SSE4_2_1 - Failed
-- Performing Test C_HAS_SSE4_2_2
-- Performing Test C_HAS_SSE4_2_2 - Success
-- Performing Test C_HAS_AVX_1
-- Performing Test C_HAS_AVX_1 - Failed
-- Performing Test C_HAS_AVX_2
-- Performing Test C_HAS_AVX_2 - Success
-- Performing Test C_HAS_AVX2_1
-- Performing Test C_HAS_AVX2_1 - Failed
-- Performing Test C_HAS_AVX2_2
-- Performing Test C_HAS_AVX2_2 - Success
-- Performing Test CXX_HAS_SSE1_1
-- Performing Test CXX_HAS_SSE1_1 - Success
-- Performing Test CXX_HAS_SSE2_1
-- Performing Test CXX_HAS_SSE2_1 - Success
-- Performing Test CXX_HAS_SSE3_1
-- Performing Test CXX_HAS_SSE3_1 - Failed
-- Performing Test CXX_HAS_SSE3_2
-- Performing Test CXX_HAS_SSE3_2 - Success
-- Performing Test CXX_HAS_SSE4_1_1
-- Performing Test CXX_HAS_SSE4_1_1 - Failed
-- Performing Test CXX_HAS_SSE4_1_2
-- Performing Test CXX_HAS_SSE4_1_2 - Success
-- Performing Test CXX_HAS_SSE4_2_1
-- Performing Test CXX_HAS_SSE4_2_1 - Failed
-- Performing Test CXX_HAS_SSE4_2_2
-- Performing Test CXX_HAS_SSE4_2_2 - Success
-- Performing Test CXX_HAS_AVX_1
-- Performing Test CXX_HAS_AVX_1 - Failed
-- Performing Test CXX_HAS_AVX_2
-- Performing Test CXX_HAS_AVX_2 - Success
-- Performing Test CXX_HAS_AVX2_1
-- Performing Test CXX_HAS_AVX2_1 - Failed
-- Performing Test CXX_HAS_AVX2_2
-- Performing Test CXX_HAS_AVX2_2 - Success
-- SSE2 Found
-- SSE3 Found
-- AVX Found
-- AVX2 Found
-- Performing Test HAS_C11_ATOMICS
-- Performing Test HAS_C11_ATOMICS - Success
-- TH_SO_VERSION: 1
-- Atomics: using C11 intrinsics

Similar to NixOS#30058 for TensorFlow.

Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
akamaus (Contributor) commented May 4, 2018

@FRidh Why don't we merge this? Let's not wait until it no longer applies, as happened with my previous pull request. There are some issues on older systems, as we've seen, but I suppose the package works for the majority of users. The 0.2 version currently in nixpkgs is almost unusable because it lacks CUDA support altogether.

FRidh (Member) commented May 4, 2018

Sometimes a PR does not get any attention from any of the maintainers simply because they have not bothered with it. It's important in such cases that the community (say you and @andersk) push for the change, and one of the first steps would be to ask the maintainer of the expression (@teh) to review.

Looking at the change and the discussion, I don't see any clear consensus that it works, and thus it would appear to me as a work-in-progress. Furthermore, as the maintainer has not chimed in on it yet, I won't merge it yet.

teh (Contributor) commented May 4, 2018

This is a continuation of #32438 which I think was almost ready.

I'm OK with this being merged - I am already running a slight variant of this in our private nixpkgs tree.

@FRidh FRidh merged commit ce00943 into NixOS:master May 4, 2018