
pytorch: 0.2.0 → 0.3.1 with CUDA and cuDNN #38530

Merged
9 commits merged into NixOS:master on May 4, 2018
Conversation

andersk (Contributor) commented Apr 6, 2018

Motivation for this change

This upgrades PyTorch and adds CUDA and cuDNN support. It’s an updated version of #32438 “pytorch-0.3 with cuda and cudnn” and includes those commits.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option build-use-sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Fits CONTRIBUTING.md.

akamaus and others added 4 commits December 20, 2017 00:17
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Fixes this error:

In file included from /nix/store/gv7w3c71jg627cpcff04yi6kwzpzjyap-cudatoolkit-9.1.85.1/include/host_config.h:50:0,
                 from /nix/store/gv7w3c71jg627cpcff04yi6kwzpzjyap-cudatoolkit-9.1.85.1/include/cuda_runtime.h:78,
                 from <command-line>:0:
/nix/store/gv7w3c71jg627cpcff04yi6kwzpzjyap-cudatoolkit-9.1.85.1/include/crt/host_config.h:121:2: error: #error -- unsupported GNU version! gcc versions later than 6 are not supported!
 #error -- unsupported GNU version! gcc versions later than 6 are not supported!
  ^~~~~
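A hedged sketch of the kind of fix this commit likely applies (the wiring here is an assumption, not a quote from the actual commit): CUDA 9.1's nvcc rejects GCC 7 and later, so the derivation can be built with a GCC 6 toolchain instead.

```nix
# Hypothetical sketch: build with an nvcc-compatible GCC.
# `gcc6` and `overrideCC` exist in nixpkgs; applying them to the pytorch
# derivation this way is an assumption about the fix, not the actual commit.
{ pkgs ? import <nixpkgs> {} }:

pkgs.python36Packages.pytorch.override {
  # CUDA 9.1's host_config.h errors out on GCC > 6
  stdenv = pkgs.overrideCC pkgs.stdenv pkgs.gcc6;
}
```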

Signed-off-by: Anders Kaseorg <andersk@mit.edu>
andersk (Contributor, author) commented Apr 6, 2018

We may yet have to disable tests even for non-CUDA builds, because the “distributed tests for the TCP backend with file init_method” seem to fail in the sandbox with errors like this.

Running distributed tests for the TCP backend with file init_method
Process Process-2:
Traceback (most recent call last):
  File "/nix/store/lp7x6ziq5b2zihgad0i8a28xxvz5vnrr-python3-3.6.4/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/nix/store/lp7x6ziq5b2zihgad0i8a28xxvz5vnrr-python3-3.6.4/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "./test_distributed.py", line 551, in _run
    getattr(self, self.id().split(".")[2])()
  File "./test_distributed.py", line 510, in wrapper
    fn(self)
  File "./test_distributed.py", line 462, in test_all_gather
    group, group_id, rank = self._init_global_test()
  File "./test_distributed.py", line 111, in _init_global_test
    group = [i for i in range(0, dist.get_world_size())]
  File "/nix/store/xc4ipwb39f4gwblrkkanb0cpmqxin6zm-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/distributed/__init__.py", line 120, in get_world_size
    assert torch.distributed._initialized
AssertionError
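If the sandbox failure can't be fixed, one option is to patch the distributed tests out before the check phase. This is only a sketch, under the assumption that the tests are driven from test/run_test.sh (as a later build log in this thread suggests); the exact sed expression is a guess.

```nix
# Hypothetical sketch: drop the distributed tests from the test runner.
# The script name comes from the build log; the matching pattern is assumed.
postPatch = ''
  sed -i '/test_distributed/d' test/run_test.sh
'';
```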

andersk (Contributor, author) commented Apr 7, 2018

I patched out the test in question. Also, I found that the stub libcuda.so from cudatoolkit is sufficient to get the remaining tests running in the cudaSupport case, so I turned that back on.

};

pytorchWithoutCuda = self.pytorch.override {
cudaSupport = false;
Contributor:

I tried to build the package on a system without CUDA and found that setting cudaSupport = false is actually not enough. The CUDA library dependency was still being built that way; it disappeared when I set cudatoolkit and cudnn to null explicitly.
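A minimal sketch of what that override would look like (assuming the derivation takes `cudatoolkit` and `cudnn` as inputs, as this comment suggests):

```nix
pytorchWithoutCuda = self.pytorch.override {
  cudaSupport = false;
  # Setting the flag alone still pulled in the CUDA dependency;
  # nulling the inputs explicitly made it disappear.
  cudatoolkit = null;
  cudnn = null;
};
```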

Contributor Author:

I’d forgotten to let-bind cudatoolkit_joined, cudaStub, and cudaStubEnv. Does this work better now?

Contributor:

@andersk looks like I messed with the history. I actually tried my old version and stumbled upon the issue mentioned here: #32438 (comment)

Still, your commit 7a851ad fails for me with this message:

nix-shell -p python36Packages.pytorch --show-trace                                                                                  
error: while evaluating the attribute ‘buildInputs’ of the derivation ‘shell’ at /opt/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:98:11:
while evaluating the attribute ‘patches’ of the derivation ‘python3.6-pytorch-0.3.1’ at /opt/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:98:11:
while evaluating anonymous function at /opt/nixpkgs/pkgs/build-support/fetchpatch/default.nix:8:1, called from /opt/nixpkgs/pkgs/development/python-modules/pytorch/default.nix:41:6:
anonymous function at /opt/nixpkgs/pkgs/build-support/fetchurl/default.nix:38:1 called with unexpected argument ‘extraPrefix’, at /opt/nixpkgs/pkgs/build-support/fetchpatch/default.nix:10:1

Contributor Author:

You need to merge with master or release-18.03 or something.

Contributor:

I merged your branch onto a fresh master at e1dee4e as you suggested. It builds for me now, but fails during testing. I'm running this on an ancient P6200 processor; that's probably the reason.

$ nix-shell -p python36Packages.pytorchWithoutCuda
<...>
installing
/tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source/dist /tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source
Processing ./torch-0.3.1b0-cp36-cp36m-linux_x86_64.whl
Requirement already satisfied: numpy in /nix/store/q61pr9l2l31hqrkan0m4zd0wqv85cmnp-python3.6-numpy-1.14.1/lib/python3.6/site-packages (from torch==0.3.1b0)
Requirement already satisfied: pyyaml in /nix/store/rp8vvvf15fni2dxn1p46hv6ly3k0xca3-python3.6-PyYAML-3.12/lib/python3.6/site-packages (from torch==0.3.1b0)
Installing collected packages: torch
Successfully installed torch-0.3.1b0
/tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source
post-installation fixup
shrinking RPATHs of ELF executables and libraries in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/_thnn/_THNN.cpython-36m-x86_64-linux-gnu.so
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libTHNN.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libshm.so
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libTHS.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/torch_shm_manager
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libATen.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libTH.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/_dl.cpython-36m-x86_64-linux-gnu.so
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
strip is /nix/store/fzcs0fn6bb04m82frhlb78nc03ny3w55-binutils-2.28.1/bin/strip
stripping (with command strip and flags -S) in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib 
patching script interpreter paths in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1
checking for references to /tmp/nix-build-python3.6-pytorch-0.3.1.drv-0 in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1...
running install tests
/tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source/test /tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source
Running JIT tests
s.......s.s.s..s....s.....s......ss.........
----------------------------------------------------------------------
Ran 44 tests in 0.285s

OK (skipped=9)
Running torch tests
...test/run_test.sh: line 27: 29365 Illegal instruction     (core dumped) $PYCMD test_torch.py $@
builder for ‘/nix/store/4bs5qi4jpnslv3aqr8m8ffp19xy1bi9d-python3.6-pytorch-0.3.1.drv’ failed with exit code 132

Contributor:

OK. Builds with and without CUDA on more recent hardware.

Contributor Author:

What does this section of your build log look like on your P6200?

-- Performing Test C_HAS_SSE1_1
-- Performing Test C_HAS_SSE1_1 - Success
-- Performing Test C_HAS_SSE2_1
-- Performing Test C_HAS_SSE2_1 - Success
-- Performing Test C_HAS_SSE3_1
-- Performing Test C_HAS_SSE3_1 - Failed
-- Performing Test C_HAS_SSE3_2
-- Performing Test C_HAS_SSE3_2 - Success
-- Performing Test C_HAS_SSE4_1_1
-- Performing Test C_HAS_SSE4_1_1 - Failed
-- Performing Test C_HAS_SSE4_1_2
-- Performing Test C_HAS_SSE4_1_2 - Success
-- Performing Test C_HAS_SSE4_2_1
-- Performing Test C_HAS_SSE4_2_1 - Failed
-- Performing Test C_HAS_SSE4_2_2
-- Performing Test C_HAS_SSE4_2_2 - Success
-- Performing Test C_HAS_AVX_1
-- Performing Test C_HAS_AVX_1 - Failed
-- Performing Test C_HAS_AVX_2
-- Performing Test C_HAS_AVX_2 - Success
-- Performing Test C_HAS_AVX2_1
-- Performing Test C_HAS_AVX2_1 - Failed
-- Performing Test C_HAS_AVX2_2
-- Performing Test C_HAS_AVX2_2 - Success
-- Performing Test CXX_HAS_SSE1_1
-- Performing Test CXX_HAS_SSE1_1 - Success
-- Performing Test CXX_HAS_SSE2_1
-- Performing Test CXX_HAS_SSE2_1 - Success
-- Performing Test CXX_HAS_SSE3_1
-- Performing Test CXX_HAS_SSE3_1 - Failed
-- Performing Test CXX_HAS_SSE3_2
-- Performing Test CXX_HAS_SSE3_2 - Success
-- Performing Test CXX_HAS_SSE4_1_1
-- Performing Test CXX_HAS_SSE4_1_1 - Failed
-- Performing Test CXX_HAS_SSE4_1_2
-- Performing Test CXX_HAS_SSE4_1_2 - Success
-- Performing Test CXX_HAS_SSE4_2_1
-- Performing Test CXX_HAS_SSE4_2_1 - Failed
-- Performing Test CXX_HAS_SSE4_2_2
-- Performing Test CXX_HAS_SSE4_2_2 - Success
-- Performing Test CXX_HAS_AVX_1
-- Performing Test CXX_HAS_AVX_1 - Failed
-- Performing Test CXX_HAS_AVX_2
-- Performing Test CXX_HAS_AVX_2 - Success
-- Performing Test CXX_HAS_AVX2_1
-- Performing Test CXX_HAS_AVX2_1 - Failed
-- Performing Test CXX_HAS_AVX2_2
-- Performing Test CXX_HAS_AVX2_2 - Success
-- SSE2 Found
-- SSE3 Found
-- AVX Found
-- AVX2 Found

Should we add sse2Support, sse3Support, avxSupport, avx2Support flags for better determinism?
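Something along the lines of the TensorFlow expression could look like this. It is only a sketch: the `NO_AVX`/`NO_AVX2` environment variables are assumptions about what PyTorch's build system honors, not verified switches.

```nix
# Hypothetical sketch of per-ISA flags, following the TensorFlow precedent.
{ lib, avxSupport ? false, avx2Support ? false, ... }:

buildPythonPackage {
  # ...
  # Disable ISA dispatch at configure time so the build is deterministic
  # regardless of the builder's CPU. The exact switches the PyTorch build
  # system honors would need to be checked against setup.py.
  preConfigure = lib.optionalString (!avxSupport) ''
    export NO_AVX=1
  '' + lib.optionalString (!avx2Support) ''
    export NO_AVX2=1
  '';
}
```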

Contributor:

-- Found OpenMP_CXX: -fopenmp  
-- Found OpenMP: TRUE   
-- Compiling with OpenMP support
-- Could not find hardware support for NEON on this machine.
-- No OMAP3 processor on this machine.
-- No OMAP4 processor on this machine.
-- Looking for cpuid.h
-- Looking for cpuid.h - found
-- Performing Test HAVE_GCC_GET_CPUID
-- Performing Test HAVE_GCC_GET_CPUID - Success
-- Performing Test NO_GCC_EBX_FPIC_BUG
-- Performing Test NO_GCC_EBX_FPIC_BUG - Success
-- Performing Test C_HAS_SSE1_1
-- Performing Test C_HAS_SSE1_1 - Success
-- Performing Test C_HAS_SSE2_1
-- Performing Test C_HAS_SSE2_1 - Success
-- Performing Test C_HAS_SSE3_1
-- Performing Test C_HAS_SSE3_1 - Failed
-- Performing Test C_HAS_SSE3_2
-- Performing Test C_HAS_SSE3_2 - Success
-- Performing Test C_HAS_SSE4_1_1
-- Performing Test C_HAS_SSE4_1_1 - Failed
-- Performing Test C_HAS_SSE4_1_2
-- Performing Test C_HAS_SSE4_1_2 - Success
-- Performing Test C_HAS_SSE4_2_1
-- Performing Test C_HAS_SSE4_2_1 - Failed
-- Performing Test C_HAS_SSE4_2_2
-- Performing Test C_HAS_SSE4_2_2 - Success
-- Performing Test C_HAS_AVX_1
-- Performing Test C_HAS_AVX_1 - Failed
-- Performing Test C_HAS_AVX_2
-- Performing Test C_HAS_AVX_2 - Success
-- Performing Test C_HAS_AVX2_1
-- Performing Test C_HAS_AVX2_1 - Failed
-- Performing Test C_HAS_AVX2_2
-- Performing Test C_HAS_AVX2_2 - Success
-- Performing Test CXX_HAS_SSE1_1
-- Performing Test CXX_HAS_SSE1_1 - Success
-- Performing Test CXX_HAS_SSE2_1
-- Performing Test CXX_HAS_SSE2_1 - Success
-- Performing Test CXX_HAS_SSE3_1
-- Performing Test CXX_HAS_SSE3_1 - Failed
-- Performing Test CXX_HAS_SSE3_2
-- Performing Test CXX_HAS_SSE3_2 - Success
-- Performing Test CXX_HAS_SSE4_1_1
-- Performing Test CXX_HAS_SSE4_1_1 - Failed
-- Performing Test CXX_HAS_SSE4_1_2
-- Performing Test CXX_HAS_SSE4_1_2 - Success
-- Performing Test CXX_HAS_SSE4_2_1
-- Performing Test CXX_HAS_SSE4_2_1 - Failed
-- Performing Test CXX_HAS_SSE4_2_2
-- Performing Test CXX_HAS_SSE4_2_2 - Success
-- Performing Test CXX_HAS_AVX_1
-- Performing Test CXX_HAS_AVX_1 - Failed
-- Performing Test CXX_HAS_AVX_2
-- Performing Test CXX_HAS_AVX_2 - Success
-- Performing Test CXX_HAS_AVX2_1
-- Performing Test CXX_HAS_AVX2_1 - Failed
-- Performing Test CXX_HAS_AVX2_2
-- Performing Test CXX_HAS_AVX2_2 - Success
-- SSE2 Found
-- SSE3 Found
-- AVX Found
-- AVX2 Found
-- Performing Test HAS_C11_ATOMICS
-- Performing Test HAS_C11_ATOMICS - Success
-- TH_SO_VERSION: 1
-- Atomics: using C11 intrinsics

Similar to NixOS#30058 for TensorFlow.

Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
akamaus (Contributor) commented May 4, 2018

@FRidh Why don't we merge this? Let's not wait until it no longer applies, as happened with my previous pull request. There are some issues on older systems, as we've seen, but I suppose the package works for the majority of users. The 0.2 version currently in nixpkgs is almost unusable because it lacks CUDA support altogether.

FRidh (Member) commented May 4, 2018

Sometimes a PR does not get any attention from any of the maintainers simply because they have not bothered with it. It's important in such cases that the community (say you and @andersk) push for the change, and one of the first steps would be to ask the maintainer of the expression (@teh) to review.

Looking at the change and the discussion, I don't see any clear consensus that it works, and thus it would appear to me as a work-in-progress. Furthermore, as the maintainer has not chimed in on it yet, I won't merge it yet.

teh (Contributor) commented May 4, 2018

This is a continuation of #32438 which I think was almost ready.

I'm OK with this being merged - I am already running a slight variant of this in our private nixpkgs tree.

@FRidh FRidh merged commit ce00943 into NixOS:master May 4, 2018