pytorch: 0.2.0 → 0.3.1 with CUDA and cuDNN #38530
Conversation
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Fixes this error:

In file included from /nix/store/gv7w3c71jg627cpcff04yi6kwzpzjyap-cudatoolkit-9.1.85.1/include/host_config.h:50:0,
                 from /nix/store/gv7w3c71jg627cpcff04yi6kwzpzjyap-cudatoolkit-9.1.85.1/include/cuda_runtime.h:78,
                 from <command-line>:0:
/nix/store/gv7w3c71jg627cpcff04yi6kwzpzjyap-cudatoolkit-9.1.85.1/include/crt/host_config.h:121:2: error: #error -- unsupported GNU version! gcc versions later than 6 are not supported!
 #error -- unsupported GNU version! gcc versions later than 6 are not supported!
 ^~~~~

Signed-off-by: Anders Kaseorg <andersk@mit.edu>
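One common way this class of error is handled in nixpkgs is to point nvcc at an older host compiler. This is only a hedged sketch, not necessarily how this PR fixes it; that the derivation takes `gcc6` as an argument and that PyTorch's build honors `CUDA_HOST_COMPILER` are assumptions:

```nix
# Sketch (illustrative): give nvcc a GCC <= 6 host compiler, since
# CUDA 9.1's host_config.h rejects newer GCC versions.
{ gcc6, ... }:

{
  preConfigure = ''
    export CUDA_HOST_COMPILER=${gcc6}/bin/gcc
  '';
}
```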
We may yet have to disable tests even for non-CUDA builds, because the “distributed tests for the TCP backend with file init_method” seem to fail in the sandbox with errors like this.
I patched out the test in question. Also, I found that the stub libcuda.so from cudatoolkit is sufficient to get the remaining tests running in the sandbox.
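A hedged sketch of what "patched out the test" could look like in the derivation; `substituteInPlace` is a standard nixpkgs helper, but the exact file and string being edited here are hypothetical:

```nix
# Illustrative only: skip the distributed TCP/file-init tests that fail
# inside the Nix build sandbox. The target string is an assumption.
postPatch = ''
  substituteInPlace test/run_test.sh \
    --replace '$PYCMD test_distributed.py' 'true # skipped: fails in sandbox'
'';
```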
pytorchWithoutCuda = self.pytorch.override {
  cudaSupport = false;
};
I tried to build the package on a system without CUDA and found that setting cudaSupport = false is actually not enough: the CUDA library dependency still gets built. It only disappeared when I set cudatoolkit and cudnn to null explicitly.
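The workaround described above, as a sketch (attribute names are taken from the comment; that `pytorch` exposes `cudatoolkit` and `cudnn` as overridable arguments is an assumption):

```nix
# Null out the CUDA inputs explicitly in addition to the feature flag,
# so a CUDA-less machine never pulls in or builds the CUDA libraries.
pytorchWithoutCuda = self.pytorch.override {
  cudaSupport = false;
  cudatoolkit = null;
  cudnn = null;
};
```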
I’d forgotten to let-bind cudatoolkit_joined, cudaStub, and cudaStubEnv. Does this work better now?
@andersk Looks like I messed with history. I actually tried my old version and stumbled upon the issue mentioned here: #32438 (comment).
Still, your commit 7a851ad fails for me with this message:
nix-shell -p python36Packages.pytorch --show-trace
error: while evaluating the attribute ‘buildInputs’ of the derivation ‘shell’ at /opt/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:98:11:
while evaluating the attribute ‘patches’ of the derivation ‘python3.6-pytorch-0.3.1’ at /opt/nixpkgs/pkgs/stdenv/generic/make-derivation.nix:98:11:
while evaluating anonymous function at /opt/nixpkgs/pkgs/build-support/fetchpatch/default.nix:8:1, called from /opt/nixpkgs/pkgs/development/python-modules/pytorch/default.nix:41:6:
anonymous function at /opt/nixpkgs/pkgs/build-support/fetchurl/default.nix:38:1 called with unexpected argument ‘extraPrefix’, at /opt/nixpkgs/pkgs/build-support/fetchpatch/default.nix:10:1
You need to merge with master or release-18.03 or something.
I merged your branch into fresh master at e1dee4e as you suggested. It builds for me now, but fails during testing. I'm running this on an ancient P6200 processor; that's probably the reason.
$ nix-shell -p python36Packages.pytorchWithoutCuda
<...>
installing
/tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source/dist /tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source
Processing ./torch-0.3.1b0-cp36-cp36m-linux_x86_64.whl
Requirement already satisfied: numpy in /nix/store/q61pr9l2l31hqrkan0m4zd0wqv85cmnp-python3.6-numpy-1.14.1/lib/python3.6/site-packages (from torch==0.3.1b0)
Requirement already satisfied: pyyaml in /nix/store/rp8vvvf15fni2dxn1p46hv6ly3k0xca3-python3.6-PyYAML-3.12/lib/python3.6/site-packages (from torch==0.3.1b0)
Installing collected packages: torch
Successfully installed torch-0.3.1b0
/tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source
post-installation fixup
shrinking RPATHs of ELF executables and libraries in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/_thnn/_THNN.cpython-36m-x86_64-linux-gnu.so
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libTHNN.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libshm.so
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libTHS.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/torch_shm_manager
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libATen.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/lib/libTH.so.1
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/_dl.cpython-36m-x86_64-linux-gnu.so
shrinking /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
strip is /nix/store/fzcs0fn6bb04m82frhlb78nc03ny3w55-binutils-2.28.1/bin/strip
stripping (with command strip and flags -S) in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1/lib
patching script interpreter paths in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1
checking for references to /tmp/nix-build-python3.6-pytorch-0.3.1.drv-0 in /nix/store/5v8kzkyk9k11icfkzf0vbf7vr3rpqgdw-python3.6-pytorch-0.3.1...
running install tests
/tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source/test /tmp/nix-build-python3.6-pytorch-0.3.1.drv-0/source
Running JIT tests
s.......s.s.s..s....s.....s......ss.........
----------------------------------------------------------------------
Ran 44 tests in 0.285s
OK (skipped=9)
Running torch tests
...test/run_test.sh: line 27: 29365 Illegal instruction (core dumped) $PYCMD test_torch.py $@
builder for ‘/nix/store/4bs5qi4jpnslv3aqr8m8ffp19xy1bi9d-python3.6-pytorch-0.3.1.drv’ failed with exit code 132
OK. Builds with and without CUDA on more recent hardware.
What does this section of your build log look like on your P6200?
-- Performing Test C_HAS_SSE1_1
-- Performing Test C_HAS_SSE1_1 - Success
-- Performing Test C_HAS_SSE2_1
-- Performing Test C_HAS_SSE2_1 - Success
-- Performing Test C_HAS_SSE3_1
-- Performing Test C_HAS_SSE3_1 - Failed
-- Performing Test C_HAS_SSE3_2
-- Performing Test C_HAS_SSE3_2 - Success
-- Performing Test C_HAS_SSE4_1_1
-- Performing Test C_HAS_SSE4_1_1 - Failed
-- Performing Test C_HAS_SSE4_1_2
-- Performing Test C_HAS_SSE4_1_2 - Success
-- Performing Test C_HAS_SSE4_2_1
-- Performing Test C_HAS_SSE4_2_1 - Failed
-- Performing Test C_HAS_SSE4_2_2
-- Performing Test C_HAS_SSE4_2_2 - Success
-- Performing Test C_HAS_AVX_1
-- Performing Test C_HAS_AVX_1 - Failed
-- Performing Test C_HAS_AVX_2
-- Performing Test C_HAS_AVX_2 - Success
-- Performing Test C_HAS_AVX2_1
-- Performing Test C_HAS_AVX2_1 - Failed
-- Performing Test C_HAS_AVX2_2
-- Performing Test C_HAS_AVX2_2 - Success
-- Performing Test CXX_HAS_SSE1_1
-- Performing Test CXX_HAS_SSE1_1 - Success
-- Performing Test CXX_HAS_SSE2_1
-- Performing Test CXX_HAS_SSE2_1 - Success
-- Performing Test CXX_HAS_SSE3_1
-- Performing Test CXX_HAS_SSE3_1 - Failed
-- Performing Test CXX_HAS_SSE3_2
-- Performing Test CXX_HAS_SSE3_2 - Success
-- Performing Test CXX_HAS_SSE4_1_1
-- Performing Test CXX_HAS_SSE4_1_1 - Failed
-- Performing Test CXX_HAS_SSE4_1_2
-- Performing Test CXX_HAS_SSE4_1_2 - Success
-- Performing Test CXX_HAS_SSE4_2_1
-- Performing Test CXX_HAS_SSE4_2_1 - Failed
-- Performing Test CXX_HAS_SSE4_2_2
-- Performing Test CXX_HAS_SSE4_2_2 - Success
-- Performing Test CXX_HAS_AVX_1
-- Performing Test CXX_HAS_AVX_1 - Failed
-- Performing Test CXX_HAS_AVX_2
-- Performing Test CXX_HAS_AVX_2 - Success
-- Performing Test CXX_HAS_AVX2_1
-- Performing Test CXX_HAS_AVX2_1 - Failed
-- Performing Test CXX_HAS_AVX2_2
-- Performing Test CXX_HAS_AVX2_2 - Success
-- SSE2 Found
-- SSE3 Found
-- AVX Found
-- AVX2 Found
Should we add sse2Support, sse3Support, avxSupport, and avx2Support flags for better determinism?
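A sketch of what such flags might look like, following the optional-argument pattern used elsewhere in nixpkgs. The flag names come from the comment above; how they would be wired into PyTorch's CMake feature detection is an assumption, and the NO_* build variables shown are illustrative, not PyTorch's actual switches:

```nix
# Hypothetical: make CPU-feature use explicit instead of letting CMake
# probe the build machine, so the resulting binary does not depend on
# whatever CPU the builder happened to have.
{ sse2Support ? true
, sse3Support ? false
, avxSupport ? false
, avx2Support ? false
, ...
}@args:

{
  preConfigure = ''
    export NO_AVX=${if avxSupport then "0" else "1"}
    export NO_AVX2=${if avx2Support then "0" else "1"}
  '';
}
```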
-- Found OpenMP_CXX: -fopenmp
-- Found OpenMP: TRUE
-- Compiling with OpenMP support
-- Could not find hardware support for NEON on this machine.
-- No OMAP3 processor on this machine.
-- No OMAP4 processor on this machine.
-- Looking for cpuid.h
-- Looking for cpuid.h - found
-- Performing Test HAVE_GCC_GET_CPUID
-- Performing Test HAVE_GCC_GET_CPUID - Success
-- Performing Test NO_GCC_EBX_FPIC_BUG
-- Performing Test NO_GCC_EBX_FPIC_BUG - Success
-- Performing Test C_HAS_SSE1_1
-- Performing Test C_HAS_SSE1_1 - Success
-- Performing Test C_HAS_SSE2_1
-- Performing Test C_HAS_SSE2_1 - Success
-- Performing Test C_HAS_SSE3_1
-- Performing Test C_HAS_SSE3_1 - Failed
-- Performing Test C_HAS_SSE3_2
-- Performing Test C_HAS_SSE3_2 - Success
-- Performing Test C_HAS_SSE4_1_1
-- Performing Test C_HAS_SSE4_1_1 - Failed
-- Performing Test C_HAS_SSE4_1_2
-- Performing Test C_HAS_SSE4_1_2 - Success
-- Performing Test C_HAS_SSE4_2_1
-- Performing Test C_HAS_SSE4_2_1 - Failed
-- Performing Test C_HAS_SSE4_2_2
-- Performing Test C_HAS_SSE4_2_2 - Success
-- Performing Test C_HAS_AVX_1
-- Performing Test C_HAS_AVX_1 - Failed
-- Performing Test C_HAS_AVX_2
-- Performing Test C_HAS_AVX_2 - Success
-- Performing Test C_HAS_AVX2_1
-- Performing Test C_HAS_AVX2_1 - Failed
-- Performing Test C_HAS_AVX2_2
-- Performing Test C_HAS_AVX2_2 - Success
-- Performing Test CXX_HAS_SSE1_1
-- Performing Test CXX_HAS_SSE1_1 - Success
-- Performing Test CXX_HAS_SSE2_1
-- Performing Test CXX_HAS_SSE2_1 - Success
-- Performing Test CXX_HAS_SSE3_1
-- Performing Test CXX_HAS_SSE3_1 - Failed
-- Performing Test CXX_HAS_SSE3_2
-- Performing Test CXX_HAS_SSE3_2 - Success
-- Performing Test CXX_HAS_SSE4_1_1
-- Performing Test CXX_HAS_SSE4_1_1 - Failed
-- Performing Test CXX_HAS_SSE4_1_2
-- Performing Test CXX_HAS_SSE4_1_2 - Success
-- Performing Test CXX_HAS_SSE4_2_1
-- Performing Test CXX_HAS_SSE4_2_1 - Failed
-- Performing Test CXX_HAS_SSE4_2_2
-- Performing Test CXX_HAS_SSE4_2_2 - Success
-- Performing Test CXX_HAS_AVX_1
-- Performing Test CXX_HAS_AVX_1 - Failed
-- Performing Test CXX_HAS_AVX_2
-- Performing Test CXX_HAS_AVX_2 - Success
-- Performing Test CXX_HAS_AVX2_1
-- Performing Test CXX_HAS_AVX2_1 - Failed
-- Performing Test CXX_HAS_AVX2_2
-- Performing Test CXX_HAS_AVX2_2 - Success
-- SSE2 Found
-- SSE3 Found
-- AVX Found
-- AVX2 Found
-- Performing Test HAS_C11_ATOMICS
-- Performing Test HAS_C11_ATOMICS - Success
-- TH_SO_VERSION: 1
-- Atomics: using C11 intrinsics
Similar to NixOS#30058 for TensorFlow. Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
@FRidh Why don't we merge it? Let's not wait until it no longer applies, like my previous pull request. There are some issues on older systems, as we've seen, but I suppose the package works for the majority of users. Currently we have version 0.2 in nixpkgs, which is almost unusable because it lacks CUDA support altogether.
Sometimes a PR does not get any attention from any of the maintainers simply because they have not bothered with it. It's important in such cases that the community (say you and @andersk) push for the change, and one of the first steps would be to ask the maintainer of the expression (@teh) to review. Looking at the change and the discussion, I don't see any clear consensus that it works, and thus it would appear to me as a work-in-progress. Furthermore, as the maintainer has not chimed in on it yet, I won't merge it yet.
This is a continuation of #32438, which I think was almost ready. I'm OK with this being merged - I am already running a slight variant of this in our private nixpkgs tree.
Motivation for this change
This upgrades PyTorch and adds CUDA and cuDNN support. It’s an updated version of #32438 “pytorch-0.3 with cuda and cudnn” and includes those commits.
Things done
- Tested using sandboxing (build-use-sandbox in nix.conf on non-NixOS)
- Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
- Tested execution of all binary files (usually in ./result/bin/)