
pytorch: 1.0.0 -> 1.2.0 #65041

Merged
merged 4 commits into NixOS:master on Oct 23, 2019
Conversation

@stites (Member) commented Jul 18, 2019

Motivation for this change

Closes #63073
Closes #65041
Closes #67468

This depends on @tbenst's #61347 (thank you!). See #63073 and this PR thread for more details.

Work can be previewed at stites/pytorch-world. Building from source takes a long time, so please check out the hosted cachix binaries.

Changelog:

  • updates to pytorch-1.2.0
  • runs all tests except for utils
  • adds a cudaArchList argument that allows users to test on the latest NVIDIA hardware without waiting for this package to be updated.
  • adds mkl support to magma-2.5.0
  • adds CUDA support to openMPI
  • adds openMPI support to pytorch
  • allows users to build NamedTensors
  • allows users to get access to generated binaries from build process
  • adds .dev output which gives top-level access to generated C/C++ code (libtorch).
  • adds darwin support if cudaSupport is disabled.
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@tbenst (Contributor) left a comment

Thanks for working on this! Hopefully we can get this as feature complete as possible.

@@ -72,6 +67,12 @@ in buildPythonPackage rec {
PYTORCH_BUILD_VERSION = version;
PYTORCH_BUILD_NUMBER = 0;

USE_FBGEMM = 0; # this can't build because of CMAKE downloads
USE_MKLDNN = 0; # mkl with cuda is broken
USE_NCCL = 0; # multigpu looks broken
Contributor

Did you try including NCCL?

@stites (Member, Author) commented Jul 24, 2019

Yes, unfortunately. I reviewed setup.py and tried using their bundled NCCL (via USE_SYSTEM_NCCL=0), setting the NCCL variables manually (NCCL_ROOT_DIR, NCCL_LIB_DIR, NCCL_INCLUDE_DIR), and making sure that nccl.out and nccl.dev are included.

None of these seem to work. You will quickly (<10m) get the following errors:

nvlink error   : Undefined reference to '_Z25ncclAllReduceRing_min_f32P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error   : Undefined reference to '_Z27ncclAllReduceRingLL_min_f32P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error   : Undefined reference to '_Z25ncclAllReduceTree_min_f32P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error   : Undefined reference to '_Z27ncclAllReduceTreeLL_min_f32P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error   : Undefined reference to '_Z25ncclAllReduceRing_min_f64P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error   : Undefined reference to '_Z27ncclAllReduceRingLL_min_f64P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error   : Undefined reference to '_Z25ncclAllReduceTree_min_f64P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error   : Undefined reference to '_Z27ncclAllReduceTreeLL_min_f64P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
make[5]: *** [Makefile:68: /build/source/build/nccl/obj/collectives/device/devlink.o] Error 255
make[4]: *** [Makefile:44: /build/source/build/nccl/obj/collectives/device/colldevice.a] Error 2
make[4]: *** Waiting for unfinished jobs....

There are a lot of these undefined references, all coming from /build/source/build/nccl/obj/collectives/device/functions.o.

@stites (Member, Author) commented Jul 24, 2019

Maybe this is because nvlink is attempting to perform checks in the GPU runtime?

Update: I found the symbols in "${nccl.dev}"/lib/libnccl_static.a but they seem to fail lookups on build, even when they are included in $NIX_LDFLAGS. Manually setting NCCL_* environment variables seems to fix this!
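
For reference, a minimal sketch of what that looks like in the derivation (the exact nccl output layout here is an assumption):

  # Sketch only: point the pytorch build at nccl explicitly instead of
  # relying on its own detection logic.
  NCCL_ROOT_DIR = "${nccl}";
  NCCL_LIB_DIR = "${nccl.out}/lib";
  NCCL_INCLUDE_DIR = "${nccl.dev}/include";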

@@ -36,14 +39,6 @@ in buildPythonPackage rec {
sha256 = "076cpbig4sywn9vv674c0xdg832sdrd5pk1d0725pjkm436kpvlm";
Contributor

Is this SHA correct? Looks the same to me as 1.0.0?

@stites (Member, Author)

Ah! Yes, I'm primarily working off of https://github.com/stites/libtorch-nix and missed this when transferring over details. Good catch -- I'm a bit surprised this slipped through, since I was building off of my fork (using the command from the PR message).

@@ -72,6 +67,12 @@ in buildPythonPackage rec {
PYTORCH_BUILD_VERSION = version;
PYTORCH_BUILD_NUMBER = 0;

USE_FBGEMM = 0; # this can't build because of CMAKE downloads
Contributor

Worth reviewing the instructions at https://github.com/pytorch/FBGEMM. Looks like we can avoid downloads by setting ASMJIT_SRC_DIR, CPUINFO_SRC_DIR, GOOGLETEST_SOURCE_DIR.

@stites (Member, Author)

So I did set these as environment variables, using pointers from your gist: https://gist.github.com/tbenst/15495f97ffe2a71dc18c49e247af8f51#file-pytorch_11-nix-L62

Unfortunately, these didn't seem to propagate through to CMake. I'll do a round-two later.
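
For context, the attempt looked roughly like the following sketch (the *_SRC_DIR variables come from the FBGEMM instructions above; asmjit-src, cpuinfo-src, and googletest-src are hypothetical pinned source derivations):

  # Attempted (did not propagate through to CMake): point FBGEMM's build
  # at pre-fetched copies of its vendored dependencies.
  ASMJIT_SRC_DIR = "${asmjit-src}";
  CPUINFO_SRC_DIR = "${cpuinfo-src}";
  GOOGLETEST_SOURCE_DIR = "${googletest-src}";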

@@ -72,6 +67,12 @@ in buildPythonPackage rec {
PYTORCH_BUILD_VERSION = version;
PYTORCH_BUILD_NUMBER = 0;

USE_FBGEMM = 0; # this can't build because of CMAKE downloads
USE_MKLDNN = 0; # mkl with cuda is broken
@stites (Member, Author)

MKL with CUDA is not broken, actually -- just MKLDNN. The problem seems to be a warning that is treated as an error:

/build/source/third_party/ideep/mkl-dnn/src/cpu/jit_uni_reorder_utils.cpp: In function ‘void mkldnn::impl::cpu::tr::prb_normalize(mkldnn::impl::cpu::tr::prb_t&)’:
/build/source/third_party/ideep/mkl-dnn/src/cpu/jit_uni_reorder_utils.cpp:273:29: error: array subscript is above array bounds [-Werror=array-bounds]
                 || p.nodes[j].os < p.nodes[min_pos].os
                    ~~~~~~~~~^
/build/source/third_party/ideep/mkl-dnn/src/cpu/jit_uni_reorder_utils.cpp:276:37: error: array subscript is above array bounds [-Werror=array-bounds]
                         && p.nodes[j].n < p.nodes[min_pos].n);
                            ~~~~~~~~~^
cc1plus: all warnings being treated as errors
make[2]: *** [third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/build.make:1272: third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/cpu/jit_uni_reorder_utils.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....

@stites (Member, Author)

Found the issue: pytorch/pytorch#22346

@stites (Member, Author)

I missed the workaround at https://github.com/NixOS/nixpkgs/blob/fb76f093c9ab204fbec9dc9c56cb9b4a81f43bc7/pkgs/development/python-modules/pytorch/default.nix#L79

I'll be adding an mklSupport ? false argument which will swap out the numpy BLAS implementation so that this is more explicit. The working code is similar to my_magma:

  my_numpy = if mklSupport && numpy.blasImplementation != "mkl" then numpy.override { blas = mkl; } else numpy;
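
The magma override follows the same pattern (a sketch; whether magma's derivation exposes an argument with this exact name is an assumption):

  my_magma = if mklSupport then magma.override { mklSupport = true; } else magma;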

@stites stites mentioned this pull request Jul 25, 2019
@stites stites changed the title pytorch: 1.0.0 -> 1.1.0 pytorch: 1.0.0 -> 1.1.0 (WIP, depends on #61347) Jul 25, 2019
@stites (Member, Author) commented Jul 25, 2019

I've just rebased this on top of master, which means I need to cherry-pick and remove the #61347 commit (65494f2) before this is finalized. I don't think this should be a problem since I believe everything needs to be squashed anyway.

I've updated the PR with the following:

  • NCCL support is back
  • DISTRIBUTED is turned back on
  • OpenMPI support is now available (I needed to get the latest master so that I could update its default.nix with CUDA support). I am tempted to change openMPISupport to distributedSupport since they seem synonymous.
  • SHA has been corrected
  • mkl support is synced with numpy and correctly builds MKL_DNN
  • ninja and setuptools have been added to the build (pytorch makes them optional, but I think this speeds up the build a bit).

Things that still need to be worked on (a check means that I have these working locally and/or in libtorch-nix).

For updates on this work please check out https://github.com/stites/libtorch-nix. @tscholak is also experimentally building the C++ libtorch on darwin and might also be able to verify pytorch compatibility on OSX (possibly disabling cuda support).

@stites (Member, Author) commented Jul 26, 2019

FBGEMM works in https://github.com/stites/libtorch-nix/blob/master/deps/fbgemm.nix but, unfortunately, setup.py seems to hide arguments passed into cmakeFlags from cmake.

The problem, found in #61820, is that pytorch didn't have the fbgemm dependencies under submodule control until a commit in May (1.1.0 was released in April). Unfortunately, git submodules cannot be patched. Furthermore, $src is read-only, so hard-coding cmake variables (via a preConfigure command: sed -i '10a\\\nset(SOURCE_DIR ${source})' $src/CMakeLists.txt) is also not an option.

Since FBGEMM is optional, I think the simplest solution is to put a version bound that disables FBGEMM unless pytorch is >1.1.0 (see the sketch below). This way FBGEMM will automatically turn on if someone bumps the version and forgets about this workaround. According to the milestones, it looks like 1.1.1 might be cut for release soon.
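
A minimal sketch of such a gate (names taken from the derivation above; the threshold is an assumption based on this discussion):

  # Hypothetical version gate: FBGEMM stays off for 1.1.0 and turns itself
  # back on automatically once the version is bumped past it.
  USE_FBGEMM = if lib.versionAtLeast version "1.1.1" then 1 else 0;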

@stites (Member, Author) commented Jul 30, 2019

I asked around on the pytorch slack and the 1.1.1 release is going to be cut this week, so I think waiting is the best course of action.

For posterity, if anyone is trying to build pytorch-1.1.0, this PR is fully functional without FBGEMM -- which provides optional, low-precision general matrix multiplication for small batch sizes, various MM-based optimizations, and fused operators. If you are working off of an x86 CPU backend, however, having FBGEMM might be a bit of a necessity.

@stites (Member, Author) commented Aug 8, 2019

https://github.com/pytorch/pytorch/releases/tag/v1.2.0

Looks like they skipped the 1.1.1 release. I'll update this PR for pytorch-1.2.0.

Update: unfortunately, pytorch-1.2.0 brings back the NCCL issues.

Here is the build matrix I am working with:

| Python-3.6.9 support                | pytorch-v1.1.0     | pytorch-v1.2.0 |
|-------------------------------------|--------------------|----------------|
| vanilla                             | ✔️                 | ✔️             |
| mkl+mkldnn                          | ✔️                 | ✔️             |
| openMPI                             | ✔️                 | ✔️             |
| FBGEMM                              |                    | ✔️             |
| Full CPU (all above)*               | ✔️                 | ✔️             |
| Full CPU + NamedTensors + Binaries  | :grey_exclamation: | ✔️             |
| CUDA                                | ✔️                 | ✔️             |
| CUDA+NCCL                           | ✔️                 | ✔️             |
| CUDA+NCCL+openMPI                   | ✔️                 | ✔️             |
| CUDA+NCCL+openMPI+mkl               | ✔️                 | ✔️             |
| Full CUDA (all above)*              | ✔️                 | ✔️             |
| Full CUDA + NamedTensors + Binaries | :grey_exclamation: | ✔️             |

The ":grey_exclamation:" means that namedtensors and binaries weren't attempted in pytorch-v1.1.0.

The "*" implies that FBGEMM is not included in any of the "Full" builds in pytorch-v1.1.0, but is included for all of pytorch-v1.2.0

I've also started including previously ignored tests for pytorch-1.2.0. It looks like we can get away with only disabling utils, which relies on making a git clone.
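
A sketch of what that exclusion could look like (assuming pytorch's run_test.py accepts the module name via --exclude):

  # Run pytorch's test suite but skip the utils tests, which try to git-clone.
  checkPhase = ''
    ${python.interpreter} test/run_test.py --exclude utils
  '';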

@stites stites changed the title pytorch: 1.0.0 -> 1.1.0 (WIP, depends on #61347) pytorch: 1.0.0 -> 1.2.0 (WIP, depends on #61347) Aug 8, 2019
@stites (Member, Author) commented Aug 14, 2019

Still working on this, slowly, in pytorch-world and am updating the above build matrix to cut down on GitHub notifications -- I'll make a final commit and message eventually. I am also going to set platforms = with lib.platforms; [linux] ++ lib.optionals (!cudaSupport) [darwin]; unless there are any objections.

@stites (Member, Author) commented Aug 14, 2019

I made an update-worthy breakthrough for pytorch-1.2.0. The errors I saw weren't with NCCL, but with the 3.0 CUDA architectures, documented in pytorch/pytorch#23573. I'm currently trying to figure out how to polish this up and did notice that cmake was throwing this warning:

CMake Warning at cmake/public/utils.cmake:172 (message):
  In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
  to cmake instead of implicitly setting it as an env variable.  This will
  become a FATAL_ERROR in future version of pytorch.

Which is quite troubling. So far we have been using buildPythonPackage to install pytorch, but I think we could use the Adjust Build Options instructions for more flexibility around cmake (this would have cracked the FBGEMM problem in 1.1.0), as well as future-proof the pytorch derivation.

I'll be looking at adjusting buildPythonPackage next.


NOTE: Because of sandboxing, the build can't auto-detect the hardware's CUDA architecture, so there is also now a problem: new architectures won't be supported until they are explicitly added to this derivation.
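
A sketch of passing the list explicitly (cudaArchList is the argument this PR adds; the default values shown are placeholders, not a vetted list):

  { cudaArchList ? null, ... }:
  # ...
  # Sketch: bypass auto-detection, which sandboxing makes impossible, by
  # handing cmake an explicit architecture list.
  TORCH_CUDA_ARCH_LIST = lib.concatStringsSep ";"
    (if cudaArchList != null then cudaArchList else [ "6.0" "6.1" "7.0" "7.5" ]);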

@stites (Member, Author) commented Aug 20, 2019

buildPythonPackage is actually a bit too much for me right now; this has been added as a comment in https://github.com/stites/pytorch-world and will be slotted for "future work." All variants of pytorch-1.2 have been built and tested on python-3.6 -- cachix binaries are also available.

I'll update this PR this week -- I've also added myself as a maintainer for this package.

@stites stites changed the title pytorch: 1.0.0 -> 1.2.0 (WIP, depends on #61347) pytorch: 1.0.0 -> 1.2.0 (depends on #61347) Aug 29, 2019
@stites (Member, Author) commented Aug 29, 2019

I've gone ahead and pushed up the latest changes; pytorch-1.2.0 is fully functional! Note that this is still blocked by #61347 and I can squash the commits if desired, but this PR now touches openmpi (adding CUDA support to it) and magma (adding MKL support to it).

Final notes:

  • I've added a cudaArchList because of the notes above regarding sandboxing -- this should allow users to test on the latest GPU hardware without waiting for this package to be updated. The arch lists haven't been vetted, and I just noticed a minute ago that cuda-9 was broken, so I've split it up into -9 and -10 versions. This logic feels dirty.
  • I've added dev outputs -- this is a quick hack that gives users like hasktorch (@tscholak) the generated C/C++ backend (thanks @cleverca22!). I haven't tested this, but you have to use python36Packages.callPackage (example here). There's probably a better way to do this (possibly promoting pytorch to top-level and using the "Adjust Build Options" instructions).
  • Because the original derivation mutates numpy via passthrough, it is very easy to get a "duplicate numpy package found" error. I am thinking it would be safer to assert numpy.blasImplementation == "mkl" when mklSupport is set (see the sketch below), but I also used the same trick for magma-mkl support and openmpi-cuda support, so ¯\_(ツ)_/¯.
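
A sketch of that assertion (hypothetical; mklSupport is the flag added in this PR):

  # Hypothetical guard: fail evaluation early instead of silently overriding
  # numpy and risking two numpy instances in the closure.
  assert mklSupport -> numpy.blasImplementation == "mkl";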

This is ready for review. Again, not sure if I need to squash all of this or split up the PR into parts. Because I just found out how to properly support cuda-9, cachix binaries are pending -- I'll comment one last time to ping folks who want to test-drive this (although only the vanilla cuda-9 build is in there).

@stites (Member, Author) commented Oct 17, 2019

Rebased! After this PR, I'll start on 1.3.0.

@ofborg ofborg bot requested a review from teh October 17, 2019 21:25
@teh (Contributor) commented Oct 20, 2019

@jonringer good to merge?

@jonringer (Contributor)

Even with my 3900X, this takes a while to build, maxing out all 24 threads >.>

@jonringer (Contributor) left a comment

Is caffe2 supposed to be bundled with pytorch?

[nix-shell:/home/jon/.cache/nix-review/pr-65041-1]$ tree -L 4 ./results/python37Packages.pytorchWithCuda/
./results/python37Packages.pytorchWithCuda/
├── bin
│   ├── convert-caffe2-to-onnx
│   └── convert-onnx-to-caffe2
└── lib
    └── python3.7
        └── site-packages
            ├── caffe2
            ├── torch
            └── torch-1.2.0.dist-info

If so, please add click to the propagatedBuildInputs:

[nix-shell:/home/jon/.cache/nix-review/pr-65041-1]$ ./results/python37Packages.pytorchWithCuda/bin/convert-caffe2-to-onnx --help
Traceback (most recent call last):
  File "/nix/store/cbrg1ldydrl4wb6v4xpkkfh2jbmhwaiv-python3.7-pytorch-1.2.0/bin/.convert-caffe2-to-onnx-wrapped", line 7, in <module>
    from caffe2.python.onnx.bin.conversion import caffe2_to_onnx
  File "/nix/store/5ck5ncbzq8pr9aynzyp3j2y0bhk0v42i-python3.7-pytorch-1.2.0/lib/python3.7/site-packages/caffe2/python/onnx/bin/conversion.py", line 11, in <module>
    import click
ModuleNotFoundError: No module named 'click'

@jonringer (Contributor)

closes #69424

@stites (Member, Author) commented Oct 20, 2019

I believe caffe2 is used for the "production-ready" features of pytorch.

I've added click to the propagatedBuildInputs. My ML box is currently down (just moved offices), so this is a bit of an optimistic push.
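
For reference, a minimal sketch of the change (the surrounding inputs are elided here, not removed):

  propagatedBuildInputs = [
    click  # needed by the convert-caffe2-to-onnx / convert-onnx-to-caffe2 scripts
    # ...existing inputs unchanged...
  ];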

@jonringer (Contributor)

I'm building and testing now. If it seems to work in the trivial case, I'll merge :)

@jonringer (Contributor)

@GrahamcOfBorg eval

@jonringer (Contributor)

Not sure how much you use onnx, but I added a CPU-only version to nixpkgs in #67542.

@jonringer (Contributor) left a comment

nix-review passes on NixOS (python38Packages.pytest is broken)
diff LGTM

[nix-shell:/home/jon/.cache/nix-review/pr-65041-2]$ python
Python 3.7.4 (default, Jul  8 2019, 18:31:06)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.2.0'
[3 built (2 failed), 0.0 MiB DL]
error: build of '/nix/store/zqv58qrag64v4db7ljcmk62r3w4vjjd7-env.drv' failed
https://github.com/NixOS/nixpkgs/pull/65041
3 package failed to build:
python38Packages.pytorch python38Packages.pytorchWithCuda python38Packages.torchvision

4 package were build:
magma python37Packages.pytorch python37Packages.pytorchWithCuda python37Packages.torchvision

The caffe2 executables still need some love, but I'll do that in a different PR, seeing as this one has been going on for close to 3 months.

@jonringer (Contributor) left a comment

Oops, to conform to CONTRIBUTING.md, please break out the work into the following commits:

maintainers: add tscholak
openmpi: enable cuda support
magma: use cudatoolkit
python3Packages.pytorch: 1.0.0 -> 1.2.0

@stites (Member, Author) commented Oct 23, 2019

Quick ping to mention that this change was made!

@jonringer (Contributor)

I might wait for #71780 to get in, as it has significant improvements to python38. And it's really hard to not have 480 commits bit-rot fast.

@jonringer (Contributor)

fridh said he would de-conflict the merge at a later time.

@jonringer jonringer merged commit dfa8e8b into NixOS:master Oct 23, 2019
@stites stites deleted the pytorch_11 branch October 23, 2019 19:09
@stites (Member, Author) commented Oct 24, 2019

Heads-up: an initial build of pytorch-1.3.0 succeeds on python3.7 with and without CUDA. Tests are failing due to some dependency issues.

I'm also about to be very busy for the next week -- expect a PR after.

@jonringer (Contributor)

No sweat, take your time man. We appreciate your effort :)

@stites (Member, Author) commented Oct 24, 2019

Vanilla pytorch-1.3.0 builds with a bumped tensorboard-2.0 (incoming PR). Triggering a full dependency grid sweep now with tests.

Looks like there is still an issue in the grid sweep. pytorch-1.3 will be delayed a bit.

@danieldk (Contributor) commented Dec 6, 2019

I think 2947c23 should also be backported to 19.09? magma builds with CUDA on 19.09 fail with:

In file included from /nix/store/lzwx11791zlw77swqakrzjrkm5mz84cq-cudatoolkit-10.0.130/include/cuda_runtime.h:83,
                 from <command-line>:
/nix/store/lzwx11791zlw77swqakrzjrkm5mz84cq-cudatoolkit-10.0.130/include/crt/host_config.h:129:2: error: #error -- unsupported GNU version! gcc versions later than 7 are not supported!
 #error -- unsupported GNU version! gcc versions later than 7 are not supported!
  ^~~~~
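
The usual fix for this error is to compile the CUDA parts with a GCC that cudatoolkit-10.0 accepts (gcc <= 7); a hypothetical sketch (whether 2947c23 does exactly this is not confirmed here):

  # Hypothetical workaround: build magma with an older-GCC stdenv so nvcc's
  # host-compiler check passes.
  magma = callPackage ../development/libraries/science/math/magma {
    stdenv = gcc7Stdenv;  # attribute name assumed
  };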

stites added a commit to stites/pytorch-world that referenced this pull request Dec 17, 2019

Successfully merging this pull request may close these issues.

pytorchWithCuda fails to run pytorch-1.1.0