Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python3Packages.pytorch: 1.2.0 -> 1.4.1, python3Packages.ignite: 0.2.1 -> 0.3.0 #75827

Merged
merged 4 commits into from May 9, 2020

Conversation

stites
Copy link
Member

@stites stites commented Dec 17, 2019

Motivation for this change

Update pytorch to 1.3.1. A detailed commit history can be seen here.
Check phase verified on python36.pytorch, python36.pytorchWithMkl, python36.pytorchWithCuda10, python36.pytorchWithCuda10Mkl.

Cachix pending (my machine with the keys is down).

Relevant changelog:

  • buildDocs flag added
  • buildNamedTensor is now true by default
  • adds experimental useNixProtobuf but disables this functionality.
  • uses CMake directly for pytorch builds
To build

I believe the following should work:

git remote add --fetch stites git@github.com:stites/nixpkgs.git
git checkout --track stites pytorch-1.3.1
nix-shell -I nixpkgs=$(pwd) -p "python37.withPackages(ps: with ps; [ pytorch ])"
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.
Notify maintainers

cc @teh @thoughtpolice @stites @tscholak @bhipple

@stites stites marked this pull request as ready for review December 18, 2019 01:27
@stites stites force-pushed the pytorch-1.3.1 branch 2 times, most recently from cb1d7a0 to f138203 Compare December 18, 2019 01:35
Copy link
Member Author

@stites stites left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the nix-protobuf code (it will just live in pytorch-world for now). Rebuilding everything again -- I just had to deal with some faulty RAM slots so updates will have a slightly faster turnaround again.

@bcdarwin
Copy link
Member

bcdarwin commented Feb 2, 2020

fwiw, 1.4 is now out.

@danieldk
Copy link
Contributor

danieldk commented Feb 3, 2020

fwiw, 1.4 is now out.

There may soon also be a 1.4.1 release that is compatible with gcc 9:

pytorch/pytorch#32277

@stites stites changed the title pytorch: 1.2.0 -> 1.3.1 pytorch: 1.2.0 -> 1.4.0 Feb 3, 2020
@stites
Copy link
Member Author

stites commented Feb 3, 2020

@bcdarwin @danieldk -- I've bumped this PR to 1.4.0. I'm fairly confident this derivation remains bug-free since I ran the full test suite on the alpha branch a few days before the 1.4.0 release. Unfortunately, I don't have time to run the entire build matrix again.

@jonringer
Copy link
Contributor

@GrahamcOfBorg build python2Packages.pytorch
@GrahamcOfBorg build python3Packages.pytorch
@GrahamcOfBorg build python38Packages.pytorch

@jonringer
Copy link
Contributor

please address failures :(

@jonringer
Copy link
Contributor

jonringer commented Feb 3, 2020

looks like a lot of failures are related to -fpermissive. I forgot the exact steps, but you can lower it from an error to a warning.

../c10/util/LeftRight.h:67:10: error: ‘typename std::result_of<F(const T&)>::type c10::LeftRight<T>::read(F&&) const [with F = c10::Dispatcher::doCallUnboxedOnly(const c10::DispatchTable&, const c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction> >&, Args ...) const [with Return = std::tuple<at::Tensor, at::Tensor>; Args = {const at::Tensor&, long int, long int, bool, bool}]::<lambda(const ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>&)>; T = ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>; typename std::result_of<F(const T&)>::type = std::tuple<at::Tensor, at::Tensor>]’, declared using local type ‘c10::Dispatcher::doCallUnboxedOnly(const c10::DispatchTable&, const c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction> >&, Args ...) const [with Return = std::tuple<at::Tensor, at::Tensor>; Args = {const at::Tensor&, long int, long int, bool, bool}]::<lambda(const ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>&)>’, is used but never defined [-fpermissive]
../c10/util/LeftRight.h:67:10: error: ‘typename std::result_of<F(const T&)>::type c10::LeftRight<T>::read(F&&) const [with F = c10::Dispatcher::doCallUnboxed(const c10::DispatchTable&, const c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction> >&, Args ...) const [with Return = at::Tensor; Args = {const at::Tensor&, c10::Scalar, long int, c10::Scalar}]::<lambda(const ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>&)>; T = ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>; typename std::result_of<F(const T&)>::type = at::Tensor]’, declared using local type ‘c10::Dispatcher::doCallUnboxed(const c10::DispatchTable&, const c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction> >&, Args ...) const [with Return = at::Tensor; Args = {const at::Tensor&, c10::Scalar, long int, c10::Scalar}]::<lambda(const ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>&)>’, is used but never defined [-fpermissive]

@bcdarwin
Copy link
Member

bcdarwin commented Feb 3, 2020

On x86 Ubuntu with Cuda support, the package builds but is broken:

$ nix-shell -I nixpkgs=~/nixpkgs -p 'python3.withPackages (ps: with ps; [ pytorch ])' --pure --run 'python3 "import torch"'
...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/nix/store/3mkr7x854wslwpdl0zv9kg4k1k8icirc-python3-3.7.5-env/lib/python3.7/site-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: /nix/store/3mkr7x854wslwpdl0zv9kg4k1k8icirc-python3-3.7.5-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t

Possibly setting USE_SYSTEM_NCCL=true doesn't work with the current NCCL/Pytorch.

Seems fine without Cuda but didn't run the test suite.

@bhipple bhipple mentioned this pull request May 3, 2020
10 tasks
@bhipple
Copy link
Contributor

bhipple commented May 3, 2020

@GrahamcOfBorg build python37Packages.pytorch

@stites I pushed an update to your PR with the following changes:

  • Pass blas.provider into buildInputs, so that CMake can find the actual
    mkl for inspection of its cmake files and headers.

  • Add USE_MKL correctly when the blas provider is mkl.

  • Use the MKLDNN and MKLDNN_CBLAS flags by default, since mkldnn is FOSS and
    always available..

  • Remove a patch for MKL 2019, since we've moved to 2020.

  • Set doCheck to true; this does not actually appear to take as long as it
    used to. (10-15 minutes for me.)

  • Removed some unused variables at the top of the file

I've managed to build this successfully with both the FOSS stack and with mkl as the blas provider; take a look and LMK what you think.

Copy link
Contributor

@jonringer jonringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about regressions, but at the very least, we should disable for python38

builder for '/nix/store/a70hd44hmh22wxsr0w7sl3wnrfgfxpjb-python3.7-ignite-0.2.1.drv' failed with exit code 1; last 10 log lines:
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
    /nix/store/m4i2i1ax2ch5y1ql8ng7f014mrmk63ja-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:342: DeprecationWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
      "please use `get_last_lr()`.", DeprecationWarning)

  -- Docs: https://docs.pytest.org/en/latest/warnings.html
  ===== 4 failed, 275 passed, 4 skipped, 72 deselected, 20 warnings in 6.45s =====
builder for '/nix/store/zyxn5nxx22jwdj5xhvxx173ngkms6ryi-python3.8-pytorch-1.4.1.drv' failed with exit code 1; last 10 log lines:
  ----------------------------------------------------------------------
  Ran 2276 tests in 93.630s

  FAILED (failures=1, skipped=66, expected failures=1)
  Traceback (most recent call last):
    File "test/run_test.py", line 455, in <module>
      main()
    File "test/run_test.py", line 448, in main
      raise RuntimeError(message)
  RuntimeError: test_jit failed!
cannot build derivation '/nix/store/iraap11hf3k4yax6bzpmnmrif3pzli5l-python3.8-ignite-0.2.1.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/8qdsp4ga63vmwbq8w0b1vf33avb848rx-python3.8-tensorly-0.4.5.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/qkmv9p848yh1l62cn9yy3wb6id0hh1ac-python3.8-torchvision-0.2.1.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/11waqaqv5xq4276f1npv1dmpj31hgprx-python3.8-pywick-0.5.6.drv': 2 dependencies couldn't be built
builder for '/nix/store/c8jvyrqblrk4a95rlnv5ji142d7sv33m-python3.7-pytorch-1.4.1.drv' failed with exit code 1; last 10 log lines:
  wrapping `/nix/store/28cbbgvg7mm81l4q7qac7pmmqycjizj2-python3.7-pytorch-1.4.1/bin/convert-caffe2-to-onnx'...
  Executing pythonRemoveTestsDir
  Finished executing pythonRemoveTestsDir
  running install tests
  Traceback (most recent call last):
    File "test/run_test.py", line 14, in <module>
      import torch
    File "/nix/store/28cbbgvg7mm81l4q7qac7pmmqycjizj2-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/__init__.py", line 81, in <module>
      from torch._C import *
  ImportError: /nix/store/28cbbgvg7mm81l4q7qac7pmmqycjizj2-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t
builder for '/nix/store/f2lxq1zcw1zbc4q3snnskdaadm5vyw2x-python3.8-pytorch-1.4.1.drv' failed with exit code 1; last 10 log lines:
  wrapping `/nix/store/q9diyfdsl34cy1iqcnkg7c6g650aif52-python3.8-pytorch-1.4.1/bin/convert-caffe2-to-onnx'...
  Executing pythonRemoveTestsDir
  Finished executing pythonRemoveTestsDir
  running install tests
  Traceback (most recent call last):
    File "test/run_test.py", line 14, in <module>
      import torch
    File "/nix/store/q9diyfdsl34cy1iqcnkg7c6g650aif52-python3.8-pytorch-1.4.1/lib/python3.8/site-packages/torch/__init__.py", line 81, in <module>
      from torch._C import *
  ImportError: /nix/store/q9diyfdsl34cy1iqcnkg7c6g650aif52-python3.8-pytorch-1.4.1/lib/python3.8/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t
cannot build derivation '/nix/store/rqcqydv29f29gn957kfb5vysx4ppgnsv-env.drv': 8 dependencies couldn't be built
[0 built (4 failed), 0.0 MiB DL]
error: build of '/nix/store/rqcqydv29f29gn957kfb5vysx4ppgnsv-env.drv' failed
https://github.com/NixOS/nixpkgs/pull/75827
2 package marked as broken and skipped:
python37Packages.pyro-ppl python38Packages.pyro-ppl

8 package failed to build:
python37Packages.ignite python37Packages.pytorchWithCuda python38Packages.ignite python38Packages.pytorch python38Packages.pytorchWithCuda python38Packages.pywick python38Packages.tensorly python38Packages.torchvision

4 package built:
python37Packages.pytorch python37Packages.pywick python37Packages.tensorly python37Packages.torchvision

@bhipple
Copy link
Contributor

bhipple commented May 5, 2020

Hmm, I think I'll go back to just not running the test suite -- particularly since users of MKL may be running without a binary cache and need to wait for the package to build themselves.

@jonringer
Copy link
Contributor

or just reduce the checkphase to something simple. I would be fine with some pythonImportsCheck = [ "torch" ];.

As long as the maintainer verified that it worked for more in-depth cases locally

@stites
Copy link
Member Author

stites commented May 8, 2020

👍 thanks for bumping this pr @bhipple !

stites and others added 2 commits May 9, 2020 13:30
Co-authored-by: Benjamin Hipple <bhipple@protonmail.com>
- Pass `blas.provider` into `buildInputs`, so that CMake can find the actual
  `mkl` for inspection of its cmake files and headers.

- Add `USE_MKL` correctly when the blas provider is `mkl`.

- Use the MKLDNN and MKLDNN_CBLAS flags by default, since `mkldnn` is FOSS and
  always available..

- Remove a patch for MKL 2019, since we've moved to 2020.

- Add a pythonImportsCheck for "torch" as a basic sanity-check

- Removed some unused variables at the top of the file
@bhipple
Copy link
Contributor

bhipple commented May 9, 2020

Result of nixpkgs-review pr 75827 1

2 packages marked as broken and skipped:
- python37Packages.pyro-ppl
- python38Packages.pyro-ppl
4 packages failed to build:
- python37Packages.ignite
- python37Packages.pytorchWithCuda
- python38Packages.ignite
- python38Packages.pytorchWithCuda
8 packages built:
- python37Packages.pytorch (python37Packages.pytorchWithoutCuda)
- python37Packages.pywick
- python37Packages.tensorly
- python37Packages.torchvision
- python38Packages.pytorch (python38Packages.pytorchWithoutCuda)
- python38Packages.pywick
- python38Packages.tensorly
- python38Packages.torchvision

@bhipple bhipple changed the title pytorch: 1.2.0 -> 1.4.1 python3Packages.pytorch: 1.2.0 -> 1.4.1, python3Packages.ignite: 0.2.1 -> 0.3.0 May 9, 2020
@bhipple
Copy link
Contributor

bhipple commented May 9, 2020

pytorchWithCuda is broken on master; help requested, but since it's not a regression relative to this PR, and since this PR has significant improvements to pytorch, I propose merging as-is and fixing cuda in another PR.

ignite required an upgrade to keep it building.

@GrahamcOfBorg eval

@bhipple
Copy link
Contributor

bhipple commented May 9, 2020

One last change: in #85839 I changed the name of dnnl to oneDNN to track an upstream rename. That PR has a back-compat alias as dnnl, but it looks like OfBorg builds its eval without aliases (which is probably a good thing in general).

Updated pytorch to reference dnnl by the new name, but made no other material change.

@ofborg ofborg bot requested a review from bcdarwin May 9, 2020 18:55
@bhipple
Copy link
Contributor

bhipple commented May 9, 2020

Result of nixpkgs-review pr 75827 1

2 packages failed to build:
- python37Packages.pytorchWithCuda
- python38Packages.pytorchWithCuda
10 packages built:
- python37Packages.ignite
- python37Packages.pytorch (python37Packages.pytorchWithoutCuda)
- python37Packages.pywick
- python37Packages.tensorly
- python37Packages.torchvision
- python38Packages.ignite
- python38Packages.pytorch (python38Packages.pytorchWithoutCuda)
- python38Packages.pywick
- python38Packages.tensorly
- python38Packages.torchvision

@bhipple bhipple merged commit 3d9f3c3 into NixOS:master May 9, 2020
@bhipple bhipple mentioned this pull request May 9, 2020
10 tasks
@Zhen-hao
Copy link

Zhen-hao commented May 9, 2020

great work! thank you all!
I've been waiting for this

@Zhen-hao
Copy link

what is the best way to get pytorchWithCuda working on NixOS?
currently, the build fails with

    from torch._C import *
ImportError: /nix/store/a29i85licwqsbvysansiyd67zwlkianb-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t

for now I will try to follow #75827 (comment)

@jonringer
Copy link
Contributor

that looks like a linking error to cuda. Not super familiar with the cuda toolchain to know where an assumption is being broken

@bhipple
Copy link
Contributor

bhipple commented May 10, 2020

If you manage to get it working, please do send a PR!

@bhipple
Copy link
Contributor

bhipple commented May 12, 2020

Interesting note: when we upgraded pytorch from 1.0.0 -> 1.2.0, we must've leaked a proprietary dependency into the default expression, because Hydra stopped building it. That's now been fixed with this PR, so we once again have binary cache builds of the default pytorch:

https://hydra.nixos.org/job/nixpkgs/trunk/python38Packages.pytorch.x86_64-linux/all

@Zhen-hao
Copy link

what is the best way to get pytorchWithCuda working on NixOS?
currently, the build fails with

    from torch._C import *
ImportError: /nix/store/a29i85licwqsbvysansiyd67zwlkianb-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t

for now I will try to follow #75827 (comment)

I tried a few things but couldn't make it work with the current version.
However, it builds with CUDA when I change the version to 1.5.0 and remove the patches list. I tested it with the examples in transformers' quickstart guide and it worked fine with my GPU. My build uses cudnn_cudatoolkit_10_2.
I think it was bug in pytorch 1.4.1 and they fixed in 1.5.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet