Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tensorflow: update to 2.3.0 #95824

Merged

Conversation

matthewbauer
Copy link
Member

@matthewbauer matthewbauer commented Aug 20, 2020

Motivation for this change
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@drewrisinger
Copy link
Contributor

AFAICT touches similar files as #95736 (probably supersedes). That one only updates to 2.2.0, and also adds a new package which should be able to be separated out.

@danieldk
Copy link
Contributor

danieldk commented Aug 20, 2020

Seems like the scipy version bound also needs to be relaxed:

Executing pipInstallPhase                                                                      
/build/tensorflow-2.3.0-python/dist /build/tensorflow-2.3.0-python         
Processing ./tensorflow-2.3.0-cp38-cp38-linux_x86_64.whl
ERROR: Could not find a version that satisfies the requirement scipy==1.4.1 (from tensorflow==2.3.0) (from versions: none)
ERROR: No matching distribution found for scipy==1.4.1 (from tensorflow==2.3.0)
builder for '/nix/store/wk0lcdb4cp8hhzdcp4ibqddzv2xkv85m-python3.8-tensorflow-2.3.0.drv' failed with exit code 1
error: build of '/nix/store/wk0lcdb4cp8hhzdcp4ibqddzv2xkv85m-python3.8-tensorflow-2.3.0.drv' failed

@raboof
Copy link
Member

raboof commented Aug 21, 2020

Needed a ton of memory to build, but after that indeed seems to work!

Copy link
Contributor

@jonringer jonringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of noise, don't know what's a regression

https://github.com/NixOS/nixpkgs/pull/95824
40 packages failed to build:
poretools python27Packages.Keras python27Packages.annoy python27Packages.awkward python27Packages.h5py python27Packages.h5py-mpi python27Packages.keras-applications python27Packages.phonopy python27Packages.tensorflow-bin python27Packages.tensorflow-bin_2 python27Packages.uproot python27Packages.uproot-methods python27Packages.worldengine python37Packages.arviz python37Packages.baselines python37Packages.clifford python37Packages.dm-sonnet python37Packages.edward python37Packages.graph_nets python37Packages.mask-rcnn python37Packages.optuna python37Packages.phonopy python37Packages.pymc3 python37Packages.qasm2image python37Packages.qiskit python37Packages.qiskit-aqua python37Packages.rl-coach python37Packages.sumo python37Packages.tensorflow python37Packages.tensorflow-bin python37Packages.tensorflow-bin_2 python37Packages.tensorflow-probability python37Packages.tensorflowWithCuda python37Packages.tflearn python38Packages.clifford python38Packages.phonopy python38Packages.qasm2image python38Packages.qiskit python38Packages.qiskit-aqua python38Packages.sumo

51 packages built:
python27Packages.tensorflow-estimator_2 python37Packages.Keras python37Packages.annoy python37Packages.awkward python37Packages.bayespy python37Packages.caffe python37Packages.dcmstack python37Packages.dicom2nifti python37Packages.dipy python37Packages.h5netcdf python37Packages.h5py python37Packages.h5py-mpi python37Packages.heudiconv python37Packages.hickle python37Packages.keras-applications python37Packages.nibabel python37Packages.nilearn python37Packages.nipy python37Packages.nipype python37Packages.nitime python37Packages.pybids python37Packages.pywick python37Packages.tensorflow_2 python37Packages.tensorflow-estimator_2 python37Packages.uproot python37Packages.uproot-methods python37Packages.worldengine python38Packages.Keras python38Packages.annoy python38Packages.awkward python38Packages.bayespy python38Packages.caffe python38Packages.dicom2nifti python38Packages.dipy python38Packages.h5netcdf python38Packages.h5py python38Packages.h5py-mpi python38Packages.hickle python38Packages.keras-applications python38Packages.nibabel python38Packages.nilearn python38Packages.nipy python38Packages.nipype python38Packages.pybids python38Packages.pywick python38Packages.tensorflow_2 python38Packages.tensorflow-estimator_2 python38Packages.uproot python38Packages.uproot-methods worldengine-cli visidata

@jonringer
Copy link
Contributor

looks like the h5py bump is the cause of a lot of the regressions for python2, but I don't really care about keep python2 green...

@danieldk
Copy link
Contributor

Builds succesfully with CUDA. Our GPUs are too busy to test anything serious training, but a small sanity check works. (We do use impure CUDA dependencies though, since the machine is running Ubuntu.)

Copy link
Contributor

@danieldk danieldk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with CUDA 11, python37 and python38, trained a small model, worked perfectly!

@jonringer
Copy link
Contributor

the changes look good, but do you mind cleaning up the git history?
I looks like ~7 actual changes.

Also, the commit messages should adhere to CONTRIBUTING.md, specifically (my template):

To comply with CONTRIBUTING.md please have the commit message name be of the format

<pkg-name>: <subject-line>

for more examples, please look at https://github.com/NixOS/nixpkgs/blob/master/.github/CONTRIBUTING.md#submitting-changes

Copy link
Contributor

@jonringer jonringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what is or isn't noise:

https://github.com/NixOS/nixpkgs/pull/95824
25 packages failed to build:
python37Packages.arviz python37Packages.baselines python37Packages.clifford python37Packages.dm-sonnet python37Packages.edward python37Packages.graph_nets python37Packages.hickle python37Packages.mask-rcnn python37Packages.optuna python37Packages.phonopy python37Packages.pymc3 python37Packages.pywick python37Packages.rl-coach python37Packages.sumo python37Packages.tensorflow python37Packages.tensorflow-bin python37Packages.tensorflow-bin_2 python37Packages.tensorflow-probability python37Packages.tensorflowWithCuda python37Packages.tflearn python38Packages.clifford python38Packages.hickle python38Packages.phonopy python38Packages.pywick python38Packages.sumo

53 packages built:
python27Packages.tensorflow-estimator_2 python37Packages.Keras python37Packages.annoy python37Packages.awkward python37Packages.bayespy python37Packages.caffe python37Packages.dcmstack python37Packages.dicom2nifti python37Packages.dipy python37Packages.h5netcdf python37Packages.h5py python37Packages.h5py-mpi python37Packages.heudiconv python37Packages.keras-applications python37Packages.nibabel python37Packages.nilearn python37Packages.nipy python37Packages.nipype python37Packages.nitime python37Packages.pybids python37Packages.qasm2image python37Packages.qiskit python37Packages.qiskit-aqua python37Packages.tensorflow_2 python37Packages.tensorflow-estimator_2 python37Packages.uproot python37Packages.uproot-methods python37Packages.worldengine python38Packages.Keras python38Packages.annoy python38Packages.awkward python38Packages.bayespy python38Packages.caffe python38Packages.dicom2nifti python38Packages.dipy python38Packages.h5netcdf python38Packages.h5py python38Packages.h5py-mpi python38Packages.keras-applications python38Packages.nibabel python38Packages.nilearn python38Packages.nipy python38Packages.nipype python38Packages.pybids python38Packages.qasm2image python38Packages.qiskit python38Packages.qiskit-aqua python38Packages.tensorflow_2 python38Packages.tensorflow-estimator_2 python38Packages.uproot python38Packages.uproot-methods worldengine-cli visidata

@danieldk
Copy link
Contributor

not sure what is or isn't noise:

Seems like all of these are misfires because nixpkgs-review believes that the tensorflow attribute is also affected by this PR? It seems like we are not really using Tensorflow 2 anywhere yet:

❯ git grep tensorflow_2
pkgs/top-level/python-packages.nix:  tensorflow_2 = self.tensorflow-build_2;
❯ git grep tensorflow-build_2
pkgs/servers/home-assistant/parse-requirements.py:    "tensorflow-build_2": "tensorflow",
pkgs/top-level/python-packages.nix:  tensorflow-build_2 = callPackage ../development/python-modules/tensorflow/2 {
pkgs/top-level/python-packages.nix:  tensorflow_2 = self.tensorflow-build_2;

(Unless I missed something)

@jonringer
Copy link
Contributor

(Unless I missed something)

There's 16 commits. h5py, gets changed as well, and tensorflow_1 has a dependency on it through keras

@danieldk
Copy link
Contributor

danieldk commented Aug 28, 2020

There's 16 commits. h5py, gets changed as well, and tensorflow_1 has a dependency on it through keras

Ah right! Checking Hydra:

  • arviz fails because tensorflow1 fails.
  • baseline fails, because gym fails due to a missing wheel dependency or wrong constraint.
  • clifford fails, due to a missing wheel dependency or wrong constraint.
  • dm-sonnet fails because tensorflow1 fails.
  • edward fails because tensorflow1 fails.
  • graph_nets fails because tensorflow1 fails.
  • hickle fails because astropy is missing as a dependency (or has wrong contraints).
  • mask-rcnn fails because tensorflow1 fails.
  • optuna fails because the import check for the datatable dependency fails
  • phonopy fails because spglib is missing as a dependency (or has wrong contraints).
  • pymc3 fails because tensorflow1 fails.
  • pywick fails because hickle fails.
  • rl-coach is already marked broken.
  • sumo fails because phonopy fails.
  • tensorflow fails because it requires an older version of numpy

Of course, the h5py bump could have caused extra breakage that we do not see for the existing reasons.

@jyp
Copy link
Contributor

jyp commented Sep 8, 2020

This version does not seem to detect the GPU: the test give the following warning

running install tests
2020-09-08 14:18:26.149228: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2020-09-08 14:18:26.149249: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

@FRidh
Copy link
Member

FRidh commented Sep 8, 2020

tensorflow_1 should still be removed because its broken (numpy version) and can't be fixed

@danieldk
Copy link
Contributor

danieldk commented Sep 8, 2020

tensorflow_1 should still be removed because its broken (numpy version) and can't be fixed

I made a PR to mark tensorflow(-bin)?_1 broken, #96288. We should then probably check which packages that depend on Tensorflow 1 have an upgrade path to Tensorflow 2. Though that would be easier to do if this PR is finished and then merged.

@danieldk
Copy link
Contributor

danieldk commented Sep 8, 2020

This version does not seem to detect the GPU: the test give the following warning

running install tests
2020-09-08 14:18:26.149228: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2020-09-08 14:18:26.149249: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Interesting, I am getting that error as well, but it works fine after the build. The library has the correct run path:

~/git/nixpkgs % nix-store -qR result | grep tensorflow
/nix/store/9izlv1wfn49vlyj7j8qs8m494xmfxk2g-tensorflow-gpu-2.3.0
/nix/store/mrixac4pmfcbw1fxrr4yz21hvkkw3y5y-python3.8-tensorflow-estimator-2.3.0
/nix/store/crl90k2mx6jwb8f83ig6x4nqlj4inw4z-python3.8-tensorflow-gpu-2.3.0
~/git/nixpkgs % readelf -d /nix/store/crl90k2mx6jwb8f83ig6x4nqlj4inw4z-python3.8-tensorflow-gpu-2.3.0/lib/python3.8/site-packages/tensorflow/libtensorflow_framework.so.2 | grep RUNPATH
 0x000000000000001d (RUNPATH)            Library runpath: [/nix/store/p17p07xik81wk1x29a2090ch16v0rnas-cudatoolkit-11.0.3/lib:/nix/store/q2fff3393cbpg9lxplzqy36w450z57yg-cudatoolkit-11.0.3-lib/lib:/nix/store/avccwb3n8qp4pn9f2nrl47vfg1dycjf7-cudatoolkit-11.0-cudnn-8.0.2/lib:/nix/store/94f02852kwlckrm3m2f319l34mgk6h07-nccl-2.7.8-1-cuda-11.0/lib:/run/opengl-driver/lib:$ORIGIN/:/nix/store/2pi6zgkwnr3zdslvlv16nixpzvbyjx1n-glibc-2.31/lib:/nix/store/pb463zhmbbldgwrb6v79cpq07drwj8f0-gcc-9.3.0-lib/lib]

Not sure why it fails during the check phase, maybe it is not loading the module from $out and therefor doesn't use the patchelf-fixed version?

At any rate, even if this is fixed, the check would fail to run on the GPU, because /dev/nvidia* are not available in the sandbox.

Could you try if the module works with your GPU after the build?

@jyp
Copy link
Contributor

jyp commented Sep 10, 2020

@danieldk You're right, it is in fact able to find some CUDA libraries in my production environment. Unfortunately, this is not the version which corresponds to my kernel.
@matthewbauer Previously, I was able to override the version of CUDA used by tensorflow by overriding nvidia_x11 in linuxPackages. I was able to make it work by putting the appropriate links in /run/opengl-driver, but I am still getting the error:

tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Any idea could be helpful here.

@danieldk
Copy link
Contributor

Any idea could be helpful here.

Overriding should still be possible (didn't test it though), but you have to override cudatoolkit, cudnn, and nccl. Something like:

tensorflow_2.override {
    cudatoolkit = pkgs.cudatoolkit_10_1;
    cudnn = pkgs.cudnn_cudatoolkit_10_1;
    nccl = pkgs.nccl_cudatoolkit_10;
}

raboof and others added 4 commits September 10, 2020 23:34
Also:
- patch to remove scipy requirement
- add cuda to RPATH
- don’t include nvidia_x11 (This isn’t needed, we can get it from
/run/opengl-driver being in the RPATH.)

Co-authored-by: Arnout Engelen <arnout@bzzt.net>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
also disable on python 2.7
Co-authored-by: Jon <jonringer@users.noreply.github.com>
@matthewbauer matthewbauer force-pushed the python3.tensorflow_2-update-to-2.3.0 branch from d8ec6b4 to 59eecac Compare September 11, 2020 04:37
@matthewbauer
Copy link
Member Author

Updated with proper commit message formatting, and squashed.

@jyp Perhaps we should stay on cudatoolkit-10? I think that will resolve your issue; minimum nvidia driver is 450.51.06. This was added in 9edafbc.

@jyp
Copy link
Contributor

jyp commented Sep 15, 2020

I put libcuda.so.1 in /run/opengl-driver/lib:

$ ls -l /run/opengl-driver/lib/
total 0
lrwxrwxrwx. 1 root root 19 11 sep 09.36 libcuda.so.1 -> /lib64/libcuda.so.1

and it appears that the library can be loaded, but then it fails with:

Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: unknown error

Am I forgetting something in /run/opengl-driver?
Perhaps I forgot some links in /run

Edit: also, I did not override the cudatoolkit version.

@danieldk
Copy link
Contributor

danieldk commented Sep 19, 2020

On NixOS, RTX 2060 Super:

2020-09-19 19:57:08.549056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:2d:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.695GHz coreCount: 34 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-09-19 19:57:08.549072: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2020-09-19 19:57:08.549083: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2020-09-19 19:57:08.549093: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-19 19:57:08.549101: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-19 19:57:08.549110: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-19 19:57:08.549119: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2020-09-19 19:57:08.549128: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-09-19 19:57:08.549168: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-19 19:57:08.549480: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-19 19:57:08.549759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-19 19:57:08.549776: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nix/store/3ycm0sz7x91n5dkl41pwfzqw5ifbfgv2-python3-3.8.5-env/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 457, in _method_wrapper
[...]
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

I am going to compile with the same capabilities as upstream and see if that fixes this.

@danieldk
Copy link
Contributor

danieldk commented Sep 20, 2020

I am going to compile with the same capabilities as upstream and see if that fixes this.

That worked, so maybe we should switch to the same capabilities that upstream currently uses:

, cudaCapabilities ? [ "sm_35" "sm_50" "sm_60" "sm_70" "sm_75" "compute_80" ]

(We currently have 3.5 and 5.2. Even 5.2 is pretty old by now.)

@mkg20001 mkg20001 mentioned this pull request Sep 23, 2020
10 tasks
@mkg20001
Copy link
Member

There's also #98522 now which could be included in here which upgrades the tensorflow bin derivation

@mkg20001
Copy link
Member

mkg20001 commented Sep 23, 2020

Currently build fails with that on #98522

ERROR: Could not find a version that satisfies the requirement tensorboard<3,>=2.3.0 (from tensorflow-cpu==2.3.0) (from versions: none)
ERROR: No matching distribution found for tensorboard<3,>=2.3.0 (from tensorflow-cpu==2.3.0)

For that PR so both would have to be upgraded simultaneously

@mkg20001
Copy link
Member

There's also #95736 which upgrades bin to 2.2.0, but has some other changes

@danieldk
Copy link
Contributor

@jonringer I propose that we merge this now. This works on NixOS and it would be nice if we could backport this to 20.03 to have a working Tensorflow 2. Remaining issues:

@mkg20001
Copy link
Member

Ok, considering this is getting stale repeatedly and the current drv is broken anyways, I'm merging in 3 days

Any objections?

@danieldk
Copy link
Contributor

Let's do this: (1) it's a large improvement over master (which does not build); (2) the only other affected derivation is h5py; (3) 2.3.1 is out already (which also fixes a lot of CVEs); (4) multiple people have gone over this PR and @matthewbauer has addressed all the comments.

@danieldk danieldk merged commit b751c12 into NixOS:master Sep 30, 2020
@jonringer jonringer added the 8.has: port to stable A PR already has a backport to the stable release. label Oct 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants