Fixes for CUDA and Tensorflow update #31507

Merged
merged 3 commits into NixOS:master on Feb 21, 2018

Conversation

abbradar
Member

Motivation for this change

Fix problems with certain uses of cudatoolkit 8 and update TensorFlow (which depends on it). This PR also depends on #31497 -- I plan to merge as soon as that lands in master.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option build-use-sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Fits CONTRIBUTING.md.

cc @cstrahan -- you may be interested in the new buildBazelPackage. I hope we can incorporate your future Bazel fixes there. I also plan to build TensorBoard with it in the future (that's why it was made), but it turns out TensorBoard is even messier than TensorFlow :D

@CMCDragonkai
Member

Cool, I'll try this once it is merged. When it is merged, does it come with TensorBoard as well?

@abbradar
Member Author

@CMCDragonkai Yes, it includes an update.

@jyp
Contributor

jyp commented Nov 15, 2017

Unfortunately I could not build tensorflow/cuda with this PR. Here is the tail of the log:

Extracting Bazel installation...
You have bazel 0.7.0- (@non-git) installed.
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL support? [y/N]: No OpenCL support will be enabled for TensorFlow.

Do you want to use clang as CUDA compiler? [y/N]: nvcc will be used as CUDA compiler.

Please specify the MPI toolkit folder. [Default is /nix/store/ky47lxj3wdzij8lc8hz5gx3sx04247yh-openmpi-1.10.7]: 

Add "--config=mkl" to your bazel command to build with MKL support.
Please note that MKL on MacOS or windows is still not supported.
If you would like to use a local MKL instead of downloading, please set the environment variable "TF_MKL_ROOT" every time before build.
Configuration finished
building
...........
WARNING: Config values are not defined in any .rc file: opt, cuda.
____Loading package: tensorflow/tools/pip_package
ERROR: error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'cuda/build_defs.bzl': no such package '@local_config_cuda//cuda': Traceback (most recent call last):
	File "/tmp/nix-build-tensorflow-build-1.4.0-deps.drv-0/source/third_party/gpus/cuda_configure.bzl", line 1042
		_create_local_cuda_repository(repository_ctx)
	File "/tmp/nix-build-tensorflow-build-1.4.0-deps.drv-0/source/third_party/gpus/cuda_configure.bzl", line 905, in _create_local_cuda_repository
		_get_cuda_config(repository_ctx)
	File "/tmp/nix-build-tensorflow-build-1.4.0-deps.drv-0/source/third_party/gpus/cuda_configure.bzl", line 660, in _get_cuda_config
		_cuda_version(repository_ctx, cuda_toolkit_path, c...)
	File "/tmp/nix-build-tensorflow-build-1.4.0-deps.drv-0/source/third_party/gpus/cuda_configure.bzl", line 273, in _cuda_version
		auto_configure_fail(("Error running nvcc --version: ...))
	File "/tmp/nix-build-tensorflow-build-1.4.0-deps.drv-0/source/third_party/gpus/cuda_configure.bzl", line 129, in auto_configure_fail
		fail(("\n%sCuda Configuration Error:%...)))

Cuda Configuration Error: Error running nvcc --version: /nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `/nix/store/3cv0q9g9zyg5alzzz6072rhdmvzhk5yy-cuda-floatn.h/include': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `-isystem': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `/nix/store/3cv0q9g9zyg5alzzz6072rhdmvzhk5yy-cuda-floatn.h/include': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `-isystem': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/include': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `-isystem': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `/nix/store/xxkapzlxm7jfai8864npvdnkqrb7dqcj-jemalloc-4.5.0/include': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `-isystem': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `/nix/store/ky47lxj3wdzij8lc8hz5gx3sx04247yh-openmpi-1.10.7/include': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `-isystem': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `/nix/store/gpyaky34fdzbyvnj05lll4j2y9r6hrzz-cudatoolkit-8.0.61/include': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `-isystem': not a valid identifier
/nix/store/dmvacf0g2rma18a4wcq3f1a14sharwvf-cudatoolkit-8.0.61-unsplit/bin/nvcc: line 3: export: `/nix/store/p179s2c1s96vgayzigyvnlscgihmsn27-cudatoolkit-8.0-cudnn-6.0/include': not a valid identifier

.
builder for ‘/nix/store/10zzbjkizj4966rg1xzm3k3fycrcfib9-tensorflow-build-1.4.0-deps.drv’ failed with exit code 2
cannot build derivation ‘/nix/store/p6ca3y0mnyjwr4d4k4bp78xaqs4wxmah-tensorflow-build-1.4.0.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/awxj1rnxbffhfxmjn4yg6x8f1xjwnj6q-python3.6-tensorflow-1.4.0.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/16kdv1qcijp464rhh5x0ggb65fsy09wp-python3-3.6.3-env.drv’: 1 dependencies couldn't be built
error: build of ‘/nix/store/16kdv1qcijp464rhh5x0ggb65fsy09wp-python3-3.6.3-env.drv’ failed
/usr/local/bin/nix-shell: failed to build all dependencies
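
The repeated `export: ... not a valid identifier` lines are what bash prints when an `export` statement is followed by several unquoted words: the first word is taken as the assignment and every remaining word as a further variable name. The following is a minimal sketch of that behaviour, assuming the generated nvcc wrapper contained an unquoted flag string. The wrapper itself is not shown in this thread, and `DEMO_FLAGS` and the `/tmp` paths are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction: DEMO_FLAGS and the /tmp paths are illustrative
# names, not taken from the actual nvcc wrapper.

# Unquoted: only "DEMO_FLAGS=-isystem" is an assignment; bash then tries to
# export "/tmp/a/include", "-isystem" and "/tmp/b/include" as variable names.
if unquoted_err=$(LC_ALL=C bash -c 'export DEMO_FLAGS=-isystem /tmp/a/include -isystem /tmp/b/include' 2>&1); then
  unquoted_status=0
else
  unquoted_status=$?
fi

# Quoted: the whole flag string is the value of a single assignment.
quoted_val=$(bash -c 'export DEMO_FLAGS="-isystem /tmp/a/include -isystem /tmp/b/include"; printf "%s" "$DEMO_FLAGS"')

printf 'unquoted (exit status %s):\n%s\n' "$unquoted_status" "$unquoted_err"
printf 'quoted value: %s\n' "$quoted_val"
```

The unquoted form reports the path first and `-isystem` second, the same order as in the log above, which is consistent with (but does not prove) a quoting problem in the wrapper script.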

@abbradar
Member Author

abbradar commented Nov 15, 2017 via email

@jyp
Contributor

jyp commented Nov 15, 2017 via email

@jyp
Contributor

jyp commented Nov 20, 2017

My new build attempt fails with

output path ‘/nix/store/bwpd405zwcq15zirf8khd6i6gv6lw4ha-tensorflow-build-1.4.0-deps’ has r:sha256 hash ‘10k7i61ya33dcy98i0s7r8f1d4s4rwjl5myfyiyr46skjpzydxdv’ when ‘0sq0a7vsajzqwxgg82xw1q74n7vdq37n9d5z7p0c8gzpmyw7mgc9’ was expected

@abbradar
Member Author

abbradar commented Nov 20, 2017 via email

@andersk
Contributor

andersk commented Dec 23, 2017

Are you sure it’s a determinism problem? I just tried this and got the same error with the same hash as @jyp:

output path ‘/nix/store/bwpd405zwcq15zirf8khd6i6gv6lw4ha-tensorflow-build-1.4.0-deps’ has r:sha256 hash ‘10k7i61ya33dcy98i0s7r8f1d4s4rwjl5myfyiyr46skjpzydxdv’ when ‘0sq0a7vsajzqwxgg82xw1q74n7vdq37n9d5z7p0c8gzpmyw7mgc9’ was expected

@andersk
Contributor

andersk commented Feb 5, 2018

@abbradar Still working on this? TensorFlow 1.5 is available now, based on CUDA 9 and cuDNN 7.

@jyp
Contributor

jyp commented Feb 5, 2018

@abbradar has not responded to any messages for a couple of months.

@abbradar
Member Author

Sorry for the long absence, had some real life difficulties.

The hash may be the same for you, @andersk, but that means it has changed over time, and that is a problem. I can't reproduce it now; I'll rebase this work and check whether it works -- we cannot determine the cause of the nondeterminism until we hit it again.
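
To illustrate the reasoning about the hash: a checksum over fetched dependencies stays stable only while the fetched tree is byte-for-byte identical, so two people fetching on the same day can agree with each other and still disagree with a hash recorded earlier. The sketch below is a simplified stand-in, not how Nix computes its `r:sha256` output hash; `tree_hash` is a hypothetical helper and the files are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Nix or nixpkgs: hash the relative file
# names and contents of a directory tree in a stable order.
tree_hash() {
  ( cd "$1" && find . -type f -print0 | LC_ALL=C sort -z \
      | xargs -0 sha256sum | sha256sum | cut -d' ' -f1 )
}

work=$(mktemp -d)
mkdir -p "$work/a/dep" "$work/b/dep"

# Two independent "fetches" that happen to return identical content.
printf 'rule()\n' > "$work/a/dep/BUILD"
printf 'rule()\n' > "$work/b/dep/BUILD"
hash_a=$(tree_hash "$work/a")
hash_b=$(tree_hash "$work/b")

# Simulate drift: a later fetch returns one extra file.
printf 'changed upstream\n' > "$work/b/dep/marker"
hash_b_drift=$(tree_hash "$work/b")

printf 'fetch a:       %s\nfetch b:       %s\nb after drift: %s\n' \
  "$hash_a" "$hash_b" "$hash_b_drift"

rm -rf "$work"
```

The first two hashes match and the third differs, which mirrors the situation described here: identical results for @jyp and @andersk, but a mismatch against the hash recorded in the PR.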

A separate function for building Bazel-based packages. Internally it splits the
build into two phases, fetching and building.

Users are expected to provide `fetchArgs.sha256`, the checksum of the fetched
dependencies. Local dependencies should be removed in `fetchArgs.preInstall`.
More generally, `fetchArgs` and `buildArgs` can be used to add specific steps to
the fetch and build phases.
@abbradar
Member Author

I've bumped the TensorFlow version to 1.5 and TensorBoard to 1.5.1. It works fine with the MNIST test from Keras.

++ lib.optionals (pythonOlder "3.4") [ backports_weakref enum34 ]
++ lib.optional withTensorboard tensorflow-tensorboard;

# For some reason, CUDA is not retained in RPATH.
doInstallCheck = true;
Member

Considering that in buildPython* the checkPhase corresponds to installCheckPhase, I suggest removing doCheck = false; and setting checkPhase instead (with a comment, of course, that the actual tests are slow and impure).

Member Author

Fixed.

I'm building the Python 2 version now just in case. The shiny new Ryzen helps, but it still takes a while -- I'll update when it's done.

@abbradar
Member Author

Tested on Python 2 and Python 3.5 (not on 3.6, because there is no TensorBoard support there -- but it should work).

@FRidh FRidh merged commit 75df207 into NixOS:master Feb 21, 2018
@FRidh
Member

FRidh commented Feb 21, 2018

Thank you!

@CMCDragonkai
Member

Will this be ported to 17.09? Or is it unstable only?

@abbradar
Member Author

abbradar commented Mar 1, 2018

@CMCDragonkai I don't think we'll port this -- it's a version bump, which we don't backport unless there's a security issue or another serious reason. Better to stay wheel-based in the release.

7 participants