Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pythonPackages.tensorflow: 1.3.1 -> 1.4.1 #34420

Closed
wants to merge 1 commit into from

Conversation

jyp
Copy link
Contributor

@jyp jyp commented Jan 30, 2018

Also revert to a wheel-based build (the bazel-based build is broken
and it is unclear how to repair it)

Motivation for this change

Two motivations:

  • Tensorflow does not currently build in master
  • Upstream update
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option build-use-sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Fits CONTRIBUTING.md.

@jyp
Copy link
Contributor Author

jyp commented Feb 2, 2018

@FRidh Would you think that this version is acceptable? (@abbradar still has not given any sign of activity)

@lukeadams
Copy link
Contributor

lukeadams commented Feb 4, 2018

@jyp Apparently Arch is able to build 1.5x from git. Might try to dig into it this weekend – when I try to build 1.5, bazel throws errors about infinite symlinks due to the output directory being in the source tree. Setting the output dir to $(mktemp -d) fixes the first part.
My 1.5x branch
Build log pre outdir fix

Copy link
Contributor

@proger proger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work!

I did some minor tweaks to make your PR work on Darwin, here's the hello world output:

[nix-shell:~]$ ipython --quick
Python 3.6.4 (default, Jan 17 2018, 00:49:01) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf
/nix/store/wc4m3g4grivhy26sdck44jxwkvz1j17n-python3-3.6.4/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)

In [2]: hello = tf.constant('Hello, TensorFlow!')

In [3]: sess = tf.Session()
2018-02-09 18:21:45.788566: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

In [4]: print(sess.run(hello))
b'Hello, TensorFlow!'

The word2vec_basic example seems to work well too.

Tensorboard seems to be broken though:

[nix-shell:~]$ tensorboard
Traceback (most recent call last):
  File "/nix/store/jvckx7illg4s3y64443swlgbsmv0n1w6-python3.6-tensorflow-1.4.1/bin/.tensorboard-wrapped", line 8, in <module>
    from tensorboard.main import main
ModuleNotFoundError: No module named 'tensorboard'

# tensorflow depends on tensorflow_tensorboard, which cannot be
# cudatoolkit is split (see https://github.com/NixOS/nixpkgs/commit/bb1c9b027d343f2ce263496582d6b56af8af92e6)
# However this means that libcusolver is not loadable by tensor flow. So we undo the split here.
cudatoolkit_joined = symlinkJoin {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be wrapped into if cudaSupport or smth like that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that really so? I thought that nix had lazy evaluation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all derivation (or stdenv.mkDerivation?) attributes are forced to become environment variables in a generated build script.

if isPy3k then
dls.mac_py_2_cpu
else
dls.mac_py_3_cpu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be a mixup here.

cudatoolkit = pkgs.cudatoolkit8;
cudnn = pkgs.cudnn6_cudatoolkit8;
cudatoolkit = pkgs.cudatoolkit9;
cudnn = pkgs.cudnn_cudatoolkit9;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to remove these entirely so my darwin build goes through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from master --- I'd suggest to fix this in another patch.

@lukeadams
Copy link
Contributor

@jyp (continuation from #31492)

I have another PR with 1.4 which should be mergeable now. If you support the revert to a wheel-based build please comment about it in the PR.

I'm not against getting this merged in the short term, but ideally we would trace down why the bazel build fails (source build would make #31046 straightforward to resolve). Easy to undo this if/when we get build from source working.

This method is nice since Bazel on Darwin is broken.

@jyp
Copy link
Contributor Author

jyp commented Feb 10, 2018

@lukeadams "Ideally" yes. If someone can fix it I'll be happy to see the bazel build restored. In practice tensorflow has been broken for three months in master now, hence this PR.

@jyp
Copy link
Contributor Author

jyp commented Feb 10, 2018

@proger Tensorboard is incompatible with python 3.6. tensorflow/tensorboard#427

@jyp jyp force-pushed the tensorflow-wheel branch 2 times, most recently from c7552c0 to 81a868c Compare February 10, 2018 13:34
@jyp
Copy link
Contributor Author

jyp commented Feb 10, 2018

@proger Thanks a lot for your review. I acted on your suggestions but I've left the top-level definition as it was to minimize the diff with master. We can change that one separately. (It's hard enough to get the simplest PR merged.)

@lukeadams
Copy link
Contributor

lukeadams commented Feb 10, 2018

@jyp This patch is all that's necessary to build on Darwin:

From 5cbbb34a43cc431aeb40591042f3d61c31c88085 Mon Sep 17 00:00:00 2001
From: Luke Adams <luke.adams@belljar.io>
Date: Sat, 10 Feb 2018 10:18:50 -0600
Subject: [PATCH] tensorflow: move cudatoolkit_joined to top-level for clarity

allows for simplification of postFixup
---
 .../python-modules/tensorflow/default.nix          | 24 +++++++++++-----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/pkgs/development/python-modules/tensorflow/default.nix b/pkgs/development/python-modules/tensorflow/default.nix
index 5c5d98e7ace..ff34d02c4fa 100644
--- a/pkgs/development/python-modules/tensorflow/default.nix
+++ b/pkgs/development/python-modules/tensorflow/default.nix
@@ -31,8 +31,18 @@ assert ! (stdenv.isDarwin && cudaSupport);
 # project's build system is an arcane beast based on
 # bazel. Untangling it and building the wheel from source is an open
 # problem.
+let
+  # cudatoolkit is split (see https://github.com/NixOS/nixpkgs/commit/bb1c9b027d343f2ce263496582d6b56af8af92e6)
+  # However this means that libcusolver is not loadable by tensor flow. So we undo the split here.
+  cudatoolkit_joined = symlinkJoin {
+    name = "unsplit_cudatoolkit";
+    paths = [
+      cudatoolkit.out
+      cudatoolkit.lib
+    ];
+   };
 
-buildPythonPackage rec {
+in buildPythonPackage rec {
   pname = "tensorflow";
   version = "1.4.1";
   name = "${pname}-${version}";
@@ -82,17 +92,7 @@ buildPythonPackage rec {
   # patchelf --shrink-rpath will remove the cuda libraries.
   postFixup = let
     rpath = stdenv.lib.makeLibraryPath
-      (if cudaSupport then
-        # cudatoolkit is split (see https://github.com/NixOS/nixpkgs/commit/bb1c9b027d343f2ce263496582d6b56af8af92e6)
-        # However this means that libcusolver is not loadable by tensor flow. So we undo the split here.
-        (let cudatoolkit_joined = symlinkJoin {
-          name = "unsplit_cudatoolkit";
-          paths = [ cudatoolkit.out
-                    cudatoolkit.lib ];};
-         in [ stdenv.cc.cc.lib zlib cudatoolkit_joined cudnn nvidia_x11 ])
-      else
-        [ stdenv.cc.cc.lib zlib ]
-      );
+      ([ stdenv.cc.cc.lib zlib ] ++ lib.optionals cudaSupport [ cudatoolkit_joined cudnn nvidia_x11 ]);
   in
   ''
     rrPath="$out/${python.sitePackages}/tensorflow/:${rpath}"
-- 
2.15.1

Also Tensorboard can be used by using an older python version, e.g.
nix-shell -p python34Packages.tensorflow-tensorboard
I just tested and you can combine this with python36Packages.tensorflow without issue

@jyp
Copy link
Contributor Author

jyp commented Feb 19, 2018

I have applied the patch of @lukeadams

Could this PR be considered for merging? @FRidh ?

Also revert to a wheel-based build (the bazel-based build is broken
and it is unclear how to repair it)
@jyp
Copy link
Contributor Author

jyp commented Feb 26, 2018

Tensorflow is still not building ( #31492 ), even after merging the PR of @abbradar . Can we please merge this PR until the bazel build works? @FRidh ?

@abbradar
Copy link
Member

That's very strange as I'm using it right now. Can you show your error?

@abbradar
Copy link
Member

abbradar commented Feb 26, 2018

Ah, I see failure on Hydra (it was down today morning so I didn't check it first). The error is very strange; for now my theory is that something in nixpkgs has updated since the last successful build (https://hydra.nixos.org/build/69832610). Currently tensorflow doesn't build because of unrelated reasons on master; I'll see if I can reproduce the issue on an older commit.

@jyp
Copy link
Contributor Author

jyp commented Feb 26, 2018

Thanks! Please report progress on #31492, I'll ping you from there.

@lukeadams
Copy link
Contributor

Just wanted to mention: building from source prevents the use of TF on Darwin until #30590 / #32080 are fixed.

@abbradar
Copy link
Member

abbradar commented Feb 27, 2018

@lukeadams Yeah, I want to tackle that next -- or return a wheel-based build specifically for Darwin.

@abbradar
Copy link
Member

abbradar commented Mar 7, 2018

I've tried to set up a Mac OS X VirtualBox machine but apparently AMD processors are not supported, and on an image from https://forum.amd-osx.com/ mouse and keyboard do not work.

@LnL7 Can you perhaps set a Darwin Hydra job for my tensorflow branch again? For now I've reset it to be just master.

@FRidh
Copy link
Member

FRidh commented Mar 7, 2018

@abbradar in case it is not possible to get the source builds working entirely on all supported archs, I think we should still provide the wheel-based solution, at least for 18.03, for all archs.

In that case, @jyp, I recommend including the script you used as an actual script in the tree. Also, instead of bothering to write .nix you can just output .json.

@abbradar
Copy link
Member

abbradar commented Mar 7, 2018

@FRidh Yeah, that's my opinion too -- I thought I'd try to tackle this one on holidays before giving up completely.

@abbradar
Copy link
Member

@jyp I feel too defeated by Darwin for now D: Can you rebase this pull request and make it be used only for Darwin? I can do it by myself if you don't have enough time. Thanks in advance!

@jyp
Copy link
Contributor Author

jyp commented Mar 13, 2018

@abbradar I am completely swamped at the moment so it's great if you can take this (I am not even a darwin user btw). Additionally I wouldn't even know how to apply the advice of @FRidh.

@abbradar
Copy link
Member

@jyp Ah, for some reason I thought you had a Darwin machine too, sorry. I did this myself in #37044

@lukeadams Maybe you can help me test it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants