tensorflow: 1.5.0 -> 1.8.0 #40689

Closed. mboes wants to merge 1 commit.

mboes (Contributor) commented May 17, 2018

Motivation for this change

Upgrade to latest Tensorflow. Required by #40525.

cc @zimbatm @xeji

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option build-use-sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Fits CONTRIBUTING.md.

xeji (Contributor) commented May 17, 2018

@GrahamcOfBorg build python36Packages.tensorflow python36Packages.tensorflowWithCuda python27Packages.tensorflow python27Packages.tensorflowWithCuda

GrahamcOfBorg commented

No attempt on aarch64-linux

The following builds were skipped because they don't evaluate on aarch64-linux: python36Packages.tensorflow, python36Packages.tensorflowWithCuda, python27Packages.tensorflow, python27Packages.tensorflowWithCuda

Partial log:


a) For `nixos-rebuild` you can set
  { nixpkgs.config.allowUnfree = true; }
in configuration.nix to override this.

b) For `nix-env`, `nix-build`, `nix-shell` or any other Nix command you can add
  { allowUnfree = true; }
to ~/.config/nixpkgs/config.nix.


NixOS deleted three comments from GrahamcOfBorg on May 17, 2018.

GrahamcOfBorg commented

Success on x86_64-darwin

Attempted: python36Packages.tensorflow

The following builds were skipped because they don't evaluate on x86_64-darwin: python36Packages.tensorflowWithCuda, python27Packages.tensorflow, python27Packages.tensorflowWithCuda

Partial log:

copying path '/nix/store/8jcsbs0m9hklh01mmh2ydnhw1mrgsy6i-python3.6-numpy-1.14.2' from 'https://cache.nixos.org'...
copying path '/nix/store/w665fdfvvprhpag3vfg59g25hazdf4gb-python3.6-pytz-2018.3' from 'https://cache.nixos.org'...
copying path '/nix/store/4lly6wh5li3s6q2rkk8zx2w1nm8q1pkm-python3.6-six-1.11.0' from 'https://cache.nixos.org'...
copying path '/nix/store/2150jqzv1zzgw1vzd4i2sfwkk53k98aw-python3.6-absl-py-0.1.13' from 'https://cache.nixos.org'...
copying path '/nix/store/qagjinn7pv78a1zpnwr8n01aamm67p2g-python3.6-python-dateutil-2.6.1' from 'https://cache.nixos.org'...
copying path '/nix/store/7sqbq59fwl4ypbmmjg6wzb5xdj3w651c-python3.6-python-gflags-3.1.2' from 'https://cache.nixos.org'...
copying path '/nix/store/7ac4qpvzpgb7g8p8mfbgslvh0r4990xi-python3.6-google-apputils-0.4.1' from 'https://cache.nixos.org'...
copying path '/nix/store/f8p64j7zkp9rggsxgxn5gg6psihvm5v1-python3.6-protobuf-3.5.1.1' from 'https://cache.nixos.org'...
copying path '/nix/store/ng3xslz56j2j16m8c27hi22rxi4wjnkd-python3.6-tensorflow-1.5.0' from 'https://cache.nixos.org'...
/nix/store/ng3xslz56j2j16m8c27hi22rxi4wjnkd-python3.6-tensorflow-1.5.0

NixOS deleted a comment from GrahamcOfBorg on May 17, 2018.

GrahamcOfBorg commented

Failure on x86_64-linux

Attempted: python36Packages.tensorflow, python27Packages.tensorflow

The following builds were skipped because they don't evaluate on x86_64-linux: python36Packages.tensorflowWithCuda, python27Packages.tensorflowWithCuda

Partial log:

Building: no action
WARNING: /build/output/external/grpc/WORKSPACE:1: Workspace name in /build/output/external/grpc/WORKSPACE (@com_github_grpc_grpc) does not match the name given in the repository's definition (@grpc); this will cause a build error in future versions
Building: no action
Building: no action
Building: no action
installing
fixed-output derivation produced path '/nix/store/jbf3q0c1slc4nv5cl19a3m3pkwl3lcnb-tensorflow-build-1.8.0-deps' with sha256 hash '1n0iz90pl0dcqbl8fscx3hfsxmxbph4xjq42nnyx453vsvxrfna2' instead of the expected hash '1fczzfhcg1va18rdmj9zgc11ah619pl8bny6hw51c51kbxr9fskc'
cannot build derivation '/nix/store/xg4dj9hg4zzwc8a17n06vimzp0v4z77f-tensorflow-build-1.8.0.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/nnim8qqg0zpy17463gj7plspp88x7v2r-python3.6-tensorflow-1.8.0.drv': 1 dependencies couldn't be built
error: build of '/nix/store/cxxgv4nrvnz4102l2vynmiimrr0lnv1k-python2.7-tensorflow-1.8.0.drv', '/nix/store/nnim8qqg0zpy17463gj7plspp88x7v2r-python3.6-tensorflow-1.8.0.drv' failed

NixOS deleted two comments from GrahamcOfBorg on May 17, 2018.

xeji (Contributor) commented May 17, 2018

(deleted ofborg results were caused by my incorrect build command, sorry for the noise)

@mboes please fix the build for x86_64-linux (or does this require bazel 0.13 now?)

mboes (Contributor, Author) commented May 17, 2018

@xeji fixed. Previously I had used the sha256 corresponding to the output of a build with Bazel v0.13; I've now set it to the hash produced when building with Bazel 0.12. IIUC we'll have to update it again when Bazel v0.13 gets merged.

Incidentally, even when building with --option build-use-sandbox true, I'm seeing reproducibility issues: supplying the wrong output hash sometimes still works locally.

xeji (Contributor) commented May 17, 2018

@GrahamcOfBorg build python36Packages.tensorflow python27Packages.tensorflow

GrahamcOfBorg commented

No attempt on aarch64-linux

The following builds were skipped because they don't evaluate on aarch64-linux: python36Packages.tensorflow, python27Packages.tensorflow

Partial log:


a) For `nixos-rebuild` you can set
  { nixpkgs.config.allowUnfree = true; }
in configuration.nix to override this.

b) For `nix-env`, `nix-build`, `nix-shell` or any other Nix command you can add
  { allowUnfree = true; }
to ~/.config/nixpkgs/config.nix.


GrahamcOfBorg commented

Success on x86_64-darwin

Attempted: python36Packages.tensorflow

The following builds were skipped because they don't evaluate on x86_64-darwin: python27Packages.tensorflow

Partial log:

a) For `nixos-rebuild` you can set
  { nixpkgs.config.allowBroken = true; }
in configuration.nix to override this.

b) For `nix-env`, `nix-build`, `nix-shell` or any other Nix command you can add
  { allowBroken = true; }
to ~/.config/nixpkgs/config.nix.


/nix/store/ng3xslz56j2j16m8c27hi22rxi4wjnkd-python3.6-tensorflow-1.5.0

GrahamcOfBorg commented

Failure on x86_64-linux

Attempted: python36Packages.tensorflow, python27Packages.tensorflow

Partial log:

Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
ERROR: error loading package '': Encountered error while reading extension file 'closure/defs.bzl': no such package '@io_bazel_rules_closure//closure': Error downloading [https://mirror.bazel.build/github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz, https://github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz] to /build/output/external/io_bazel_rules_closure/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz: All mirrors are down: [Unknown host: github.com, Unknown host: mirror.bazel.build]
ERROR: error loading package '': Encountered error while reading extension file 'closure/defs.bzl': no such package '@io_bazel_rules_closure//closure': Error downloading [https://mirror.bazel.build/github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz, https://github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz] to /build/output/external/io_bazel_rules_closure/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz: All mirrors are down: [Unknown host: github.com, Unknown host: mirror.bazel.build]
INFO: Elapsed time: 2.282s
FAILED: Build did NOT complete successfully (0 packages loaded)
builder for '/nix/store/zxnzd0mcwl1i52ymcdph6q7g5y01jpqa-tensorflow-build-1.8.0.drv' failed with exit code 1
cannot build derivation '/nix/store/y9i0c51asi1zw735bs5bfriixl18s9xy-python2.7-tensorflow-1.8.0.drv': 1 dependencies couldn't be built
error: build of '/nix/store/5ahivx3pzdjpjd2mpxnbkpfc35g9gjzn-python3.6-tensorflow-1.8.0.drv', '/nix/store/y9i0c51asi1zw735bs5bfriixl18s9xy-python2.7-tensorflow-1.8.0.drv' failed

xeji changed the title from "tensorflow: 1.5.0 -> 18.0" to "tensorflow: 1.5.0 -> 1.8.0" on May 17, 2018.

zimbatm (Member) commented May 18, 2018

ERROR: error loading package '': Encountered error while reading extension file 'closure/defs.bzl': no such package '@io_bazel_rules_closure//closure': Error downloading [https://mirror.bazel.build/github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz, https://github.com/bazelbuild/rules_closure/archive/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz] to /build/output/external/io_bazel_rules_closure/08039ba8ca59f64248bb3b6ae016460fe9c9914f.tar.gz: All mirrors are down: [Unknown host: github.com, Unknown host: mirror.bazel.build]

What is the best way to pre-populate these external fetches done by Bazel? Nix doesn't allow the downloading because of the sandbox.

globin (Member) commented Jun 7, 2018

cc @abbradar

mboes (Contributor, Author) commented Jun 23, 2018

Shelving this for now.

mboes closed this on Jun 23, 2018.

grwlf (Contributor) commented Jun 28, 2018

Just a note: I've independently attempted to build TensorFlow 1.8 from source, using the older Bazel 0.10, and ran into an issue related to hardening:

NixOS's gcc wrapper adds the -D_FORTIFY_SOURCE macro to every gcc invocation unless hardeningDisable=all is present in the environment. Our build expression sets this variable, but unfortunately TF 1.8 prepends env - to every Bazel action, which clears the environment, so hardening becomes re-enabled. This eventually leads to a "macro redefined" warning from gcc, and finally -Werror does its job and stops the show. Ridiculous.

I couldn't find a way around env - in the Bazel rules, but I think building a special gcc without hardening might help. What is the correct way to do that in NixOS? Probably something related to the wrapCC function. I'd be glad to hear opinions.
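The mechanism described above can be reproduced without Bazel. In this minimal POSIX shell sketch (run_in_empty_env is a hypothetical helper, not part of Bazel or TF), a variable exported by the caller disappears in a child started through env -, and reaches the child only when it is passed explicitly after env -:

```shell
#!/bin/sh
# Minimal sketch: why hardeningDisable=all exported by the build expression
# is lost once an action is launched through "env -".

# run_in_empty_env is a hypothetical helper for this demo. "env -" (the same
# as "env -i") starts the child with an empty environment; only NAME=VALUE
# pairs passed as arguments ("$@") are added back before the command.
run_in_empty_env() {
  env - "$@" /bin/sh -c 'echo "${hardeningDisable:-<unset>}"'
}

export hardeningDisable=all

echo "parent shell:      $hardeningDisable"
echo "via env -:         $(run_in_empty_env)"
echo "passed explicitly: $(run_in_empty_env hardeningDisable=all)"
```

Running it prints all, then <unset>, then all again. This is why setting hardeningDisable outside the action has no effect: the variable has to be set by something that runs inside the action itself.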

uri-canva (Contributor) commented

Did you use plain bazel, or bazel with enableNixHacks (build-bazel-package has the hacks enabled)? Bazel does a lot of sandboxing when building, and Nix, ironically, requires the tools it invokes not to do their own sandboxing, so the bazel derivation has some hacks you can enable to weaken Bazel's sandbox. They don't remove it completely, though, so there may still be situations where the environment isn't propagated, or similar.

I've hit similar issues, and I think there are two different ways forward when using Bazel in Nix:

  1. Embrace the Bazel sandboxing, don't attempt to interfere with it, and pass the Nix overrides using the --action_env, --copt, --linkopt, etc. Bazel flags.
  2. Embrace the hacks, and patch Bazel extensively so that it invokes only Nix tools and propagates the environment to them.

I'm leaning towards (1), but that means bazel should use unwrapped tools in nix, not the wrappers.
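A rough sketch of option (1), assuming the Nix-specific compiler and linker settings are available as whitespace-separated strings in the hypothetical variables EXTRA_COPTS and EXTRA_LINKOPTS. It only assembles a Bazel command line from the real --action_env, --copt and --linkopt flags; the function name and the DRY_RUN switch (print the command instead of running it) are illustrative assumptions, not part of any existing Nix or Bazel tooling.

```shell
#!/bin/sh
# Sketch of option (1): leave Bazel's sandbox alone and hand the Nix-specific
# settings to Bazel through its own flags. EXTRA_COPTS, EXTRA_LINKOPTS and
# DRY_RUN are hypothetical variables used only for this illustration.
bazel_nix_build() (
  # The body runs in a subshell, so "set -f" and "set --" stay local.
  set -f                      # don't glob-expand the flag lists below
  target="$1"
  # --action_env=PATH forwards the caller's PATH to Bazel actions
  # (cc actions are an exception, per bazelbuild/bazel#3642).
  set -- bazel build --action_env=PATH
  # Split each list on whitespace: one --copt / --linkopt per flag.
  for flag in ${EXTRA_COPTS:-}; do set -- "$@" "--copt=$flag"; done
  for flag in ${EXTRA_LINKOPTS:-}; do set -- "$@" "--linkopt=$flag"; done
  set -- "$@" "$target"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$@"                 # print the command instead of running it
  else
    "$@"
  fi
)
```

For example, DRY_RUN=1 EXTRA_COPTS="-O2 -g" bazel_nix_build //tensorflow/tools/pip_package:build_pip_package prints bazel build --action_env=PATH --copt=-O2 --copt=-g //tensorflow/tools/pip_package:build_pip_package. Flags that themselves contain spaces cannot be expressed with this simple whitespace splitting.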

grwlf (Contributor) commented Jun 29, 2018

I turned nixHacks off, but I don't think caching or sandboxing is the cause of my errors. Bazel seems to respond correctly to setting CC and CXX. Regarding your options:

  • --action_env doesn't seem to work in TensorFlow's case because TF prepends exec env - PATH=... PWD=... to every action, so the environment is almost empty by the time GCC runs. This is not a Bazel problem.
  • --copt and --linkopt might help, but unfortunately they seem to ignore an extra -U_FORTIFY_SOURCE.

Looks like this problem was very carefully planted :) I'll keep trying to wrap GCC in a way that disables hardening. Passing an unwrapped CC causes some problems with execve that I don't understand yet.

uri-canva (Contributor) commented

By TF, do you mean the normal Bazel build process, or does TF do something special? exec env - PATH=... PWD=... is what Bazel always does, but you can pass environment variables through using --action_env, except for cc actions (see bazelbuild/bazel#3642, for example).

You can unwrap the various layers and test things using nix-shell and Bazel's -s, --verbose_failures and --sandbox_debug flags.
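A debugging run along those lines might look like the sketch below. The three flags are the ones named above; the helper name bazel_debug_build and the DRY_RUN switch are hypothetical, and running it from inside nix-shell is an assumption about the setup.

```shell
#!/bin/sh
# Sketch: rerun a Bazel target with extra diagnostics, e.g. from inside
# nix-shell. bazel_debug_build and DRY_RUN are hypothetical names.
bazel_debug_build() {
  # -s                  print each command Bazel executes (short for --subcommands)
  # --verbose_failures  print the full command line of a failing action
  # --sandbox_debug     keep the sandbox contents after the build and print
  #                     extra sandboxing details
  if [ -n "${DRY_RUN:-}" ]; then
    echo bazel build -s --verbose_failures --sandbox_debug "$@"
  else
    bazel build -s --verbose_failures --sandbox_debug "$@"
  fi
}
```

Used as, e.g., bazel_debug_build //tensorflow/tools/pip_package:build_pip_package; the (cd ... && exec env - ...) command listings quoted in this thread look like the kind of output -s produces.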

grwlf (Contributor) commented Jun 29, 2018

Edit: I assumed this was TensorFlow-specific, since --action_env had no visible effect (and I didn't expect Bazel's designers to implement such non-intuitive behavior). Probably I was wrong. By "cc actions", do you mean executing GCC? If so, then it may be exactly my case.

I was able to overcome the "_FORTIFY_SOURCE redefined" -Werror problem with the following hand-made GCC wrapper:

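    gccnh = stdenv.mkDerivation {
      name = "gccnh";
      buildCommand = ''
        . $stdenv/setup
        mkdir -p $out/bin

        # Symlink the remaining tools from the standard (wrapped) compiler unchanged.
        for prog in as  c++  cc  cpp  g++  ld  ld.bfd  ld.gold ; do
          ln -s ${stdenv.cc}/bin/$prog $out/bin/$prog
        done

        # Replace only gcc: the script exports hardeningDisable=all and then
        # execs the wrapped gcc, so the setting is in place even after Bazel's "env -".
        cat >$out/bin/gcc <<"EOF"
        #!/bin/sh
        export hardeningDisable=all
        exec "${stdenv.cc}/bin/gcc" "''${extraFlagsArray[@]}" "$@"
        EOF
        chmod +x $out/bin/gcc
        '';
    };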

I used this wrapper manually, inside nix-shell, as follows:

$ CC="$gccnh/bin/gcc" bazel build $bazelTarget

I'm not sure whether hardening was the only problem (Bazel is still working on the build).

grwlf (Contributor) commented Jun 29, 2018

No luck so far. The new error is about numpy. TensorFlow passes a PYTHON_LIB_PATH variable pointing to a location that doesn't contain numpy (note: I am using nix-shell). Unfortunately, this variable doesn't support PATH-style colon separation, so it is unclear how to pass both the Python core and numpy.

 (cd /run/user/1048/.cache/bazel/_bazel_mironov/ca2d6ccbae40c057e3bf9480a539f8c0/execroot/org_tensorflow && \
  exec env - \
    PATH=... \
    PYTHON_BIN_PATH=/nix/store/ql9052zdcpx0a74d5g85d2qrnjw0hxmz-python3-3.6.5/bin/python3.6m \
    PYTHON_LIB_PATH=/run/user/1048/site-packages \
    TF_DOWNLOAD_CLANG=0 \
    TF_NEED_CUDA=0 \
    TF_NEED_OPENCL_SYCL=0 \
  /nix/store/f2vw9r78fhaq15rcyvllzz2ayafd5n0z-bash/bin/bash ...

...

  File "/run/user/1048/.cache/bazel/_bazel_mironov/ca2d6ccbae40c057e3bf9480a539f8c0/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/tools/api/generator/create_python_api.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 47, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

globin (Member) commented Jun 29, 2018

@grwlf you can use (python3.withPackages (ps: with ps; [ numpy ]))

uri-canva (Contributor) commented

For the gcc flags, see #42832 for an example of configuring a Bazel toolchain. That makes the Nix hacks and wrappers redundant, at least as far as Bazel goes.

grwlf (Contributor) commented Jul 8, 2018

I tried applying @globin's Python environment advice, without success. Basically, the problem starts when Bazel executes create_python_api with the following command:

 (cd /home/grwlf/tmp/out/execroot/org_tensorflow && \
  exec env - \
    LD_LIBRARY_PATH=/run/opengl-driver/lib:/run/opengl-driver-32/lib \
    PATH=... \
    PYTHON_BIN_PATH=/nix/store/zrbzl421n5lsm84b6drwn08i10wwjbvj-python3-3.6.5-env/bin/python \
    PYTHON_LIB_PATH=/nix/store/zrbzl421n5lsm84b6drwn08i10wwjbvj-python3-3.6.5-env/lib/python3.6/site-packages \
    TF_DOWNLOAD_CLANG=0 \
    TF_NEED_CUDA=0 \
    TF_NEED_OPENCL_SYCL=0 \
  /nix/store/f2vw9r78fhaq15rcyvllzz2ayafd5n0z-bash/bin/bash -c '
    source external/bazel_tools/tools/genrule/genrule-setup.sh; 
    bazel-out/host/bin/tensorflow/tools/api/generator/create_python_api 
         bazel-out/k8-py3-opt/genfiles/tensorflow/tools/api/generator/api/__init__.py
         < lots of TF python files here >'

Internally, create_python_api sets up an environment and launches org_tensorflow/tensorflow/tools/api/generator/create_python_api.py. While setting up, it seems to ignore PYTHON_LIB_PATH (which does contain numpy) and instead reads PYTHONPATH, which Bazel has cleared. I hoped that the Python interpreter from PYTHON_BIN_PATH would find numpy in its own tree, but that is not the case. Later, create_python_api.py crashes with this error:

Traceback (most recent call last):
  File "/home/grwlf/tmp/out/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/tools/api/generator/create_python_api.runfiles/org_tensorflow/tensorflow/tools/api/generator/create_python_api.py", line 26, in <module>
    from tensorflow.python.util import tf_decorator
  File "/home/grwlf/tmp/out/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/tools/api/generator/create_python_api.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 47, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

At the moment, it is not clear to me where numpy is expected to come from. Should TF already have its own copy of numpy, or must the build environment provide it? If the latter, how should one do that, given that Bazel cleans up the environment?
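One way to narrow this down is to check whether the interpreter named in PYTHON_BIN_PATH can import numpy on its own, in an environment as empty as the one Bazel creates. The sketch below is a diagnostic aid, not a fix; the helper name is hypothetical, and it assumes PYTHON_BIN_PATH holds the interpreter path shown in the command above.

```shell
#!/bin/sh
# Diagnostic sketch: can this interpreter import a module when it is started
# with an empty environment (no PYTHONPATH), as inside a Bazel action?
# can_import_in_empty_env is a hypothetical helper name.
can_import_in_empty_env() {
  python_bin="$1"   # e.g. the value of PYTHON_BIN_PATH
  module="$2"       # e.g. numpy
  # "env -" drops PYTHONPATH and every other inherited variable; the exit
  # status of the import becomes the function's result.
  env - "$python_bin" -c "import $module" >/dev/null 2>&1
}
```

For example, can_import_in_empty_env "$PYTHON_BIN_PATH" numpy && echo "numpy found" shows whether that interpreter finds numpy without any help from PYTHONPATH.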

grwlf (Contributor) commented Jul 8, 2018


7 participants