Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python.pkgs.tensorflow: 1.14.0 -> 1.15.0 #71282

Merged
merged 6 commits into from
Nov 13, 2019

Conversation

timokau
Copy link
Member

@timokau timokau commented Oct 17, 2019

Motivation for this change

Continuation of #70910.
fixes #71578.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

Sorry, something went wrong.

@timokau timokau requested a review from FRidh as a code owner October 17, 2019 11:52
@timokau timokau changed the title Tensorflow 1.15.0 python.pkgs.tensorflow: 1.14.0 -> 1.15.0 Oct 17, 2019
@timokau
Copy link
Member Author

timokau commented Oct 17, 2019

I have decided to keep the default to no-cuda for now, unless there are any objections.

@timokau
Copy link
Member Author

timokau commented Oct 17, 2019

This should also fix the tensorflow build, which was broken by #69252.

@timokau
Copy link
Member Author

timokau commented Oct 17, 2019

Unfortunately the build still doesn't quite work with bazel 1.0. Upstream is tracking this here: tensorflow/tensorflow#33240

@timokau timokau added the 2.status: work-in-progress This PR isn't done label Oct 17, 2019
@timokau timokau changed the title python.pkgs.tensorflow: 1.14.0 -> 1.15.0 [WIP] python.pkgs.tensorflow: 1.14.0 -> 1.15.0 Oct 17, 2019
@timokau timokau added the 2.status: wait-for-upstream Waiting for upstream fix (or their other action). label Oct 17, 2019
@flokli
Copy link
Contributor

flokli commented Oct 21, 2019

Upstream is working on a fix in tensorflow/tensorflow#33415.

@coreyoconnor
Copy link
Contributor

Did a quick pass at applying this patch. Unfortunately, I'm unable to build this with upstream's fix applied. Will continue looking into it some later day.

Adding to here FYI:

on that nixpkgs fork the build fails with:

/nix/store/jy6zl5dkw5rcjczm1arjqxz95l9d40i7-binutils-2.31.1/bin/ld: bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): in function `std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorflow::ApiDef>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorflow::ApiDef> >, std::__deta
il::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail
::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<
char>, std::allocator<char> > const, tensorflow::ApiDef>, true>*)':
op_gen_lib.cc:(.text._ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N10tensorflow6ApiDefEESaISA_ENSt8__detail10_Select1stESt8equal_toIS5_ESt4hashIS5_ENSC_18_Mod_range_hashingENSC_20_Default_ranged_hashENSC_20_Prime_reha
sh_policyENSC_17_Hashtable_traitsILb1ELb0ELb1EEEE21_M_insert_unique_nodeEmmPNSC_10_Hash_nodeISA_Lb1EEE[_ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N10tensorflow6ApiDefEESaISA_ENSt8__detail10_Select1stESt8equal_toIS5_
ESt4hashIS5_ENSC_18_Mod_range_hashingENSC_20_Default_ranged_hashENSC_20_Prime_rehash_policyENSC_17_Hashtable_traitsILb1ELb0ELb1EEEE21_M_insert_unique_nodeEmmPNSC_10_Hash_nodeISA_Lb1EEE]+0xe1): undefined reference to `tensorflow::ApiDef::~ApiDef()'
/nix/store/jy6zl5dkw5rcjczm1arjqxz95l9d40i7-binutils-2.31.1/bin/ld: bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): in function `std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, s
td::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorflow::ApiDef>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorflow::ApiDef> >, st
d::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std:
:__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
op_gen_lib.cc:(.text._ZNSt8__detail9_Map_baseINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS6_N10tensorflow6ApiDefEESaISB_ENS_10_Select1stESt8equal_toIS6_ESt4hashIS6_ENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_po
licyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS8_[_ZNSt8__detail9_Map_baseINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS6_N10tensorflow6ApiDefEESaISB_ENS_10_Select1stESt8equal_toIS6_ESt4hashIS6_ENS_18_Mod_range_hashingENS_20_Default
_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS8_]+0xa8): undefined reference to `tensorflow::ApiDef::ApiDef()'
/nix/store/jy6zl5dkw5rcjczm1arjqxz95l9d40i7-binutils-2.31.1/bin/ld: bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): in function `std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorflow::ApiDef>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorflow::ApiDef> >, std::__deta
il::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail
::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()':
op_gen_lib.cc:(.text._ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N10tensorflow6ApiDefEESaISA_ENSt8__detail10_Select1stESt8equal_toIS5_ESt4hashIS5_ENSC_18_Mod_range_hashingENSC_20_Default_ranged_hashENSC_20_Prime_reha
sh_policyENSC_17_Hashtable_traitsILb1ELb0ELb1EEEED2Ev[_ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_N10tensorflow6ApiDefEESaISA_ENSt8__detail10_Select1stESt8equal_toIS5_ESt4hashIS5_ENSC_18_Mod_range_hashingENSC_20_Defa
ult_ranged_hashENSC_20_Prime_rehash_policyENSC_17_Hashtable_traitsILb1ELb0ELb1EEEED5Ev]+0x28): undefined reference to `tensorflow::ApiDef::~ApiDef()'
/nix/store/jy6zl5dkw5rcjczm1arjqxz95l9d40i7-binutils-2.31.1/bin/ld: bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): in function `tensorflow::ApiDefMap::ApiDefMap(tensorflow::OpList const&)':
op_gen_lib.cc:(.text._ZN10tensorflow9ApiDefMapC2ERKNS_6OpListE+0xcc): undefined reference to `tensorflow::ApiDef::ApiDef()'
....

@coreyoconnor
Copy link
Contributor

Ignore my prior comment. Used an out-of-date version of this pull request. Re-testing.

@coreyoconnor
Copy link
Contributor

Re-tested using master + this patch, with adjustments to the deps SHA. On our build farm, both cuda and non-cuda fail to compile with similar link errors:

bazel-out/host/bin/tensorflow/python/gen_nn_ops_py_wrappers_cc: symbol lookup error: /build/output/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/../../_solib_k8/_U_S_Stensorflow_Spython_Cgen_Unn_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined symbol: _ZN10tensorflow16ProtoDebugStringB5cxx11ERKNS_16DeviceAttributesE

Unfortunate that the tensorflow build is a monolith. :( Nothing for nixpkgs to resolve afaict. Kinda surprising tho. Pain to test/build.

@timor
Copy link
Member

timor commented Nov 11, 2019

@coreyoconnor This seems to stem from protobuf being an alias for protobuf3_7. Tensorflow-1.15.0 does seem to require protobuf3_6 though.

Relevant upstream commit: tensorflow/tensorflow@f92a0e6

Edit... or maybe not. I have a hard time telling which code is from which project here... Actually, it seems to be that they actually upgraded protobuf to 3.8.0 in v1.15.0, but that commit above is not present in 1.15.0

@timokau
Copy link
Member Author

timokau commented Nov 12, 2019

Thanks to @scentini in tensorflow/tensorflow#33415, it builds now!

Things left to do:

  • some sanity checking
  • fill in the cuda hash

If anybody else wants to do some testing, now is the time.

@timokau timokau force-pushed the tensorflow-1.15.0 branch 2 times, most recently from 7352080 to 62108eb Compare November 12, 2019 09:55
@coreyoconnor
Copy link
Contributor

bazel-out/host/bin/tensorflow/python/gen_nn_ops_py_wrappers_cc: symbol lookup error: /build/output/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/../../_solib_k8/_U_S_Stensorflow_Spython_Cgen_Unn_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so.1: undefined symbol: _ZN10tensorflow16ProtoDebugStringB5cxx11ERKNS_16DeviceAttributesE

this issue is resolved with timokau changes. Unfortunately, I missed the latest patch and ran into the gast error. XD

Testing again.

@timokau
Copy link
Member Author

timokau commented Nov 12, 2019

Yeah, sorry about that. I made some mistake during fixup before push. Should work now.

@coreyoconnor
Copy link
Contributor

@timokau no problem!

Subsequent build failed on this

      # for some reason some of the required libs are in the targets/x86_64-linux                                                                                                             
      # directory; not sure why but this works around it                                                                                                                                      
      "${cudatoolkit}/targets/${stdenv.system}"                                                                                                                                               

I'm building using cudatoolkit 10. Which is not the default in all-packages. I wonder if this workaround is specific to cudatoolkit 9?

@coreyoconnor
Copy link
Contributor

Success! Compiled without issue with cudatoolkit_10 (once I removed that workaround). So I presume yes: that is specific to cudatoolkit_9.

@timokau
Copy link
Member Author

timokau commented Nov 13, 2019

Thanks for testing! I don't quite understand what you're saying though. So the workaround is not necessary for a newer cudatoolkit? That's good, but why does it fail? If the workaround is not necessary, it still shouldn't do any harm right?

Now needed at runtime with the python2 build.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@timokau timokau changed the title [WIP] python.pkgs.tensorflow: 1.14.0 -> 1.15.0 python.pkgs.tensorflow: 1.14.0 -> 1.15.0 Nov 13, 2019
@timokau timokau removed 2.status: wait-for-upstream Waiting for upstream fix (or their other action). 2.status: work-in-progress This PR isn't done labels Nov 13, 2019
@timokau
Copy link
Member Author

timokau commented Nov 13, 2019

Merging now, since this has been broken on master for so long. Any fixups can be made in followup PRs. For example, python2 is still broken. It's probably not too difficult to fix, but I don't plan to invest more time in this for now.

@timokau timokau merged commit 0ae46ca into NixOS:master Nov 13, 2019
@timokau timokau deleted the tensorflow-1.15.0 branch November 13, 2019 16:11
@coreyoconnor
Copy link
Contributor

coreyoconnor commented Nov 13, 2019 via email

@coreyoconnor
Copy link
Contributor

coreyoconnor commented Nov 13, 2019 via email

@timokau
Copy link
Member Author

timokau commented Nov 13, 2019

Oh, so the path doesn't exist. Its weird that the workaround is necessary at all. The best solution would probably be to fix the cudatoolkit 9 build to make the workaround unnecessary for both versions.

Do feel free to create a PR to fix this either in the tensorflow build or in the cudatoolkit build. If you do, ping me and I'll review (to the best of my ability, since I don't have a nvida card) :)

@coreyoconnor
Copy link
Contributor

@timokau

will do! Thanks again for figuring this out :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

python3Packages.tensorflow build is broken on master
4 participants