Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] tensorflow: re-enable build from source #63616

Closed
wants to merge 11 commits into from

Conversation

timokau
Copy link
Member

@timokau timokau commented Jun 21, 2019

Motivation for this change

This resurrects the tensorflow source build. Its WIP because it builds an unreleased tensorflow version (current release didn't support the current bazel) and needs some cleanup, but it works. Builds upon #63580.

My motivation for this was that I need to run tensorflow on a VM without avx instructions while the official wheel requires avx (tensorflow/tensorflow#17411). There are still cpus sold without AVX support, so we shouldn't require it by default.

Not sure yet if I'll actually push it over the finish line, since it "works for me" for now. But anyway, here it is.

CC @volth, @abbradar who have touched the source build before.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@timokau timokau requested a review from FRidh as a code owner June 21, 2019 12:55
@timokau timokau changed the title tensorflow: re-enable build from source [WIP] tensorflow: re-enable build from source Jun 21, 2019
@ofborg ofborg bot requested a review from abbradar June 21, 2019 13:08
@timokau
Copy link
Member Author

timokau commented Jun 22, 2019

Of course a mix of perfectionism and sunk-cost fallacy drove me to continue to work on it.

I've included a tf 1.14 update and an update of the dependencies. Turns out the source build for 1.14 works after all, we just have to tell tensorflow not to check the maximum bazel version.

The value of taking some libs from the system is a bit questionable, but it's probably the right thing to do anyways.

@abbradar
Copy link
Member

Wow! That's really cool, I didn't expect anyone else having enough patience to dive into this :D

I'll hopefully be able to test during this week when replacement part for my PC arrives - Tensorflow is painfully slow to build otherwise.

This was referenced Jun 24, 2019
@timokau
Copy link
Member Author

timokau commented Jun 24, 2019

Thanks! Apparently at least two people are #63208

This isn't quite done yet, had to debug a nasty issue with the cuda build (#38471 (comment)). Then some cleanup.

@Profpatsch
Copy link
Member

Hm, can you explain why you need gcc and binutils in the bazel derivation for this to work? What’s the errors you are getting otherwise?

@timokau
Copy link
Member Author

timokau commented Jun 25, 2019

That is to fix the issue I explained here, where bazel tries to take the default locations for certain tools from PATH during build time.

Unfortunately that still doesn't fix the underlying problem, which is the build (only with cuda for some reason) failing because bazel tries to execute /usr/bin/ar. More details here: bazelbuild/bazel#8715

@timokau
Copy link
Member Author

timokau commented Jun 25, 2019

Cuda build finally works! Cleanup still needed..

@timokau
Copy link
Member Author

timokau commented Jun 26, 2019

Some cleanup done.

@Profpatsch the bazel change ended up not being necessary here, since tensorflow sets its own toolchain. It may still be a good idea to fix up bazels default toolchain though. Unfortunately I couldn't figure out where the actual result ends up, so I couldn't verify my patches.

@timokau
Copy link
Member Author

timokau commented Jun 27, 2019

Forked out the update to #63873, because that is ready for merge now.

@abbradar
Copy link
Member

I work on merging both @timokau and @yorickvP's changes into one derivation which produces both a wheel and a library.

@timokau
Copy link
Member Author

timokau commented Jul 12, 2019

Thanks!

This PR isn't quite done however. The python2 build still sometimes fails, I'll probably need to add another fix to python2 (in addition to #64067). The reason I haven't worked on this is that I currently need the hardware for more important things (i.e. actually using tensorflow as opposed to packaging it). I'll likely get back to it some time, but if you want to take over I certainly won't complain either :)

Timestamp verification skip is no longer needed (not sure why). Generally we
better off always using the environment hack for all packages because that
ensures all NIX_* flags are correctly applied.

One possible improvement in future is to filter only NIX_* variables to
passthru in Bazel.
This merges work done by yorickvP and timokau in NixOS#63208 and NixOS#63616 respectively.
Now the derivation builds both libtensorflow and the Python package and puts them into
different outputs.

Quite a bit of improvements were done on the top, including:

* Use official tag revision as source, not a branch;
* Use all system libraries possible (before only one was actually used);
* Move various environment variables to the derivation itself from hooks;
* Use source Python build instead of wheel build to ensure fixup hooks do their important jobs on libraries;
* And more that I forgot!
Without this Bazel always picks Python 3 which breaks Python 2 packages.
Strangely enough just dropping this patch works, with all `bazel.tests`
passing.
+ /* first construct the filename with a .tmp suffix */
+ cpathname_tmp = PyMem_MALLOC(strlen(cpathname) + 5);
+ if (cpathname_tmp == NULL) {
+ PyErr_Clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, we need to clear the error here because this function is okay to fail. I updated my patch.

I didn't use goto because the original author preferred less control flow and I wanted to keep the original patch's spirit.

timokau and others added 4 commits July 18, 2019 11:00
This is python bug https://bugs.python.org/issue13146. Fixed since
python 3.4. It makes pyc creation atomic, preventing a race condition.
The patch has been rebased on our deterministic build patch.

It wasn't backported to python 2.7 because there was a complaint about
changed semantics. Since files are now created in a temporary directory
and then moved, symlinks will be overridden. See
https://bugs.python.org/issue17222.

That is an edge-case however. Ubuntu and debian have backported the fix
in 2013 already, making it mainstream enough for us to adopt.

(cherry picked from commit 9db3a58)
Turns out fixing this only in importlib is not sufficient and we
need to backport CPython part of the fix too.

This patch is based on https://hg.python.org/cpython/rev/c16063765d3a
but because the code around is different there are some changes (C-strings
instead of Python objects etc.)

With this patch Tensorflow builds successfully on many-core machine.

(cherry picked from commit da295a1)
They sometimes take separate flags.
Now need to be passed in the configure phase.
abbradar added a commit to abbradar/nixpkgs that referenced this pull request Jul 31, 2019
This merges work done by yorickvP and timokau in NixOS#63208 and NixOS#63616 respectively.
Now the derivation builds both libtensorflow and the Python package and puts them into
different outputs.

Quite a bit of improvements were done on the top, including:

* Use official tag revision as source, not a branch;
* Use all system libraries possible (before only one was actually used);
* Move various environment variables to the derivation itself from hooks;
* Use source Python build instead of wheel build to ensure fixup hooks do their important jobs on libraries;
* And more that I forgot!
@timokau
Copy link
Member Author

timokau commented Aug 8, 2019

Superseded by #64716, thanks @abbradar!

@timokau timokau closed this Aug 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants