New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] tensorflow: re-enable build from source #63616
Conversation
7a588e1
to
e7c7310
Compare
Of course a mix of perfectionism and sunk-cost fallacy drove me to continue to work on it. I've included a tf 1.14 update and an update of the dependencies. Turns out the source build for 1.14 works after all, we just have to tell tensorflow not to check the maximum bazel version. The value of taking some libs from the system is a bit questionable, but it's probably the right thing to do anyways. |
e7c7310
to
94357aa
Compare
Wow! That's really cool, I didn't expect anyone else having enough patience to dive into this :D I'll hopefully be able to test during this week when replacement part for my PC arrives - Tensorflow is painfully slow to build otherwise. |
Thanks! Apparently at least two people are #63208 This isn't quite done yet, had to debug a nasty issue with the cuda build (#38471 (comment)). Then some cleanup. |
639cb01
to
a70d14f
Compare
Hm, can you explain why you need |
That is to fix the issue I explained here, where Unfortunately that still doesn't fix the underlying problem, which is the build (only with cuda for some reason) failing because bazel tries to execute |
a70d14f
to
8d237ee
Compare
Cuda build finally works! Cleanup still needed.. |
8d237ee
to
21db4a7
Compare
Some cleanup done. @Profpatsch the bazel change ended up not being necessary here, since tensorflow sets its own toolchain. It may still be a good idea to fix up bazels default toolchain though. Unfortunately I couldn't figure out where the actual result ends up, so I couldn't verify my patches. |
Forked out the update to #63873, because that is ready for merge now. |
Thanks! This PR isn't quite done however. The python2 build still sometimes fails, I'll probably need to add another fix to python2 (in addition to #64067). The reason I haven't worked on this is that I currently need the hardware for more important things (i.e. actually using tensorflow as opposed to packaging it). I'll likely get back to it some time, but if you want to take over I certainly won't complain either :) |
This also changes it to a from-source build.
Timestamp verification skip is no longer needed (not sure why). Generally we better off always using the environment hack for all packages because that ensures all NIX_* flags are correctly applied. One possible improvement in future is to filter only NIX_* variables to passthru in Bazel.
This merges work done by yorickvP and timokau in NixOS#63208 and NixOS#63616 respectively. Now the derivation builds both libtensorflow and the Python package and puts them into different outputs. Quite a bit of improvements were done on the top, including: * Use official tag revision as source, not a branch; * Use all system libraries possible (before only one was actually used); * Move various environment variables to the derivation itself from hooks; * Use source Python build instead of wheel build to ensure fixup hooks do their important jobs on libraries; * And more that I forgot!
21db4a7
to
9a35096
Compare
Without this Bazel always picks Python 3 which breaks Python 2 packages. Strangely enough just dropping this patch works, with all `bazel.tests` passing.
+ /* first construct the filename with a .tmp suffix */ | ||
+ cpathname_tmp = PyMem_MALLOC(strlen(cpathname) + 5); | ||
+ if (cpathname_tmp == NULL) { | ||
+ PyErr_Clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, we need to clear the error here because this function is okay to fail. I updated my patch.
I didn't use goto
because the original author preferred less control flow and I wanted to keep the original patch's spirit.
This is python bug https://bugs.python.org/issue13146. Fixed since python 3.4. It makes pyc creation atomic, preventing a race condition. The patch has been rebased on our deterministic build patch. It wasn't backported to python 2.7 because there was a complaint about changed semantics. Since files are now created in a temporary directory and then moved, symlinks will be overridden. See https://bugs.python.org/issue17222. That is an edge-case however. Ubuntu and debian have backported the fix in 2013 already, making it mainstream enough for us to adopt. (cherry picked from commit 9db3a58)
Turns out fixing this only in importlib is not sufficient and we need to backport CPython part of the fix too. This patch is based on https://hg.python.org/cpython/rev/c16063765d3a but because the code around is different there are some changes (C-strings instead of Python objects etc.) With this patch Tensorflow builds successfully on many-core machine. (cherry picked from commit da295a1)
They sometimes take separate flags.
Now need to be passed in the configure phase.
9a35096
to
0c3db9a
Compare
This merges work done by yorickvP and timokau in NixOS#63208 and NixOS#63616 respectively. Now the derivation builds both libtensorflow and the Python package and puts them into different outputs. Quite a bit of improvements were done on the top, including: * Use official tag revision as source, not a branch; * Use all system libraries possible (before only one was actually used); * Move various environment variables to the derivation itself from hooks; * Use source Python build instead of wheel build to ensure fixup hooks do their important jobs on libraries; * And more that I forgot!
Motivation for this change
This resurrects the tensorflow source build. Its WIP because it builds an unreleased tensorflow version (current release didn't support the current bazel) and needs some cleanup, but it works. Builds upon #63580.
My motivation for this was that I need to run tensorflow on a VM without avx instructions while the official wheel requires avx (tensorflow/tensorflow#17411). There are still cpus sold without AVX support, so we shouldn't require it by default.
Not sure yet if I'll actually push it over the finish line, since it "works for me" for now. But anyway, here it is.
CC @volth, @abbradar who have touched the source build before.
Things done
sandbox
innix.conf
on non-NixOS)nix-shell -p nix-review --run "nix-review wip"
./result/bin/
)nix path-info -S
before and after)