Build Tensorflow from source #30434
cc @jyp -- I won't have access to NVIDIA GPU for quite some time yet so it'd be cool if you could test this. |
@TravisWhitaker , you may also be interested in this. |
Your bazel changes seem helpful more broadly than here too! |
@copumpkin Yeah, I'd imagine this approach can be used to build any other Bazel project on Nix. If you have anything else in mind we could merge the Bazel patch separately faster. |
@abbradar This looks excellent, thanks so much for doing this! I'll give this a go tomorrow for sure. |
};

cudatoolkit8 = common {
  version = "8.0.61";
- url = https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda_8.0.61_375.26_linux-run;
+ url = "https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda_8.0.61_375.26_linux-run";
Why these quotes?
I semi-consciously replace unquoted URLs with quoted ones because otherwise my terminal treats the ; as part of the URL and doesn't let me open it with a click.
What is the build time for Tensorflow? |
@edolstra ~20 minutes for a pure CPU build and ~35 for a CUDA-enabled one on my machine -- this is very subjective though, I didn't take any measurements. EDIT: this is on an Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz (Haswell laptop-grade i7). |
Okay, that's not too bad :-) |
Impressive. I gave up on this one a while back for sanity reasons. Packages be warned, don't mess with @abbradar . |
@abbradar Bazel fails to build with trying to access
|
@jyp Not sure what happens here. Do you have sandbox enabled? |
@abbradar No, I am not using sandboxing. Should I? I see now that |
@jyp Yeah, it's disabled by default for users for performance reasons, but it's practically required once you start building stuff.
|
@abbradar
|
Wha~... Not sure what happens here again, I haven't seen that before. What do you mean by "single process"? EDIT: do you use my branch, or did you cherry-pick my patches on top of another tree? |
@abbradar I meant not using the "-j n" option. After activating sandboxing (using /etc/nix/nix.conf, because I am not on NixOS but on Fedora), I am getting the same error, but apparently at a different point:
|
About single process: this seems to be a bug in our Bazel package then, I'll look at this later. About your setup: do you use single-user or multi-user Nix? Try to build it with EDIT: |
@abbradar I'm using multi-user nix.
Here is the output of ldd:
```
$ ldd /tmp/nix-build-python3.6-tensorflow-1.3.1.drv-0/tensorflow-v1.3.1-src/output/execroot/tensorflow-v1.3.1-src/bazel-out/host/bin/external/protobuf/js_embed
linux-vdso.so.1 (0x00007ffe39da2000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007ff8f174b000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ff8f1534000)
libc.so.6 => /lib64/libc.so.6 (0x00007ff8f116e000)
libm.so.6 => /lib64/libm.so.6 (0x00007ff8f0e65000)
/nix/store/yydnhs7migvlbl48wpsxan1yvq2icbr9-glibc-2.25-49/lib/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x0000560e970ae000)
```
|
It seems to impurely use your host OS's libraries for some reason. That shouldn't happen with sandboxing. Do you use |
@abbradar Yes, I am using |
|
I managed to get my laptop's hybrid NVIDIA GPU working with CUDA and verified that TensorFlow works for me. |
That's great :) But unfortunately I am still getting the same error after recompiling Bazel and TensorFlow. How can I verify that sandbox mode is enabled? |
Try to build this:
with |
Thanks. So for some reason nix-* disregards my configuration |
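For readers who want a similar check: the derivation suggested in the comment above is not preserved here, so the following is only a minimal sketch of one possible heuristic, not the original. It assumes that the host has a `/usr` directory and that `/usr` is not listed in the sandbox paths; the file name `sandbox-check.nix` and the derivation name are made up for illustration.

```nix
# sandbox-check.nix (hypothetical file name, not from this PR)
# Heuristic: inside the Nix sandbox the host's /usr should not be visible.
# Assumes /usr exists on the host and is not added to the sandbox paths.
with import <nixpkgs> {};

runCommand "sandbox-check" {} ''
  if [ -e /usr ]; then
    echo "host /usr is visible: this build is NOT sandboxed" >&2
    exit 1
  fi
  echo "host /usr is not visible: the sandbox appears to be active" > $out
''
```

Building it with `nix-build sandbox-check.nix` should fail when the sandbox is off and succeed when it is on. Note that once the build has succeeded, Nix reuses the existing store path and does not run the check again.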
Did you restart |
I'm also failing on
$ nix-build --option build-use-sandbox true -E 'with import ./default.nix {}; pythonPackages.tensorflow.override {cudaSupport = true; cudaCapabilities = ["3.7" "6.1"];}'
I had this problem with my impure Tensorflow derivation too (I need to build with a specific Bazel version), and hacked around it with:
bazel.overrideAttrs (a: rec
{
  preBuild = ''
    rm -rf /tmp/.bazel
  '';
  postFixup = ''
    rm -rf /tmp/.bazel
  '';
}); |
@abbradar Yes, that is what I did, with apparently no effect. |
@jyp Strange! What happens to you with EDIT: but just to be sure: |
Finally I managed to enable the sandbox and I'm getting yet another error:
|
Hmm, can you allow
and fixing your |
I've pushed an update which hopefully fixes Bazel's @edolstra , I've measured TensorFlow compile time with CUDA enabled on my machine -- 55 minutes. Seems I tend to strongly underestimate time ^_^" |
@abbradar any thoughts on upstreaming your changes to make Bazel make fewer assumptions about its build environment? |
@copumpkin The problem is that my patch is very much a hack: it breaks Bazel's environment isolation (which is really a bad thing, but we want it since we have our own!) and breaks its checksums to achieve network isolation. All other changes are our usual Nix package building business like |
I don't understand what this means. |
@jyp This derivation should work for you. If it doesn't, that means your sandbox doesn't include |
@abbradar It works :) And thanks a lot for guiding me through all this! A few remarks:
Could we enable those build flags? |
|
It would be nice to be able to make the flags available anyway; I am guessing that they make quite a bit of difference in performance. Also, some options are pretty much safe to enable (SSE) since they were introduced on CPUs 10 years ago. (Who is using TensorFlow on such old machines?) |

disabled = isPyPy || pythonOlder "2.6" || (isPy3k && pythonOlder "3.3");

src = fetchurl {
fetchPypi
# the fix for which hasn't been merged yet.

# keep Nose around since running the tests by hand is possible from Python or bash
buildInputs = [ nose ];
checkInputs
* Skip verifying checksums for already fetched packages. Needed for two-stage building in Nix:
  1. Build a fixed-output derivation with `bazel fetch` (with non-reproducible bits filtered out).
  2. Build the actual derivation, which uses the fetched dependencies (skipping checksums is needed here because they depend on the build directory).
* Don't clean environment variables for child processes. Needed for Nix compiler wrappers.
Also clean up TMPDIR.
Build from source. It's implemented as a two-stage Bazel build (see also 546b4aec776b3ea676bb4d58d89751919ce4f1ef).
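The two-stage approach described in these commit messages can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the code from this PR: the derivation names, the `output` directory used as Bazel's output base, the `//...` target pattern, and the placeholder hash are all hypothetical, and a real package would also need the filtering of non-reproducible files and the patched Bazel mentioned above.

```nix
# Hypothetical sketch of a two-stage Bazel build; not the code from this PR.
{ stdenv, bazel, src }:

let
  # Stage 1: a fixed-output derivation. Nix allows network access here
  # because the result is pinned by outputHash.
  deps = stdenv.mkDerivation {
    name = "example-bazel-deps";
    inherit src;
    nativeBuildInputs = [ bazel ];

    buildPhase = ''
      export HOME="$TMPDIR"
      # Download all external dependencies into the chosen output base.
      bazel --output_base="$PWD/output" fetch //...
    '';

    installPhase = ''
      # A real package would have to strip non-reproducible files here,
      # otherwise the output hash will not be stable.
      cp -r output/external $out
    '';

    # Leave the fetched files untouched.
    dontFixup = true;

    outputHashMode = "recursive";
    outputHashAlgo = "sha256";
    # Placeholder: replace with the hash reported by the first build attempt.
    outputHash = "0000000000000000000000000000000000000000000000000000";
  };
in
# Stage 2: the actual build, which runs without network access.
stdenv.mkDerivation {
  name = "example-bazel-build";
  inherit src;
  nativeBuildInputs = [ bazel ];

  buildPhase = ''
    export HOME="$TMPDIR"
    mkdir -p output
    # Reuse the dependencies fetched in stage 1.
    cp -r ${deps} output/external
    chmod -R +w output/external
    # Assumption: Bazel is patched to skip checksum verification of the
    # already fetched dependencies, as the commit message describes.
    bazel --output_base="$PWD/output" build //...
  '';

  installPhase = ''
    mkdir -p $out
    cp -r bazel-bin/. $out/
  '';
}
```

The split exists because only fixed-output derivations may access the network, so everything Bazel wants to download has to be captured in the first stage and pinned by its hash.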
You should be able to use It might be better to enumerate Regarding downloads, generally (that is, assuming the respective build systems are well behaved) you should be able to specify a repository for anything that would need to be downloaded, which you can point at a store path when building as part of a Nix derivation. You can see where I leverage this to point Envoy's build system at third-party dependencies that are built by Nix: nixpkgs/pkgs/tools/networking/envoy/default.nix Lines 164 to 168 in 30f6d63
If you can't prevent fetching stuff over the network, that's either a bug in TensorFlow's build system or a bug in Bazel. I'll try to find time to improve the package and report any issues upstream. |
@cstrahan I've seen this article but isn't this a design document? I thought that this functionality is not implemented yet -- if I'm mistaken it's great, let's then just passthru About local repositories -- I didn't know that! From the build file it seemed that it'd always just download dependencies over the network but I'm very unfamiliar with Bazel. It would be great if you could take a look at it. |
Motivation for this change
(Do not merge before #30433!)
What it says.
Things done
* build-use-sandbox in nix.conf (on non-NixOS)
* nix-shell -p nox --run "nox-review wip"
* ./result/bin/
This is based on top of #30433, so there are extra commits (you are only interested in the last two).
A possible problem with this is long compile times with CUDA (half an hour+), where previously we used prebuilt binaries.
Bazel is a wonderful tool for building projects -- it's certainly a genius idea to create and use a 500k LoC Java build manager tool which eats 1.5G of memory for building C++. Unfortunately, Nix is not enterprise enough, so our ways of isolating builds were not supported by Bazel, making it effectively impossible to build a project with it in Nix (which is why TensorFlow used a binary distribution before). Specifically, Bazel tries to download all external project dependencies by itself, which fails because of Nix isolation. In vanilla Bazel there is no way to circumvent this (please correct me if there's a better way!).
My solution is to add a specially patched version of Bazel which skips checking the checksums of fetched dependencies (these checksums depend, among other things, on build directory paths). Then we can do a two-stage build as described in 9b3559f.
This, along with TensorFlow's own peculiarities (special scripts for everything, awful compile times because all dependent libraries like CURL or ffmpeg are built statically, incomprehensible build system scripts even considering Bazel, etc.), made this PR the work of three or four straight days. I'd say this gets a solid second place in my list of most awful packages, after Telegram.
Sorry for the rant ~_~