Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build Tensorflow and libtensorflow from source #64716

Merged
merged 9 commits into from Jul 31, 2019

Conversation

abbradar
Copy link
Member

@abbradar abbradar commented Jul 13, 2019

Motivation for this change

Restore Tensorflow source build while leaving binary builds . Based on @yorickvP and @timokau work.with some parts taken from both.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

cc @uri-canva for changes regaring Bazel.

Tested both CUDA and non-CUDA builds of the Python library, both source and binary builds. Didn't test libtensorflow, I assume that if Python library works and ldd output is okay libtensorflow is okay too.

@abbradar
Copy link
Member Author

Forgot to mention actual previous PRs: #63208 and #63616.

@abbradar
Copy link
Member Author

Darwin still uses binary builds for now. I don't have a Darwin machine so I can't test if source build works and fix it if it doesn't - unless someone wants to take on that role!

@abbradar abbradar requested a review from timokau July 13, 2019 22:48
@abbradar
Copy link
Member Author

(For some reason I cannot request a review from @yorickvP on Github)

@abbradar
Copy link
Member Author

Updated binary libtensorflow doesn't build for Darwin. Can someone with Darwin knowledge help with it?

@danieldk
Copy link
Contributor

danieldk commented Jul 14, 2019

Updated binary libtensorflow doesn't build for Darwin. Can someone with Darwin knowledge help with it?

It seems that Google now build on newer macOS versions, I had the same problem with Tensorflow 1.13.1 when building on my own MacBook with Mojave. A newer cctools is needed, see #49371. I think this is the relevant PR: #61172

@yorickvP
Copy link
Contributor

This is great! I'm hoping to get a fully static build at some point, but having the same protobuf on tensorflow as the rest of the code should fix our linker errors! :)

You can't request a review from me because I don't have commit rights, I suppose.

Copy link
Contributor

@yorickvP yorickvP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoever built tensorflow upstream must have patched out that saved_model_portable_proto line. it's been in there for several version snow.

pkgs/build-support/build-bazel-package/default.nix Outdated Show resolved Hide resolved
@yorickvP yorickvP mentioned this pull request Jul 14, 2019
10 tasks
@timokau
Copy link
Member

timokau commented Jul 14, 2019

As I mentioned in #63616, there are still issues with the python2 build. Seems to be a general problem with parallelism on python2, only manifests itself here because its such a big project and basically hopeless to build without a lot of parallelism (used 12 cores).

I already added one patch to python2 (#64067), but I still saw the issue (sometimes, its transient) after that. I suspect the second part of the fix in the python ticket is also needed. That was never backported by debian, so I had to do it myself (being not very experienced with C). I've pushed those changes to the branch of the other PR. Its not very polished, which is why I didn't push it before.

I don't have time to work on this or give extensive review right now. But I can answer questions if you have any.

@abbradar
Copy link
Member Author

@timokau I saw that one but IIUC this is a general Python problem, not Tensorflow-specific one, so I didn't really look into it. It shouldn't happen during the "big" Tensorflow build because I only build "source" package now and build the wheel separately with our buildPythonProgram. I didn't observe the problem on 8-core (16-thread) machine during ~10 builds too.

@timokau
Copy link
Member

timokau commented Jul 14, 2019

Neat, so you only need bazel for the C part and don't need to unnecessarily rebuild everything for libtensorflow, py2 tensorflow and py3 tensorflow?

@abbradar
Copy link
Member Author

Sadly no, you still need to rebuild everything because their C libraries link to different Pythons and I presume use different APIs. Still, in the end you get something that resembles a usual source Python package with setup.py etc. I use this as src for buildPythonPackage so any Python bugs should be isolated there.

@timokau
Copy link
Member

timokau commented Jul 14, 2019

Oh, so you mean you just skip the final bdist_wheel as it was done previously? I actually changed that on purpose, so that we can unify the binary and source builds. That way we wouldn't have all the duplication between default.nix and bin.nix.

I don't think the bdist_wheel step alone was responsible for the build failures, but who knows.

@abbradar
Copy link
Member Author

The primary reason I switched back to the previous behavior was to ensure fixupPhase runs on unpacked Tensorflow libraries. This ensures RPATH is set up correctly.

@abbradar abbradar closed this Jul 14, 2019
@abbradar abbradar reopened this Jul 14, 2019
@yorickvP
Copy link
Contributor

Will this work with bazel 0.28.0? #64633

@timokau
Copy link
Member

timokau commented Jul 15, 2019

My PR had some issues with bazel 0.27, not sure if @abbradar worked around those. But the bazel team has promised no further backwards incompatibilities for the next 3 releases (then they'll break backwards compatibility once more and release 1.0).

@abbradar
Copy link
Member Author

But the bazel team has promised no further backwards incompatibilities for the next 3 releases (then they'll break backwards compatibility once more and release 1.0).

Nice news! I didn't test 0.28 though yet, wanted to do it today but sadly I won't have time for one more rebuild.

some issues with bazel 0.27, not sure if @abbradar worked around those

Huh, didn't notice anything. Maybe you remember something specific? I apply patch for 0.27 from upstream and also add a flag to enable backwards compatibility with some behavior.

@abbradar
Copy link
Member Author

Now that other parts are here I'll wait till staging goes into master and merge this (so that we don't break Python 2 + Tensorflow).

@timokau
Copy link
Member

timokau commented Jul 17, 2019

Why not just retarget this PR to staging and merge now? Then we'll get it automatically on the next staging-next merge.

@abbradar
Copy link
Member Author

abbradar commented Jul 17, 2019 via email

@timokau
Copy link
Member

timokau commented Jul 17, 2019

Fair enough.

@timokau
Copy link
Member

timokau commented Jul 18, 2019

I tried to build tensorflow with SSE4.1 support, but unfortunately the build failed. Apparently that has to be specified at configure time now. This is my attempt to get it working:
https://github.com/timokau/nixpkgs/commits/3323284
The patch to buildBazelPackages is necessary since tensorflow won't accept the --config=opt flag for fetching.

Feel free to cherry-pick and/or rebase.

timokau and others added 9 commits July 31, 2019 13:28
This also changes it to a from-source build.
This merges work done by yorickvP and timokau in NixOS#63208 and NixOS#63616 respectively.
Now the derivation builds both libtensorflow and the Python package and puts them into
different outputs.

Quite a bit of improvements were done on the top, including:

* Use official tag revision as source, not a branch;
* Use all system libraries possible (before only one was actually used);
* Move various environment variables to the derivation itself from hooks;
* Use source Python build instead of wheel build to ensure fixup hooks do their important jobs on libraries;
* And more that I forgot!
They sometimes take separate flags.
Now need to be passed in the configure phase.

abbradar: Don't change CUDA build hash.
@timokau
Copy link
Member

timokau commented Aug 24, 2019

@abbradar in case you haven't noticed yet, the tensorflow build (py2 and py3, but there are different issues for py2 and py3 I think) is currently failing. Unfortunately I don't have access to the necessary compute resources to debug this right now.

@abbradar
Copy link
Member Author

Fixed in 7109546. Thanks for the heads up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants