
pythonPackages.tensorflow: fix for bazel settings for intel mkl, dnnl #69454

Merged (1 commit, Mar 31, 2020)

Conversation

@jmillerpdt (Contributor) commented Sep 26, 2019

modified:   pkgs/development/python-modules/tensorflow/default.nix
Motivation for this change

The Nix TensorFlow package does not support optionally using Intel's MKL optimizations, which must be enabled through additional compile flags.

On modern Intel processors, this will generally provide improved performance and is recommended in Tensorflow's performance guide.

NOTE: This change depends on MKL-DNN, the open PR found here: #68014

I've tested the changes on Linux with the above-mentioned PR applied, both with and without Intel MKL enabled, to confirm it works in both cases. I do not have access to darwin for testing, but I believe that platform redirects to prebuilt binaries. I would appreciate it if someone could test with CUDA, as MKL may need to be disabled when cudaSupport is enabled (I can make the change, but cannot test that myself).

With MKL enabled, tensorflow.python.framework.test_util.IsMklEnabled() returns True.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.
Notify maintainers

cc @jyp @abbradar

@jmillerpdt changed the title from "tensorflow: bugfix for bazel settings (intel mkl, dnnl)" to "pythonPackages.tensorflow: fix for bazel settings for intel mkl, dnnl" on Sep 26, 2019
@jmillerpdt mentioned this pull request Oct 2, 2019
@timokau (Member) commented Oct 17, 2019

This is blocked on #68014, which has already been reviewed by people with commit access but for some reason not merged. I don't have time to review that right now.

This PR looks good to me, as long as mkl and dnnl are open source this should be a clear improvement :)

@timokau (Member) commented Oct 17, 2019

Although it would be nice if you could point out where exactly those guidelines are. You just linked to the benchmarks folder?

@jmillerpdt (Contributor, Author)

@timokau It looks like Google pulled that page after releasing Tensorflow 2.0 and is now redirecting it. It's possible to use the Wayback Machine to view the historical page, but here is Intel's page on the same topic. And, FWIW, it seems Anaconda now defaults to using MKL.

Having these libraries should generally improve performance for those using CPUs. While MKL-DNN (>= 1.x) would generally qualify under most definitions of free, the core MKL is shipped as a binary, so it would not be available in Nix by default and has to be enabled per the documentation. For many people, and for Hydra, this change will be a no-op. But for those focused on performance, it will be easily available, in a similar way to PyTorch and other libraries, once they enable MKL in their site configuration.

It looks like you have 1.15 and 2.0 PR updates in progress, so I'm fine with this being rolled into those (I just don't want it to get lost). Tensorflow 2.0 now enables CUDA by default, and I'm unsure of the interaction between MKL and CUDA, so it may be that MKL should only be enabled when CUDA is not. How you want to handle this depends on how you're planning to approach the "enabled by default" question for Tensorflow 2.0 in Nix.

@timokau (Member) commented Oct 18, 2019

Ah right, I was in a bit of a hurry and didn't realize that users have to opt in to mkl. Then the only downside I see here is that people opting in to mkl will have to rebuild tensorflow from source, which can take quite some time. It's probably a reasonable assumption that people who care enough about performance to enable mkl are willing to do this, but we should still give people who only want mkl for numpy (which doesn't take quite as long to rebuild) a way out.

So I suggest that instead of depending directly on numpy.blasImplementation, you add a new parameter mklSupport that defaults to numpy.blasImplementation == "mkl". That way people can still make an overlay and unconditionally enable or disable it.
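A minimal sketch of that suggestion against the tensorflow derivation (the mklSupport name and its default come from the suggestion itself; the surrounding argument list and the exact bazel flag are assumptions for illustration, not the actual diff):

```nix
{ lib, buildBazelPackage, numpy
  # New opt-in parameter; defaults to following numpy's BLAS choice,
  # but an overlay can override it unconditionally in either direction.
, mklSupport ? numpy.blasImplementation == "mkl"
, ... }:

buildBazelPackage {
  # ...
  # Hypothetical: pass the MKL build config to bazel only when requested.
  bazelFlags = lib.optionals mklSupport [ "--config=mkl" ];
}
```

An overlay could then set `mklSupport = false;` via `override` even when numpy is built with MKL, or force it on without touching numpy.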

Once that is done and the dnnl PR is merged, I see no reason to hold this back. It doesn't have to be merged with the 1.15 update; that update could be blocked for as long as upstream needs to adjust tensorflow for bazel 1.0. It's a mystery to me why bazel would make its first major release without making sure that at least its biggest public consumers don't break, but that is a different question.

I think there should be no bad interaction with cuda, although it would be nice if you could test that if you have the hardware. The "cuda enabled by default" question is a difficult one, which I'll postpone to the actual 2.0 update (keeping it disabled for now).

@FRidh (Member) commented Oct 19, 2019

So I suggest that instead of directly depending on numpy.blasImplementation, you instead add a new parameter mklSupport that defaults to numpy.blasImplementation == "mkl". That way people can still make an overlay and unconditionally enable or disable it.

I suppose tensorflow cannot work with MKL if numpy is built without it? And why would you build without MKL if numpy is already built with it? Note that one has to perform a rebuild anyway, because the hash of numpy changes when enabling MKL.

@timokau (Member) commented Oct 19, 2019

I suppose tensorflow cannot work with MKL if numpy is built without?

I'm not sure.

And why would you build without MKL if numpy is already built with it? Note that one has to perform a rebuild anyway, because the hash of numpy changes when enabling MKL.

Right, good point. So in that case this PR should be fine as-is, as soon as the dependency is merged.

@jmillerpdt (Contributor, Author)

I suppose tensorflow cannot work with MKL if numpy is built without?

I'm not sure.

You probably don't want to mix math libraries in your Python stack. If MKL is used, it should be applied consistently.

And why would you build without MKL if numpy is already built with it? Note that one has to perform a rebuild anyway, because the hash of numpy changes when enabling MKL.

Right, good point. So in that case this PR should be fine as-is, as soon as the dependency is merged.

I attempted to test the enableCuda branch, but it looks like the recent merge of Bazel 1.0 (#69252) now breaks tensorflow, since that version of Bazel is known to be incompatible.

You reference this in the WIP PRs, but I think it affects even the master branch.

@timokau (Member) commented Oct 20, 2019

I attempted to test the enableCuda branch, but it looks like the recent merge of Bazel 1.0 (#69252) now breaks tensorflow since that version of Bazel is known to be incompatible.

You reference this in the WIP PRs, but I think it affects even the master branch.

Right, we'll have to wait for upstream on this. Looks like someone is on it. Of course if you want to take a shot at fixing this yourself, feel free to :)

@FRidh (Member) commented Oct 22, 2019

Note there is a PR for Bazel 1.1: #71612

@timokau (Member) commented Oct 22, 2019

Bazel has promised to adhere to semver (and to leave at least 3 months between breaking changes), so that shouldn't make the situation worse than it is. I would have preferred blocking the 1.0 update on tensorflow compatibility, but now I hope that upstream support will come soon.

@flokli (Contributor) commented Oct 22, 2019

I'm also a bit surprised that tensorflow was caught off guard by that. Good that a fix is in sight.

On the other hand, I also want to note that bazel was merged to master, not release-19.09, and we call the channel nixos-unstable, so I'm personally comfortable with breaking some things if the result is something better; we aren't really blocking people who rely on stability.

@FRidh (Member) commented Feb 9, 2020

Is this still needed? Note there's a merge conflict.

@jmillerpdt (Contributor, Author)

Yes, this is still desirable and it looks like the prerequisites should now be in place to allow forward progress. If you have time to review, I'll prioritize resolving the PR conflict over the coming week.

@jmillerpdt (Contributor, Author)

@FRidh Merge conflict resolved. Retested for cpu, cpu-mkl, and cuda-mkl on Linux.

This is ready for review.

In [1]: from tensorflow.python.framework.test_util import IsMklEnabled
In [2]: IsMklEnabled()
Out[2]: True

@Ericson2314 (Member)

I will apply this to both versions of tensorflow after the upgrade.

@Ericson2314 (Member)

Thanks for the rebase. Did you ever try TF_NEED_MKL to match how cuda does it instead of the flag? I am trying that now.

@jmillerpdt (Contributor, Author)

Thanks for the rebase. Did you ever try TF_NEED_MKL to match how cuda does it instead of the flag? I am trying that now.

No, I followed the instructions from here: https://www.tensorflow.org/install/source

I believe TF_NEED_MKL will download MKL, although I have not tried that approach.
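For reference, the upstream source-build instructions enable MKL through a bazel config flag rather than an environment variable; roughly (a sketch following the TensorFlow build docs, exact targets and answers will vary):

```shell
# Interactive configuration; the answers populate TF_* variables.
./configure
# MKL support is selected via a bazel --config flag at build time.
bazel build --config=mkl //tensorflow/tools/pip_package:build_pip_package
```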

@Ericson2314 (Member) commented Mar 31, 2020

I don't know why ofborg didn't run; this is definitely fine though. I changed the assert to use the -> boolean implication operator for clarity, and the rest looks good (especially as the env var approach is bad).
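For readers unfamiliar with the operator: Nix has a built-in boolean implication, where `a -> b` is equivalent to `!a || b`, and it reads well in assertions. A hypothetical example in the spirit of the change (identifier names are illustrative, not the actual diff):

```nix
# "If MKL support is requested, then an mkl package must be provided."
# Holds trivially whenever mklSupport is false.
assert mklSupport -> mkl != null;
```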

@jonringer (Contributor)

@Ericson2314 IIRC mkl has an unfree license

@jonringer (Contributor)

And recently ofborg has been very slow (10+ hours) evaluating PRs.

@FRidh (Member) commented Mar 31, 2020
