
pythonPackages.tensorflow: fix for bazel settings for intel mkl, dnnl #69454

Merged (1 commit, Mar 31, 2020)

Conversation

@jmillerpdt (Contributor) commented Sep 26, 2019

modified:   pkgs/development/python-modules/tensorflow/default.nix
Motivation for this change

The Nix TensorFlow package does not support optionally using Intel's MKL optimizations, which must be enabled through additional compile flags.

On modern Intel processors, this will generally provide improved performance and is recommended in Tensorflow's performance guide.

NOTE: This change depends on MKL-DNN, the open PR found here: #68014

I've tested the changes on Linux with the above-mentioned PR applied, both with and without Intel MKL enabled, to confirm it works in both cases. I do not have access to darwin for testing, but I believe that platform redirects to prebuilt binaries. I would appreciate it if someone could test with CUDA, as MKL may need to be disabled when cudaSupport is enabled (I can make the change, but cannot test that myself).

With MKL enabled, tensorflow.python.framework.test_util.IsMklEnabled() returns True.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.
Notify maintainers

cc @jyp @abbradar

@jmillerpdt changed the title from "tensorflow: bugfix for bazel settings (intel mkl, dnnl)" to "pythonPackages.tensorflow: fix for bazel settings for intel mkl, dnnl" on Sep 26, 2019
@jmillerpdt mentioned this pull request Oct 2, 2019
@timokau (Member) commented Oct 17, 2019

This is blocked on #68014, which has already been reviewed by people with commit access but for some reason not merged. I don't have time to review that right now.

This PR looks good to me, as long as mkl and dnnl are open source this should be a clear improvement :)

@timokau (Member) commented Oct 17, 2019

Although it would be nice if you could point out where exactly those guidelines are. You just linked to the benchmarks folder?

@jmillerpdt (Contributor, Author)

@timokau It looks like Google pulled that page after releasing Tensorflow 2.0 and is now redirecting it. It's possible to use the Wayback Machine to view the historical page, but here is Intel's page on the same topic. And, FWIW, it seems Anaconda now defaults to using MKL.

Having these libraries should generally improve performance for those using CPUs. While MKL-DNN (>= 1.x) would generally qualify under most definitions of free, the core MKL is shipped as a binary, so it would not be available in Nix by default and has to be enabled per the documentation. For many people, and for Hydra, this change will be a no-op. But for those focused on performance, it will be easily available, in a similar way to PyTorch and other libraries, once they enable MKL in their site configuration.

It looks like you have 1.15 and 2.0 PR updates in progress, so I'm fine with this being rolled into those (I just don't want it to get lost). Tensorflow 2.0 now enables CUDA by default, and I'm unsure of the interaction between MKL and CUDA, so it may be that MKL should only be enabled when CUDA is not. How you want to handle this depends on how you're planning to approach the "enabled by default" question for Tensorflow 2.0 in Nix.

@timokau (Member) commented Oct 18, 2019

Ah right, I was in a bit of a hurry and didn't realize that users have to opt in to mkl. Then the only downside I see here is that people opting in to mkl will have to rebuild tensorflow from source, which can take quite some time. It's probably a reasonable assumption that people who care enough about performance to enable mkl are willing to do this, but we should still give people who only want mkl for numpy (which doesn't take quite as long to rebuild) a way out.

So I suggest that instead of depending directly on numpy.blasImplementation, you add a new parameter mklSupport that defaults to numpy.blasImplementation == "mkl". That way people can still make an overlay and unconditionally enable or disable it.
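A minimal sketch of that suggestion against the tensorflow derivation (the mklSupport name and its default come from the suggestion itself; the surrounding argument list and the exact bazel flag are assumptions for illustration, not the actual diff):

```nix
{ lib, buildBazelPackage, numpy
  # New opt-in parameter; defaults to following numpy's BLAS choice,
  # but an overlay can override it unconditionally in either direction.
, mklSupport ? numpy.blasImplementation == "mkl"
, ... }:

buildBazelPackage {
  # ...
  # Hypothetical: pass the MKL build config to bazel only when requested.
  bazelFlags = lib.optionals mklSupport [ "--config=mkl" ];
}
```

An overlay could then set `mklSupport = false;` via `override` even when numpy is built with MKL, or force it on without touching numpy.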

Once that is done and the dnnl PR is merged, I see no reason to hold this back. It doesn't have to be merged with the 1.15 update; that update could be blocked for as long as upstream needs to adjust tensorflow for bazel 1.0. It's a mystery to me why bazel would make its first major release without making sure that at least its biggest public consumers don't break, but that is a different question.

I think there should be no bad interaction with cuda, although it would be nice if you could test that if you have the hardware. The "cuda enabled by default" question is a difficult one, which I'll postpone to the actual 2.0 update (keeping it disabled for now).

@FRidh (Member) commented Oct 19, 2019

So I suggest that instead of directly depending on numpy.blasImplementation, you instead add a new parameter mklSupport that defaults to numpy.blasImplementation == "mkl". That way people can still make an overlay and unconditionally enable or disable it.

I suppose tensorflow cannot work with MKL if numpy is built without it? And why would you build without MKL if numpy is already built with it? Note that one has to perform a rebuild anyway, because the hash of numpy changes when enabling MKL.

@timokau (Member) commented Oct 19, 2019

I suppose tensorflow cannot work with MKL if numpy is built without?

I'm not sure.

And why would you build without MKL if numpy is already built with it? Note that one has to perform a rebuild anyway, because the hash of numpy changes when enabling MKL.

Right, good point. So in that case this PR should be fine as-is, as soon as the dependency is merged.

@jmillerpdt (Contributor, Author)

I suppose tensorflow cannot work with MKL if numpy is built without?

I'm not sure.

You probably don't want to mix math libraries in your Python stack. If MKL is used, it should be applied consistently.

And why would you build without MKL if numpy is already built with it? Note that one has to perform a rebuild anyway, because the hash of numpy changes when enabling MKL.

Right, good point. So in that case this PR should be fine as-is, as soon as the dependency is merged.

I attempted to test the enableCuda branch, but it looks like the recent merge of Bazel 1.0 (#69252) now breaks tensorflow, since that version of Bazel is known to be incompatible.

You reference this in the WIP PRs, but I think it affects even the master branch.

@timokau (Member) commented Oct 20, 2019

I attempted to test the enableCuda branch, but it looks like the recent merge of Bazel 1.0 (#69252) now breaks tensorflow since that version of Bazel is known to be incompatible.

You reference this in the WIP PRs, but I think it affects even the master branch.

Right, we'll have to wait for upstream on this. Looks like someone is on it. Of course if you want to take a shot at fixing this yourself, feel free to :)

@FRidh (Member) commented Oct 22, 2019

Note there is a PR for Bazel 1.1: #71612

@timokau (Member) commented Oct 22, 2019

Bazel has promised to adhere to semver (and to leave at least 3 months between breaking changes), so that shouldn't make the situation worse than it is. I would have preferred blocking the 1.0 update on tensorflow compatibility, but now I hope that upstream support will come soon.

@flokli (Contributor) commented Oct 22, 2019

I'm also a bit surprised that tensorflow was caught off guard by that. Good that a fix is in sight.

On the other hand, I also want to note that bazel was merged to master, not release-19.09, and we call the channel nixos-unstable, so I'm personally comfortable with breaking some things if the result is something better; we aren't really blocking people who rely on stability.

@FRidh (Member) commented Feb 9, 2020

Is this still needed? Note there's a merge conflict.

@jmillerpdt (Contributor, Author)

Yes, this is still desirable and it looks like the prerequisites should now be in place to allow forward progress. If you have time to review, I'll prioritize resolving the PR conflict over the coming week.

@jmillerpdt (Contributor, Author)

@FRidh Merge conflict resolved. Retested for cpu, cpu-mkl, and cuda-mkl on Linux.

This is ready for review.

In [1]: from tensorflow.python.framework.test_util import IsMklEnabled
In [2]: IsMklEnabled()
Out[2]: True

@Ericson2314 (Member)

I will apply this to both versions of tensorflow after the upgrade.

@Ericson2314 (Member)

Thanks for the rebase. Did you ever try TF_NEED_MKL to match how cuda does it instead of the flag? I am trying that now.

@jmillerpdt (Contributor, Author)

Thanks for the rebase. Did you ever try TF_NEED_MKL to match how cuda does it instead of the flag? I am trying that now.

No, I followed the instructions from here: https://www.tensorflow.org/install/source

I believe TF_NEED_MKL will download MKL, although I have not tried that approach.
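For reference, the upstream source-build instructions enable MKL through a bazel config flag rather than an environment variable; roughly (a sketch following the TensorFlow build docs, exact targets and answers will vary):

```shell
# Interactive configuration; the answers populate TF_* variables.
./configure
# MKL support is selected via a bazel --config flag at build time.
bazel build --config=mkl //tensorflow/tools/pip_package:build_pip_package
```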

@Ericson2314 (Member) commented Mar 31, 2020

I don't know why ofborg didn't run; this is definitely fine though. I changed the assert to use the -> boolean implication operator for clarity, and the rest looks good (especially as the env var approach is bad).
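For readers unfamiliar with the operator: Nix has a built-in boolean implication, where `a -> b` is equivalent to `!a || b`, and it reads well in assertions. A hypothetical example in the spirit of the change (identifier names are illustrative, not the actual diff):

```nix
# "If MKL support is requested, then an mkl package must be provided."
# Holds trivially whenever mklSupport is false.
assert mklSupport -> mkl != null;
```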

@jonringer (Contributor)

@Ericson2314 IIRC mkl has an unfree license

@jonringer (Contributor)

And recently ofborg has been very slow (10+ hours) evaluating PRs.

@FRidh (Member) commented Mar 31, 2020
