Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buildBazelPackage: poison all the fixed output derivations #80953

Merged
merged 8 commits into from Mar 3, 2020

Conversation

andir
Copy link
Member

@andir andir commented Feb 24, 2020

Motivation for this change

This is an attempt at solving a long annoyance with buildBazelPackage.

Currently each fixed output derivation can happily be used throughout many different bazel versions. That might be good as the fetched containt might not have changed betwen (minor) bazel version changes. The risk here is that we update bazel, rebuild everything but are not longer able to actually reproduce the fixed output. This is currently the case with the Tensorflow expression.

We did update TF and/or bazel and right now (on master) it still builds but if you try to reproduce the fixed output content it doesn't work anymore.

With this PR I attempt to mitigate some of the issues with those fetchers. Each of the fixed output derivations will contain the name string of the bazel package that was used to build them. On the consuming end within buildBazelPackage that version string is compared against the current bazel name. If those do not match the build fails. This will avoid silently loosing the ability to reproduce the fixed output paths (in some cases).

I also added the ability to override the Bazel package that should be used for each build. This will be use in some future work.

Currently the TF build does not produce a valid fixed output (anymore?). This will likely require another Bazel version or yet another patch to make it work with the newer Bazel version. I would prefer investing time into having another Bazel version within Nixpkgs.

Opinions?

Previous discussions on this topic were in #76851

Future work
  • We should verify that all the store paths that are being referenced from within the fixed output derivation are still valid. If they are not valid we should fail the build.
    This is feasible and very naïve but would fail basically the day (after) the updated derivation hit Nixpkgs as something along the dependency path has been changed and thus invalidated all the store paths.

  • Another line of work is to have multiple versions of bazel within nixpkgs. This avoid us having to patch upstream sources to work with our always changing version of Bazel. It shouldn't harm anyone if we support multiple Bazel versions. We are doing this with compilers and various other things already.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • [] Fits CONTRIBUTING.md.

@Profpatsch
Copy link
Member

The risk here is that we update bazel, rebuild everything but are not longer able to actually reproduce the fixed output.

Does this mean that if you try to build it instead of fetching it from binary cache, the output hashes don’t match?

* We should verify that all the store paths that are being referenced from within the fixed output derivation are still valid. If they are not valid we should fail the build.

I’m assuming “hashes still valid” means that if you rebuild a fixed-output drv, it produces the specified output hash.

If those do not match the build fails. This will avoid silently loosing the ability to reproduce the fixed output paths (in some cases).

So this is just a faster error path, because it disallows using a fixed-output package that was specified for a different bazel version?

@andir
Copy link
Member Author

andir commented Feb 24, 2020

The risk here is that we update bazel, rebuild everything but are not longer able to actually reproduce the fixed output.

Does this mean that if you try to build it instead of fetching it from binary cache, the output hashes don’t match?

Yeah, that is the exact case we are running into with TF right now. You would probably have to go back to Bazel 0.x (and the correct nixpkgs revision?) to reproduce the output.

* We should verify that all the store paths that are being referenced from within the fixed output derivation are still valid. If they are not valid we should fail the build.

I’m assuming “hashes still valid” means that if you rebuild a fixed-output drv, it produces the specified output hash.

Yes, currently we never check that since we are just happily fetching from the binary cache. There might be references to the store that were valid during "fixed output build time" (e.g. a year ago) but do not match with the currently available paths. This is hard to solve as e.g. bash might be referenced in the fixed output and that regularly changes whenever the the stdenv changes. That is also why I would like to disallow any store path references in the fixed output. I do not know how feasible that is tho…

If those do not match the build fails. This will avoid silently loosing the ability to reproduce the fixed output paths (in some cases).

So this is just a faster error path, because it disallows using a fixed-output package that was specified for a different bazel version?

Yes, exactly. We now get a fast error as soon as (the) one of the obvious issue(s) arises.

@uri-canva
Copy link
Contributor

This is hard to solve as e.g. bash might be referenced in the fixed output and that regularly changes whenever the the stdenv changes.

I thought buildBazelPackage already disallowed all references to store paths in the fixed output, since it sets allowedRequisites = []. At least I know it's screamed at me often when I had store paths in it.

@andir
Copy link
Member Author

andir commented Feb 24, 2020 via email

@uri-canva
Copy link
Contributor

Isn't this an issue with fixed output derivations in general? Since they check the output hash before checking whether the inputs are the same you can change the inputs and still get the old cached value. This affects more than just bazel version changes, for example other buildInputs changes would also be potentially ignored like that. I think this issue should be addressed in fixed output derivations in general, not worked around with a specific hack like this.

@andir
Copy link
Member Author

andir commented Feb 25, 2020 via email

@Profpatsch
Copy link
Member

Isn't this an issue with fixed output derivations in general?

Yes, it’s inherent in their design. The person using them has to guarantee referential transparency, which in this case means “running the fixed-output builder is always going to produce the same result, up to some networking issues”. You can kinda assume that for fetchurl (for reasonably stable URLs, but that’s why nixpkgs has a mirror cache for that).

This means the onus here is on the creator of buildBazelPackage, and they have done a bad job by mis-/abusing the power of fixed output derivations, sacrificing correctness for temporary gain (and other people’s lifetime …).

The right solution is having a generator which goes from WORKSPACE files (which contain the relevant dependency’s hashes!) to a nix file with fetchurl calls, like we do for the build of bazel itself.

@Profpatsch
Copy link
Member

@andir this is still a draft, do you still want to change something?

@andir
Copy link
Member Author

andir commented Feb 25, 2020 via email

@uri-canva
Copy link
Contributor

I see, that makes sense, thanks for the context. I believe https://docs.bazel.build/versions/2.0.0/command-line-reference.html#flag--repository_cache might be helpful here: the fixed output derivation currently contains all the external repositories, which contain a lot of files that aren't downloaded. That's why there's so much work the derivations that use buildBazelPackage have to do to delete directories from it. The repository cache on the other hand only contains downloaded files, and should be compatible across versions of both bazel and its dependencies. We can then point bazel to the repository cache we create in the fixed output derivation, and it should be able to recreate all of the external repositories without hitting the network (assuming all the repository rules use methods that support the repository cache to download files, for example that won't work for bazel projects that use rules_docker, since that has its own caching mechanism). See https://bazel.build/designs/2016/09/30/repository-cache.html for the design of it.

That should behave much more like we expect fixed output derivations to behave.

To be clear I'm not suggesting this be done instead of this PR, I understand this PR is also helpful in the short term and I don't mean to block it, just suggesting how this could be made better going forward.

@andir
Copy link
Member Author

andir commented Feb 25, 2020 via email

@uri-canva
Copy link
Contributor

I assume it hasn't been experimental for a while now, the doc refers to it as experimental, but that was back in 2016.

@Profpatsch
Copy link
Member

Profpatsch commented Feb 25, 2020

It would probably make sense to get Tensorflow working again (notice that commit with ^FIXME:). I will try to get an ancient bazel version reincarnated throughout the day.

Doesn’t need to be in this PR though, let’s merge it as is if possible

@Profpatsch
Copy link
Member

We can then point bazel to the repository cache we create in the fixed output derivation

What is the repository cache generated from? Iirc it doesn’t have to be from pinned stuff in the WORKSPACE file. In that case we have the same problem again, so I don’t think that’s a good solution.

We need a lockfile, similar to how all other modern language ecosystems handle it. There has been some work in that direction: https://blog.bazel.build/2018/09/28/first-class-resolved-file.html

@andir
Copy link
Member Author

andir commented Feb 25, 2020 via email

@mjlbach
Copy link
Contributor

mjlbach commented Feb 25, 2020

@andir sounds good :) see pr #81033. One thing to note is I had to add a patch to glibc to get bazel building. I also have tensorflow working with a glibc patch I will submit (https://github.com/mjlbach/nixpkgs/tree/tensorflow_1_compat) see here for the fix (still need to update the sha on non-cuda builds) mjlbach@a5cca6b. But this requires making the version of Bazel for buildBazel configurable, as currently I left bazel_2 as the default bazel package.

@mjlbach
Copy link
Contributor

mjlbach commented Feb 25, 2020

@andir Also see #81035 The one last piece is configuring buildbazel to build with either bazel_1 or bazel (bazel_2) which I have not yet had time to explore. comments on the PRs welcome.

@Profpatsch
Copy link
Member

One thing to note is I had to add a patch to glibc to get bazel building.

This sounds like it’s waaaay out of scope for this PR. Can we merge what we have?

@mjlbach
Copy link
Contributor

mjlbach commented Feb 26, 2020

Yes, sorry for the cross pollination from #81033

@andir andir marked this pull request as ready for review March 2, 2020 16:26
@andir
Copy link
Member Author

andir commented Mar 2, 2020

Ready for final review @Profpatsch @mjlbach

I had to add an older version of tensorflow-estimator as that had been updated to an incompatible version in #81278.

All bazel fixed output derivations should be specific to the bazel
version that was used to generate them. There is not guarantee that the
build will still succeed or reproduces (without the cached fixed output)
if the fetch phase wasn't rerun with a different bazel version.

In the past bazel had been bumped but not all those packages that have
fixed outputs from bazel builds. This lead to compiling and somewhat
working TF versions that couldn't be reproduced without the cached fixed
outputs.
TF 1.15 still needs an older version of the tensorflow-estimator
package.
@andir andir merged commit 846f300 into NixOS:master Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants