buildBazelPackage: poison all the fixed output derivations #80953
Conversation
Does this mean that if you try to build it instead of fetching it from binary cache, the output hashes don’t match?
I’m assuming “hashes still valid” means that if you rebuild a fixed-output drv, it produces the specified output hash.
So this is just a faster error path, because it disallows using a fixed-output package that was specified for a different bazel version? |
Yeah, that is the exact case we are running into with TF right now. You would probably have to go back to Bazel 0.x (and the correct nixpkgs revision?) to reproduce the output.
Yes, currently we never check that since we are just happily fetching from the binary cache. There might be references to the store that were valid during "fixed output build time" (e.g. a year ago) but do not match with the currently available paths. This is hard to solve as e.g. bash might be referenced in the fixed output and that regularly changes whenever the stdenv changes. That is also why I would like to disallow any store path references in the fixed output. I do not know how feasible that is though…
Yes, exactly. We now get a fast error as soon as one of the obvious issues arises. |
I thought `buildBazelPackage` already disallowed all references to store paths in the fixed output, since it sets `allowedRequisites = []`. At least I know it's screamed at me often when I had store paths in it. |
On 13:04 24.02.20, Uri Baghin wrote:
> This is hard to solve as e.g. bash might be referenced in the fixed
> output and that regularly changes whenever the stdenv changes.
I thought `buildBazelPackage` already disallowed all references to
store paths in the fixed output, since it sets `allowedRequisites =
[]`. At least I know it's screamed at me often when I had store paths
in it.
You are probably right. I confused that with some other behaviour that I
was looking at. Then the situation isn't as bad as I expected. Less work
to be done. :-)
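The `allowedRequisites = []` check mentioned above is enforced by Nix itself, but its effect can be illustrated with a simple scan for store path references. This is only a toy sketch; the paths and file names below are made up for the demo:

```shell
# Toy illustration of what `allowedRequisites = []` guards against:
# a fixed output must not reference anything in /nix/store.
fo_out=fixed-output-demo
mkdir -p "$fo_out"
# Simulate a generated file that captured a store path at fetch time.
printf '#!/nix/store/zzzz-bash-5.0/bin/bash\necho hi\n' > "$fo_out/script.sh"

# A recursive scan finds the embedded store path and flags it.
if grep -rq "/nix/store/" "$fo_out"; then
  echo "forbidden store path reference found"
fi
```

In the real mechanism Nix computes the closure of the output's references and compares it against the whitelist, so even indirect references are caught.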
|
Isn't this an issue with fixed output derivations in general? Since they check the output hash before checking whether the inputs are the same you can change the inputs and still get the old cached value. This affects more than just bazel version changes, for example other `buildInputs` changes would also be potentially ignored like that. I think this issue should be addressed in fixed output derivations in general, not worked around with a specific hack like this. |
(This turned out to be a longer response than initially anticipated.
Just thought giving more context to my thoughts is probably not a bad
idea.)
On 16:09 24.02.20, Uri Baghin wrote:
Isn't this an issue with fixed output derivations in general? Since
they check the output hash before checking whether the inputs are the
same you can change the inputs and still get the old cached value.
Yes that is true. Do you have an alternative approach to implementing
this? Keep in mind that all the usual fetchers used within Nixpkgs are
also just fixed output derivations. There we do not care what version of
curl was used. We only care about the resulting hash being the same.
Fixed output derivations being abused like they are in
`buildBazelPackage` and `buildRustPackage` is the issue here. They
shouldn't be used to generate code. They should just fetch something and
be done.
This affects more than just bazel version changes, for example other
`buildInputs` changes would also be potentially ignored like that. I
think this issue should be addressed in fixed output derivations in
general, not worked around with a specific hack like this.
Yes that is an issue. IMO the entire "fixed output" fetching using bazel
(or cargo or whatever) is a hack. They do not operate on a single URL
that they fetch but they fetch a complex tree of directories (with
generated files?).
As I tried to explain in the PR description tracking all the inputs can
go both ways. It can either mean that whenever someone changes stdenv
you'll have to update all the fixed output hashes. This yields a great
amount of useless work nobody wants to do. The other extreme is the
current situation (before this PR) where nothing actually invalidates
the fixed outputs and thus we continue using stuff that isn't
reproducible anymore.
In the past we took a similar route (to this PR) within the
`buildRustPackage` expression. There `cargo vendor` is used to download
all the dependencies before the build is executed. In contrast to what
bazel seems to be doing, `cargo vendor` is just a fancy way to iterate
through URLs and write those files to a folder.
In the rust world we are comparing the lockfile that is part of the
fixed output to the lockfile that the sources ship:
https://github.com/NixOS/nixpkgs/blob/master/pkgs/build-support/rust/default.nix#L118
While in the rust ecosystem the way to fetch things is trivial (just
invoke `cargo vendor`) and gives you the same outputs every time, the
same doesn't seem to be true for Bazel, since you at least pass a few
arguments to the specific bazel version.
I am not trying to solve the entire problem space here. I am trying to
make our lives better by not having silently rotting expressions in
nixpkgs. I lost many days of my life by debugging why tensorflow
wouldn't build anymore. It doesn't help if the fixed outputs aren't
reproducible and you only notice that a year later.
The build system should at least yell at you if you generate them from
another bazel version. Arguably they should probably also bark at you if
the `src` attribute of the package changes.
This PR is about finding an intermediate solution. If you have a great
idea to fix all of the fixed output pains I am all ears and I am very
happy to look into it. For the time being something like this is IMHO
eagerly needed.
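The lockfile comparison described above can be sketched like this. This is a simplified model, not the actual code from the linked `default.nix`; the file names are illustrative:

```shell
# Simplified model of buildRustPackage's lockfile check: the Cargo.lock
# baked into the vendored fixed output must match the one in the sources.
# If the sources changed but the fixed-output hash was not regenerated,
# the two lockfiles diverge and we can fail early.
printf 'checksum-a\n' > source-Cargo.lock
printf 'checksum-b\n' > vendored-Cargo.lock   # e.g. a stale cargoSha256

if ! cmp -s source-Cargo.lock vendored-Cargo.lock; then
  echo "ERROR: lockfiles differ, the vendored fixed output is stale"
fi
```

The point of the check is exactly the fast error path discussed in this PR: a stale fixed output is detected at build time instead of silently reused from the cache.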
|
Yes, it’s inherent in their design. The person using them has to guarantee referential transparency, which in this case means “running the fixed-output builder is always going to produce the same result, up to some networking issues”. You can kinda assume that for fetchurl (for reasonably stable URLs, and that’s why nixpkgs has a mirror cache for them). This means the onus here is on the creator of … The right solution is having a generator which goes from … |
@andir this is still a draft, do you still want to change something? |
On 03:15 25.02.20, Profpatsch wrote:
@andir this is still a draft, do you still want to change something?
It would probably make sense to get Tensorflow working again (notice
that commit with ^FIXME:).
I will try to get an ancient bazel version reincarnated throughout the
day.
|
I see, that makes sense, thanks for the context. I believe https://docs.bazel.build/versions/2.0.0/command-line-reference.html#flag--repository_cache might be helpful here: the fixed output derivation currently contains all the external repositories, which contain a lot of files that aren't downloaded. That's why there's so much work the derivations that use `buildBazelPackage` have to do to delete directories from it. The repository cache on the other hand only contains downloaded files, and should be compatible across versions of both bazel and its dependencies. We can then point bazel to the repository cache we create in the fixed output derivation, and it should be able to recreate all of the external repositories without hitting the network (assuming all the repository rules use methods that support the repository cache to download files; for example that won't work for bazel projects that use `rules_docker`, since that has its own caching mechanism). See https://bazel.build/designs/2016/09/30/repository-cache.html for the design of it. That should behave much more like we expect fixed output derivations to behave. To be clear I'm not suggesting this be done instead of this PR, I understand this PR is also helpful in the short term and I don't mean to block it, just suggesting how this could be made better going forward. |
On 03:40 25.02.20, Uri Baghin wrote:
I see, that makes sense, thanks for the context. I believe
https://docs.bazel.build/versions/2.0.0/command-line-reference.html#flag--repository_cache
might be helpful here: the fixed output derivation currently contains
all the external repositories, which contain a lot of files that
aren't downloaded. That's why there's so much work the derivations
that use `buildBazelPackage` have to do to delete directories from it.
The repository cache on the other hand only contains downloaded files,
and should be compatible across versions of both bazel and its
dependencies. We can then point bazel to the repository cache we
create in the fixed output derivation, and it should be able to
recreate all of the external repositories without hitting the network
(assuming all the repository rules use methods that support the
repository cache to download files, for example that won't work for
bazel projects that use `rules_docker`, since that has its own caching
mechanism). See
https://bazel.build/designs/2016/09/30/repository-cache.html for the
design of it.
That should behave much more like we expect fixed output derivations
to behave.
Yeah, that really looks like a good alternative approach. Looking at the
documentation of older bazel commands makes me feel confident that we
could just use that across all versions (for now). What's the policy for
experimental commands in Bazel? Can they go away at any (major) version
change? I wouldn't want to put in time for that if that goes away in a
version or two without a proper replacement.
To be clear I'm not suggesting this be done instead of this PR, I
understand this PR is also helpful in the short term and I don't mean
to block it, just suggesting how this could be made better going
forward.
Understood. I think the way you pointed out is the right way forward in
the long term.
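As a rough model of why the repository cache is more robust: per the linked design document it is content-addressed by the expected hash of each download, so it is independent of the bazel version that populated it. The directory layout below is simplified for illustration:

```shell
# Toy model of bazel's content-addressed repository cache.
cache=repo-cache-demo
mkdir -p "$cache/content_addressable/sha256"

# "Download" an archive once and store it under its sha256.
blob='example archive contents'
key=$(printf '%s' "$blob" | sha256sum | cut -d' ' -f1)
mkdir -p "$cache/content_addressable/sha256/$key"
printf '%s' "$blob" > "$cache/content_addressable/sha256/$key/file"

# A later fetch (by any bazel version) asking for the same hash
# can be served from the cache without touching the network.
if [ -f "$cache/content_addressable/sha256/$key/file" ]; then
  echo "cache hit"
fi
```

Because only downloaded artifacts land in the cache, none of the bazel-generated repository files (which vary across bazel versions) end up in the fixed output.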
|
I assume it hasn't been experimental for a while now, the doc refers to it as experimental, but that was back in 2016. |
Doesn’t need to be in this PR though, let’s merge it as is if possible |
What is the repository cache generated from? Iirc it doesn’t have to be from pinned stuff in the WORKSPACE file. In that case we have the same problem again, so I don’t think that’s a good solution. We need a lockfile, similar to how all other modern language ecosystems handle it. There has been some work in that direction: https://blog.bazel.build/2018/09/28/first-class-resolved-file.html |
I will try to get an ancient bazel version reincarnated throughout the
day.
Just saw the work that @mjlbach did on his fork trying to get bazel_1 to
work additionally (https://github.com/mjlbach/nixpkgs/commits/bazel_1).
I would recommend we join forces here and land multiple bazel versions
before we merge this PR.
@mjlbach: Let me know how we can collaborate on this. Your changes look
good already. What is still missing?
|
@andir sounds good :) see pr #81033. One thing to note is I had to add a patch to glibc to get bazel building. I also have tensorflow working with a glibc patch I will submit (https://github.com/mjlbach/nixpkgs/tree/tensorflow_1_compat) see here for the fix (still need to update the sha on non-cuda builds) mjlbach@a5cca6b. But this requires making the version of Bazel for buildBazel configurable, as currently I left bazel_2 as the default bazel package. |
This sounds like it’s waaaay out of scope for this PR. Can we merge what we have? |
Yes, sorry for the cross pollination from #81033 |
Ready for final review @Profpatsch @mjlbach I had to add an older version of tensorflow-estimator as that had been updated to an incompatible version in #81278. |
All bazel fixed output derivations should be specific to the bazel version that was used to generate them. There is no guarantee that the build will still succeed or reproduce (without the cached fixed output) if the fetch phase isn't rerun whenever the bazel version changes. In the past bazel has been bumped without also bumping all the packages that have fixed outputs produced by bazel builds. This led to compiling and somewhat working TF versions that couldn't be reproduced without the cached fixed outputs.
TF 1.15 still needs an older version of the tensorflow-estimator package.
Motivation for this change
This is an attempt at solving a long-standing annoyance with `buildBazelPackage`. Currently each fixed output derivation can happily be used across many different bazel versions. That might be fine as the fetched content might not have changed between (minor) bazel version changes. The risk here is that we update bazel, rebuild everything, but are no longer able to actually reproduce the fixed output. This is currently the case with the Tensorflow expression.
We did update TF and/or bazel, and right now (on master) it still builds, but if you try to reproduce the fixed output content it doesn't work anymore.
With this PR I attempt to mitigate some of the issues with those fetchers. Each of the fixed output derivations will contain the `name` string of the `bazel` package that was used to build them. On the consuming end within `buildBazelPackage` that version string is compared against the current bazel name. If those do not match, the build fails. This will avoid silently losing the ability to reproduce the fixed output paths (in some cases).

I also added the ability to override the Bazel package that should be used for each build. This will be used in some future work.
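The version check could look roughly like this in the consuming build phase. The marker file name and variable names below are illustrative, not the exact nixpkgs implementation:

```shell
# Hypothetical sketch of the "poisoning" check in buildBazelPackage.
deps=deps-demo                  # the unpacked fixed-output dependency tree
current_bazel_name="bazel-3.1.0"

mkdir -p "$deps"
# The fetch phase records the name of the bazel package that produced it.
echo "bazel-2.0.0" > "$deps/.bazel-version-name"

recorded=$(cat "$deps/.bazel-version-name")
if [ "$recorded" != "$current_bazel_name" ]; then
  echo "deps were fetched with $recorded but building with $current_bazel_name" >&2
  echo "poisoned: refusing to build"
fi
```

Since the marker is part of the fixed output, changing the bazel package changes the output hash, which is exactly what forces the fetch to be redone.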
Currently the TF build does not produce a valid fixed output (anymore?). This will likely require another Bazel version or yet another patch to make it work with the newer Bazel version. I would prefer investing time into having another Bazel version within Nixpkgs.
Opinions?
Previous discussions on this topic were in #76851
Future work
We should verify that all the store paths that are being referenced from within the fixed output derivation are still valid. If they are not valid we should fail the build.
This is feasible but very naïve: it would fail basically the day (after) the updated derivation hit Nixpkgs, as something along the dependency path would have changed and thus invalidated all the store paths.
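A naïve version of that validity check (which, as cautioned above, would start failing almost immediately) might look like this; the paths and file names are illustrative:

```shell
# List store paths referenced by the fixed output and check they still exist.
refs_out=deps-demo-refs
mkdir -p "$refs_out"
# In reality the references would be extracted by scanning the output.
echo "/nix/store/aaaaaaaaaaaaaaaa-bash-5.0" > "$refs_out/references"

while read -r path; do
  if [ ! -e "$path" ]; then
    echo "stale reference: $path"
  fi
done < "$refs_out/references"
```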
Another line of work is to have multiple versions of bazel within nixpkgs. This avoids us having to patch upstream sources to work with our ever-changing version of Bazel. It shouldn't harm anyone if we support multiple Bazel versions. We are doing this with compilers and various other things already.
Things done
- Tested using sandboxing (`sandbox` in `nix.conf` on non-NixOS linux)
- Tested compilation of all pkgs that depend on this change using `nix-shell -p nixpkgs-review --run "nixpkgs-review wip"`
- Tested execution of all binary files (usually in `./result/bin/`)
- Determined the impact on package closure size (by running `nix path-info -S` before and after)