
[RFC 0084] Input-aware fetchers #84

Closed
wants to merge 2 commits

Conversation

@infinisil (Member) commented Dec 30, 2020

Change the name of the main fixed-output derivations to contain a hash of all inputs that could affect the output hash, such that if any of them change, a previously-fetched path with the same hash won't be reused. This resolves the long-standing confusion about Nix using an old version of a source even though the url it should be fetched from was updated.

Rendered

@grahamc (Member) commented Dec 30, 2020 via email

@7c6f434c (Member)

Isn't it simpler to essentially (finally!) revert NixOS/nixpkgs@c3255fe?

I think in practice the basename of the URL is almost always good enough. It is also readable, unlike the proposal, and can be slightly tuned to be less dependent on the fetching method. Of course, when something better is needed, an explicit name evaluated to include the version also works…

@holymonson

Great idea, here are my 2 cents:

  1. Should we consider base36 or base32 instead of base64, with regard to case-insensitive filesystems?
  2. Pro truncating the hash: 6 or 8 digits are enough; we only need it to indicate that the output may have changed and trigger a refetch.
  3. Pro appending the filename: friendlier to humans, at almost no cost.

@infinisil (Member Author)

@7c6f434c Yeah that would be another way. But as you mentioned, it only works when the basename of the URL is the part that changes, which doesn't have to be the case. This means that in some cases, people will still have to change the hash to something invalid to force a fetch. The goal of this RFC is that people can expect to never have to do that.

A minor problem is also that the basename of the URL can include query parameters, which have to be escaped or removed properly (see NixOS/nixpkgs#107515). Not impossible, but another non-trivial case.

By using a fixed-size hash we can be sure that no matter how many inputs can influence the result, any change of any of them leads to a rebuild.
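
A minimal sketch of the mechanism in Nix, with assumed names and a plain hex hash for brevity (the RFC itself specifies a truncated, URL-safe base64 encoding of the input string):

```
# Sketch only: a wrapper that makes the store name depend on the URL.
{ fetchurl }:
{ url, sha256 }:
let
  inputString = "fetchurl-" + url;                       # per the RFC's fetcher table
  inputHash = builtins.hashString "sha256" inputString;  # hex here; the RFC uses truncated base64
in
fetchurl {
  # A URL change now changes the name, hence the store path, and forces a refetch
  # even though the declared output hash stays the same.
  name = "source-${builtins.substring 0 8 inputHash}";
  inherit url sha256;
}
```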

Comment on lines +131 to +134
[alternatives]: #alternatives

- The trivial alternative of keeping the status quo. This leads to the common beginner problem described in the motivation section.
- Adding a special `outputHashInputs` attribute to the Nix `derivation` primitive, which can be set to any attributes. These attributes then influence Nix's store path hash directly, without the need for using the derivation name as a hash influencer. This could be a much nicer solution, but is a much more in-depth change to how Nix works.
@infinisil (Member Author)

I'd like to highlight this alternative, which I haven't considered a whole lot. Implementing it is a bit harder, but doing so would allow us to have the best of both worlds: arbitrary inputs that cause a rebuild, but no increase in store path size. (<- @holymonson)

(Member)

I feel like this could be a very good alternative. This would also create a more obvious “disable if Nix is too old” situation for nixpkgs.

@7c6f434c (Member) commented Dec 30, 2020 via email

| Fetchers | Input string |
| --- | --- |
| `pkgs.fetchurl`, `<nix/fetchurl.nix>`, `builtins.fetchurl`, `nix-prefetch-url` | "fetchurl-" + first URL |
| `pkgs.fetchzip`, `<nix/fetchurl.nix>` (with `unpack = true`), `builtins.fetchTarball`, `nix-prefetch-url --unpack` | "fetchurl-unpack-" + URL |
| `pkgs.fetchgit`, `builtins.fetchGit` and co. | "fetchgit-" + url + "-" + rev |
(Member)

pkgs.fetchgit would also need to have deepClone, leaveDotGit and fetchSubmodules included, wouldn't it?

@infinisil (Member Author)

Oh yeah, though we'll have to make sure that it harmonizes with builtins.fetchGit, which doesn't support deepClone and leaveDotGit. So it would probably have to be something like

```
"fetchgit-${optionalString deepClone "deepclone-"}${optionalString leaveDotGit "leavedotgit-"}${optionalString fetchSubmodules "fetchsubmodules-"}..."
```

@lheckemann (Member)

Another option, though it requires mass changes in nixpkgs and doesn't address all cases, would be mandating a name for fetchurl and introducing the convention of setting it to `${pname}-${version}-src` or similar. I don't think this is a great idea, but I'd prefer a hybrid approach (use the URL basename, require an explicit name if that isn't a valid store path name) over the RFC's current proposal.
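
A sketch of that convention, with made-up package values just to illustrate the shape:

```
let
  pname = "hello";
  version = "2.10";
in
fetchurl {
  # Explicit, human-readable store name carrying the version, so a version
  # bump alone already yields a new store path and therefore a refetch.
  name = "${pname}-${version}-src";
  url = "mirror://gnu/hello/hello-${version}.tar.gz";
  sha256 = "0000000000000000000000000000000000000000000000000000"; # placeholder
}
```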

```
[nix-shell:~]$ echo -n 'example string' | openssl dgst -sha256 -binary | openssl base64 -A | cut -b1-42 | tr +/ -_
rt-5KzBTohoRT08wGgKjxq1d_1BNEk3CzuYRdiPuxw
```
(Member)

For most sources, the pname and version attributes are probably enough to invalidate the inputs. One of the benefits is that those could be included in the name, making the store path more readable.

Another benefit of using pname and version is that it allows inverting the control. Historically the pname is first set on the package derivation and then inherited by the source. In cases where we have a package generator function, as with most languages, the pname could be set on the source and then inherited by the generator function.
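
A rough sketch of that inversion of control (the attribute flow here is assumed, not how current generator functions are actually wired):

```
let
  # pname and version are declared once, on the source…
  src = fetchurl {
    pname = "example";
    version = "1.2.3";
    url = "https://example.org/example-1.2.3.tar.gz";
    sha256 = "0000000000000000000000000000000000000000000000000000"; # placeholder
  };
in
# …and the package generator function inherits them from the source.
buildPythonPackage {
  inherit (src) pname version;
  inherit src;
}
```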

(Reply)

I think this RFC is mainly for minority cases, such as a git tarball with a revision, or a patch with patchutils arguments, with no pname nor version. As for normal sources, a new pname and version will generate a new name, which already triggers a refetch today. (Store::makeFixedOutputPath takes the name as a parameter and generates a different derivation path.)

We may want something other than the hash or name to be able to affect the fixed output path.

(Member)

@holymonson if I'm interpreting zimbatm's comment right, he's suggesting that name/pname/version be moved into the source rather than the derivation that builds the source.

| Fetchers | Input string |
| --- | --- |
| `pkgs.fetchurl`, `<nix/fetchurl.nix>`, `builtins.fetchurl`, `nix-prefetch-url` | "fetchurl-" + first URL |
| `pkgs.fetchzip`, `<nix/fetchurl.nix>` (with `unpack = true`), `builtins.fetchTarball`, `nix-prefetch-url --unpack` | "fetchurl-unpack-" + URL |
| `pkgs.fetchgit`, `builtins.fetchGit` and co. | "fetchgit-" + url + "-" + rev |

(Member)

Is it worth talking about fetchers that download a bunch of things, like the Rust cargoSha256 and Go vendorSha256? They are a bit different in nature than the above-mentioned fetchers.

One trick we did with Cargo is to add the Cargo.lock in the derivation itself, and then compare it in the main derivation to make sure they are the same. This was done as a workaround for the issue mentioned in this RFC. So if a good solution is found, we could remove that hack.

@edolstra changed the title from "Input-aware fetchers" to "[RFC 0084] Input-aware fetchers" on Jan 4, 2021
@edolstra (Member) commented Jan 4, 2021

Not in favor of this. It goes the wrong way, namely away from content-addressability. (In CA Nix, ideally we wouldn't have a name attribute...) It duplicates source tarballs that could otherwise be deduplicated. This hurts long-term reproducibility from source, since it creates the possibility that a source tarball exists in the binary cache, but not under the right name.

Also, including a hash of the URL in the name negates the point of fixed-output derivations, which is to ensure that you can update a URL without causing a global rebuild.

@7c6f434c (Member) commented Jan 4, 2021 via email

@sternenseemann (Member)

I feel like the problem is that there are situations in which you want Nix to be input-aware and other situations where you specifically don't want it. E.g. when working in your local nixpkgs checkout you definitely want to rebuild fixed-output derivations with changed inputs, but when rebuilding your system or building packages on Hydra you don't want it at all.

I'm not sure whether there is a proper solution which preserves both use cases, though.

@garbas (Member) commented Jan 5, 2021

I find the reasoning (and the problem it tries to solve) for this RFC to be spot on. Many times I have seen newcomers and even more experienced Nix users spend hours debugging because they forgot to update the hash as well as the URL.

I understand why this is done the way it is, and it becomes even more obvious with CA, but I wonder what a solution to this UX problem would look like, whether there is even a solution that would make everybody happy, or whether we have to think about the trade-offs we want to make.

I don't really have a proper solution; I only want to voice the opinion that this is a common problem which most likely every newcomer will experience.

@aij commented Jan 6, 2021

I've been bitten by this too, but changing the output path for fetchers seems like an odd workaround that only works for some instances of this problem. The first time I encountered a changed URL with the same expected SHA not being re-fetched, I too was confused at first, but as soon as I figured out what was going on it seemed obvious that it was not a bug in Nix but in my understanding. Of course if we already have a file with the same SHA, there's no reason to re-download it for a normal build. Normally, we do want to use things that are already in the nix store, but there are times when we don't.

So, the more general problem is that we sometimes do want to force something to be re-built.

Some examples where we don't want to use the existing output that's already in the store:

  1. Download a file from a new URL.
  2. Download a file from the same URL (e.g. to check that it is still available and still serving the expected file).
  3. Re-run a test that previously passed (e.g. when debugging flaky tests).
  4. Re-build a package that is already installed (e.g. for checking reproducibility / timing / different nix-config options).

@infinisil (Member Author)

> Agreed, mirror list should be updateable without rebuilds. As is, it risks (depending on implementation details) causing a full rebuild on a mirror://gnu/ mirror list update or something.

Actually that's no problem, because the mirrors for mirror:// urls won't be hashed. It's only the mirror:// url itself that is hashed.

Also, with such a feature we could easily implement a fetchurl like

```
fetchurl {
  url = "https://example.com/original-source";
  alternateUrls = [ "https://foo.com/other-source" ];
}
```

And we'd only let url influence the hash, but not alternateUrls. In the implementation, curl would try the url first, but fall back to alternateUrls if that failed. While this isn't perfect, it's an easy workaround for preventing global rebuilds if necessary.


We do need to keep in mind though: Changing the URL and expecting the hash to not change is a very rare use case. I don't think we should let such a proposal be rejected because of this. My guesstimate is that in at least 99.9999% of cases, changing the URL is also expected to change the hash. By making this common case have a much better user experience with such a proposal, we can save a lot of time and frustration, especially for beginners. In my opinion the benefits of this vastly overshadow the disadvantages (which can be worked around, see above).

@7c6f434c (Member) commented Jan 6, 2021 via email


| Fetchers | Input string |
| --- | --- |
| `pkgs.fetchurl`, `<nix/fetchurl.nix>`, `builtins.fetchurl`, `nix-prefetch-url` | "fetchurl-" + first URL |
@infinisil (Member Author)

@7c6f434c Oh, and actually I already specified here that only the first URL of multiple should be used in the hash, so there's no need for additionalUrls; the existing urls list can do that already.

@Ericson2314 (Member)

Agree with @edolstra: this is tempting, and yes, I've been bitten too, but this is absolutely the wrong direction. We must find a better solution. As they say in Haskell land, "avoid success-at-all-costs" (added the dashes to disambiguate; we don't need the joke here).

In NixOS/nix#3635 I make it so we can just use the git hash, and don't need a nar hash. Doesn't that solve the problem nicely without these downsides?

@infinisil (Member Author)

@Ericson2314 That's awesome, but it only works for content-addressed data, so not with most fetchurl, fetchzip, and co. calls.

@Ericson2314 (Member)

@infinisil it works fine for the fetchZip behind fetchFromGitHub. I also think git covers the majority of the frequently changing sources. The new users getting hit by this the most are probably working with their own code too, which is also more likely to be in git.

There are a number of reasons why post-tarball and post-compression hashing is bad, too. See https://www.tweag.io/blog/2020-06-18-software-heritage/ for example. Keeping that case worse is not ideal, but it has a silver lining of pushing people in a good direction.

@Ericson2314 (Member)

Another idea is that if the narinfo for the build we found has a deriver field, and it differs "sufficiently little" from the fixed-output derivation we're trying to build, we can issue a warning. This will have false positives, though.

@infinisil (Member Author)

Recently on IRC I discussed Nix hashing with @adisbladis, where we concluded that if we want package managers to interact nicely with Nix, we need a standard directory hashing scheme. Researching that, I stumbled upon dirhash, which is exactly that.

While writing this RFC I didn't make the connection, but now I can't deny it. Notably this dirhash standard allows you to customize the hash function: It lets you configure which files should be ignored, how to treat recursive symlinks, and more. It also points out that you can extend the standard with more such properties.

In our case, we could see the URL and any other relevant inputs as exactly that. If Nix supported such a hashing standard, we would e.g. specify output hashes like

```
{
  outputHash = {
    mode = "dirhash";
    version = "0.1.0";
    algorithm = "sha256";
    hash = "...";
    protocol.nixInputs.url = url;
    # ...
  };
}
```

Now when Nix wants to check whether a certain hash exists already, it can:

  • Check if hash exists already
  • If not, build the derivation and hash the result according to the outputHash properties. Store these properties in the database.
  • If it does exist already, check that the given properties match the ones in the database. If they don't, build the derivation to check that the hash matches the existing one.

The store path should notably only be derived from the hash, not the properties. Meaning that changing e.g. nixInputs.url doesn't cause a mass rebuild.

I can't see anything obviously wrong with this, and it doesn't have any of the previously discussed downsides. This would also pave the way for better package manager integration with a directory hashing standard.

Ping @regnat as well

@Ericson2314 (Member)

@infinisil I would consider the rest of the properties a separate sort of derivation, in a way. Instead of saying exactly "how to build" something, they are making a claim about what the data represents. From a substitutee's standpoint, this is no worse, because we are choosing to trust the substituter anyway, so whether the claim is "how" or "why" doesn't matter much.

I think the concept of "alt derivations" is good in general. Another option is "merkle inclusion proofs". Since the world at large prefers tags and commit hashes to tree hashes, it's nice not to specify the tree hash (which may not be as well known) directly. But if we specify something like "give me the tree from this commit" or even "give me the tree from this ancestor of this commit", rather than downloading all the history, we can just download a "spine" of nodes and siblings which proves that the tree hash at the bottom is included in the original hash. This trustlessly lets us use fewer hashes, which is very nice.

@terlar commented Jan 7, 2021

How would this affect someone using a GitHub copy-on-read mirror via Artifactory? At my company we need to be able to guarantee that things keep working even if GitHub or something similar goes down, or potentially if the sources get renamed or removed (e.g. I ran into this when using fetchFromGitHub with hercules-ci/gitignore.nix when it was renamed from hercules-ci/gitignore). For that purpose we have used the following overlay:

```
final: prev:
{
  fetchFromGitHub = { ... }@args:
    prev.fetchFromGitHub ({
      githubBase = "artifactory.path/artifactory/r-generic-github";
    } // args);
}
```

What would happen in this case? Would all the sources be downloaded again, would it re-trigger builds?

@Ericson2314 (Member)

@terlar Yes! This is a great demo of why content addressing is good, and why making the source authoritative, when there is also an authoritative content address, is bad. The RFC as written would get stuck, but @infinisil's last comment does look more promising.

(Relatedly, I think it might be good to allow not only a single fixed-output derivation, but a set of them providing the same path, to be jointly depended upon, to basically make mirroring/fallbacks a first-class concept in Nix at the store level.)

@infinisil (Member Author)

I guess for fetchGit, fetchFromGitHub and co. we'd want only the Git hash to influence the store path, not the url, so this doesn't become a problem.

But yeah I think I'll either change this RFC to the proposal in #84 (comment) or create a new one for that. I think that's a much better approach than the one described here.

```
/nix/store/l98gjfznp8lpxi0hvj4i0rw34xnnqma8-source
```

This can't be right! Of course, people with Nix knowledge will tell you that you need to change the hash to something invalid first in order for it to attempt a refetch. This is because Nix already has a path with that exact output `sha256` hash, so it just reuses that. Indeed, changing the output hash to something invalid works:
(Reply)

> Of course, people with Nix knowledge will tell you that you need to change the hash to something invalid first in order for it to attempt a refetch.

I'm sorry to hear that this is encouraged as a good practice. Shouldn't people be told instead to use prefetchers whenever updating a source, rather than wastefully downloading the source twice just to get a hash mismatch the first time?

(I concede that I do the invalid-hash procedure myself for Rust packages, for which using a prefetcher would have been beyond my understanding of the Rust packaging functions, but I always use prefetchers for URL sources and Git sources.)

@infinisil (Member Author)

It won't download it twice, since it stores the result under the store path for the calculated hash the first time.

@edolstra (Member) commented Jan 8, 2021

> we need a standard directory hashing scheme

We have it, namely the NAR hash.

@thufschmitt (Member)

Another (future) possible solution to the issue is to treat fixed-output derivations as a special case of content-addressed derivations (as defined in #66) where we assert the expected output hash. The key difference from the way we handle them right now is that a change in the derivation would cause it to be rebuilt (like we do for input-addressed derivations), but we would stop the rebuild right after that (or error out if the new build doesn't have the expected hash). So we'd have slightly more rebuilds than currently (because we would rebuild fixed-output derivations when their definition changes), but not too many (because we still wouldn't rebuild whatever depends on them); and that would solve the problem at hand, as we would be able to notice every hash mismatch.

@infinisil (Member Author)

Closing this as I don't think the proposed solution here is a good idea. If I have enough time I'll soon write a new RFC with the idea from #84 (comment) (and actually, it doesn't have to involve dirhash at all)

@infinisil closed this Jan 10, 2021
@infinisil deleted the input-aware-fetchers branch on January 10, 2021
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/tweag-nix-dev-update-6/11195/1

@Ekleog (Member) commented Mar 23, 2021

@infinisil I just had a thought that might reuse this approach without (I think) its drawbacks: what if we add an extraHashedData attrset that gets added into the fixed-output hash, and have it default to the output-affecting parameters? This way, if someone wants to change the URL without changing the hash, it becomes possible to just override both the url (with the new URL) and the extraHashedData.url (with the old URL) parameters, and it should give the old hash. And at the same time, it means that for simple updates, just changing the URL would be enough to trigger a re-download and thus avoid the “unknowingly using the old FOD” issue.

@wmertens

Right now, all I do when I update a fixed output is to remove the sha1. It will then download it and tell me the new hash.

@kevincox (Contributor)

@Ekleog the problem with that approach is that if the original URL goes offline, it will be impossible to provide an alternate URL for the same data. I guess we could add an optional originalUrl parameter, but this seems needlessly confusing. I think it would be nice to have a dedicated prefix like hash = "regen:" to force a fetch and report the hash, but I have a macro that replaces the hash with 0000000000000000000000000000000000000000000000000000 which is good enough.
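
For reference, nixpkgs' lib already ships an all-zero placeholder hash for this workflow; a minimal sketch (assuming `lib` is in scope):

```
fetchurl {
  url = "https://example.org/foo-1.1.tar.gz";
  # Build once with the placeholder; Nix fails with a hash mismatch that
  # reports the real hash, which is then pasted back in.
  sha256 = lib.fakeSha256;
}
```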

@SuperSandro2000 (Member)

> which is good enough.

No, it is not. I want to turn off fixed output derivations on a developer machine to know when download links break or I forgot to change a hash.

@7c6f434c (Member)

> to turn off fixed output derivations on a developer machine

… that's one thing I really do not want happening on any machine I use.

Maybe what is needed is an overlay that hashes all the parameters to fetchers and appends the hash to the name, then copies it into a normal fixed-output derivation with the compatible hash, and uses the tests-required-but-do-not-affect-hash approach by @Profpatsch to make sure the non-fixed version works before you are allowed to use the properly fixed version?

@Ekleog (Member) commented Mar 27, 2021

@kevincox I'm not sure my suggestion was clear? What I'm suggesting is:

  1. We add an extraHashedData attrset to fetchurl that is taken into account in the hash.
  2. We make extraHashedData default to { url = args.url; } (this is a world-rebuild).
  3. When updating the url field, the hash thus automatically changes, because the url is part of the hashed data.
  4. When updating the url field without wanting a hash change (which IME is quite clearly the exception), one would set url = new-url; extraHashedData = { url = old-url; }; (see the sketch below).
  5. When updating the url field of a fetchurl that already has extraHashedData, hopefully the extraHashedData will be obvious enough that people will remove it and then regain the benefit of the hash automatically changing.

I don't think this would be particularly confusing, as extraHashedData is pretty clear in naming; and it would support both use cases in a, I think, reasonable way.
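
A minimal sketch of steps 3 and 4 with the proposed (not yet existing) extraHashedData attribute:

```
# Step 3: a plain update; the changed url is part of the hashed data,
# so the fixed output is invalidated and refetched.
fetchurl {
  url = "https://example.org/foo-1.1.tar.gz";
  sha256 = "0000000000000000000000000000000000000000000000000000"; # placeholder
}

# Step 4: changing the url without wanting a hash change (e.g. upstream moved);
# the old url is kept in extraHashedData so the old output hash still matches.
fetchurl {
  url = "https://mirror.example.net/foo-1.0.tar.gz";
  extraHashedData = { url = "https://example.org/foo-1.0.tar.gz"; };
  sha256 = "0000000000000000000000000000000000000000000000000000"; # placeholder
}
```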

Does this idea make sense?

@kevincox (Contributor)

That makes sense. I think the requirement to set the old URL is a bit weird, but avoiding accidentally building on an old hash seems worth it 👍
