Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly store the outputs of CA derivations − take 2 #4330

Merged
merged 3 commits into from Dec 14, 2020

Conversation

thufschmitt
Copy link
Member

@thufschmitt thufschmitt commented Dec 8, 2020

Reopening of #4315 with a branch from the nixos/nix repo just to allow dpulls to work properly

Fix #4138

@Ericson2314
Copy link
Member

Well this is a boring thing to bring up, but what do all you say we use JSON for the new binary cache files? It would be nice to avoid these bespoke parsers a bit.

@thufschmitt
Copy link
Member Author

Well this is a boring thing to bring up, but what do all you say we use JSON for the new binary cache files? It would be nice to avoid these bespoke parsers a bit.

Good point… I kept the same syntax as the narinfo files for consistency, but I don't care about changing that − might even be nice to have a more structured format to store lists of things or the like.

I'd rather not have it block this PR though ;)

@edolstra
Copy link
Member

edolstra commented Dec 8, 2020

Yes, I'm in favor of using JSON for new types of files (in fact we're using it for NAR listings in the binary cache).

Comment on lines 460 to 461
auto filePath = "/drvOutputs/" + std::string(info.id.drvPath.hashPart()) +
"!" + info.id.outputName + ".doi";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you all think about nesting further like this?

Suggested change
auto filePath = "/drvOutputs/" + std::string(info.id.drvPath.hashPart()) +
"!" + info.id.outputName + ".doi";
auto filePath = "/drvOutputs/" + std::string(info.id.drvPath.hashPart()) +
"/" + info.id.outputName + ".doi";

rather than using !

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used ! because that's the already standard delimiter for derivation outputs, so it makes sense to use the same on binary caches − I've actually pushed 391f753 to make this use id.to_string() directly which makes this more natural.

That being said, I don't have a very strong opinion, so a / could be good to

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could even use a / for id.to_string(), but I like the fact that ! gives a clear message of this isn't a store path, don't try to use it as such

@thufschmitt
Copy link
Member Author

thufschmitt commented Dec 8, 2020

Yes, I'm in favor of using JSON for new types of files (in fact we're using it for NAR listings in the binary cache).

Ok, I've opened #4332 to track this

@Ericson2314
Copy link
Member

@regnat does this make any changes w.r.t. storing the unresolved derivations? I think it doesn't but I am not yet sure. Of course, I'll be very happy of it doesn't :).

@thufschmitt
Copy link
Member Author

@regnat does this make any changes w.r.t. storing the unresolved derivations? I think it doesn't but I am not yet sure. Of course, I'll be very happy of it doesn't :).

Nah that one doesn't

@thufschmitt thufschmitt self-assigned this Dec 9, 2020
@thufschmitt thufschmitt added the ca-derivations Derivations with content addressed outputs label Dec 9, 2020
@Ericson2314
Copy link
Member

Nah that one doesn't

Great!

This introduces some redundancy on the remote-store side between
wopQueryDerivationOutputMap and wopQueryDrvOutputInfo.
However we might need to keep both (regardless of backwards compat)
because we sometimes need to get some infos for all the outputs of a
derivation (where wopQueryDerivationOutputMap is handy), but all the
stores can't implement it − because listing all the outputs of a
derivation isn't really possible for binary caches where the server
doesn't allow to list a directory.

What if we stored one map per derivation with all the outputs in the binary caches?

(Also, I think that was renamed wopQueryRealisation?)

@thufschmitt
Copy link
Member Author

This introduces some redundancy on the remote-store side between
wopQueryDerivationOutputMap and wopQueryDrvOutputInfo.
However we might need to keep both (regardless of backwards compat)
because we sometimes need to get some infos for all the outputs of a
derivation (where wopQueryDerivationOutputMap is handy), but all the
stores can't implement it − because listing all the outputs of a
derivation isn't really possible for binary caches where the server
doesn't allow to list a directory.

What if we stored one map per derivation with all the outputs in the binary caches?

That was my initial idea, but it's possible to only push a subset of the outputs to the binary cache, meaning that if you add some more afterwards you'd have to update the entry, and that would break the − really nice property imho − that everything in the binary cache is immutable (we could also use a storing scheme like drvHash/outputName like you suggest above and list the directories, but not every cache permits listing directories − and I think not requiring that is also a very cool property that would be sad to loose).

(Also, I think that was renamed wopQueryRealisation?)

Someone should make a variant of sed that also updates the commit messages to make sure that they accurately reflect your changes 😒

@Ericson2314
Copy link
Member

@regnat well what if you had to push the mapping for all outputs whenever you pushed any of them? Fundamentally, the mappings all trace back to builds, and it's impossible to just build some of the derivations, so I think this could actually work.

I think unlike the local store, we might very much want to have a relation not function for the binary cache. This would allow pushing conflicting mappings when we have some outputs and other outputs from another, in a particularly honest fashion. I guess we could just do <drv-hash>/<mapping-file-checksum> or something to avoid collisions or the pain of dealing with incremented numbers.

@thufschmitt
Copy link
Member Author

@regnat well what if you had to push the mapping for all outputs whenever you pushed any of them? Fundamentally, the mappings all trace back to builds, and it's impossible to just build some of the derivations, so I think this could actually work.

Up to a certain point, yes, except that you can nix-build foo.out && nix-collect-garbage && nix copy --to binary-cache foo.out (or just nix-build foo.out && nix copy --to binary-cache foo.out where foo.out is already in a different binary cache so you don't have to build foo.dev).

I think unlike the local store, we might very much want to have a relation not function for the binary cache. This would allow pushing conflicting mappings when we have some outputs and other outputs from another, in a particularly honest fashion. I guess we could just do <drv-hash>/<mapping-file-checksum> or something to avoid collisions or the pain of dealing with incremented numbers.

I'm not sure I get exactly how that would work… do you mind expanding a bit further how you consider things? (If possible in a new issue as this PR doesn't do much for binary caches except for a basic and probably not definitive implementation of queryRealisation)

@Ericson2314
Copy link
Member

Ericson2314 commented Dec 9, 2020

@regnat Sure I can expand elsewhere, maybe let's just merge this, as long as we are happy of all manner of breaking changes elsewhere. queryRealisation is guarded by the experimental issue so I think we're good.

@thufschmitt
Copy link
Member Author

@regnat Sure I can expand elsewhere, maybe let's just merge this, as long as we are happy of all manner of breaking changes elsewhere. queryRealisation is guarded by the experimental issue so I think we're good.

I'm definitely happy with any breaking change that migh (and will) happen!

I've rebased to fix the mention of wopQueryDrvOutputInfo commit message.

@Ericson2314
Copy link
Member

I made the multiple outputs issue: #4344. [Side note, I wonder if I could get enough perms to at label the issue and whatnot?]

-- Won't be loaded unless the experimental feature `ca-derivations`
-- is enabled

create table if not exists Realisations (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "realisations" seems a bit confusing. What does it realise? In Nix, "realise" is already a vague and overloaded term for "substituting or building a derivation". There is also a terminology mismatch between registerDrvOutput and queryRealisation - shouldn't that be registerRealisation or queryDrvOutput?

How about CADerivers or CADerivationOutputs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's indeed a terminology mismatch and we need to settle on one thing.

I really like Realisation tbh because these are precisely witnesses that a derivation has been realised (for the definition you give above). But it's indeed a quite overloaded term so if you don't like it we can find something else.

I tend to use DerivationOutput only for the “symbolic” foo.drv!out because it's also a very overloaded term and it's easy to get lost if we abuse of it.

(also, they aren't really Derivers (though they could be used as such) because the intent is to associate derivation outputs (foo.drv!out) to their output paths and not the other way around)

TL;DR, I'm fine changing the naming, but for my own sanity (and other's too) I think we need to find clear and distinct names for

  1. foo.drv!out
  2. The relation (foo.drv!out, /nix/store/123-foo-out)

Comment on lines 6 to 7
id integer primary key autoincrement not null,
drvPath text not null,
outputName text not null, -- symbolic output id, usually "out"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't guarantee that (drvPath, outputName) is unique. That seems okay (since we could build a derivation multiple times and get a different result) but if we want to ensure uniqueness, we could have (drvPath, outputName) as the primary key and drop id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thanks I should have caught this. Yes let's start as stringent as possible and only relax later if need be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I'll drop the id for now

foreign key (outputPath) references ValidPaths(id) on delete cascade
);

create index if not exists IndexRealisations on Realisations(outputPath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also need an index on (drvPath, outputName)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course. Actually the index on outputPath is quite useless

src/libstore/local-store.cc Outdated Show resolved Hide resolved
src/libstore/local-store.cc Outdated Show resolved Hide resolved
src/libstore/local-store.cc Outdated Show resolved Hide resolved
src/libstore/local-store.cc Outdated Show resolved Hide resolved
@@ -21,7 +21,7 @@ namespace nix {
0.7. Version 2 was Nix 0.8 and 0.9. Version 3 is Nix 0.10.
Version 4 is Nix 0.11. Version 5 is Nix 0.12-0.16. Version 6 is
Nix 1.0. Version 7 is Nix 1.3. Version 10 is 2.0. */
const int nixSchemaVersion = 10;
const int nixSchemaVersion = 11;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to bump the schema version now that we have a sub-version for the CA-related stuff, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope indeed

@@ -99,6 +102,10 @@ public:
StorePath addTextToStore(const string & name, const string & s,
const StorePathSet & references, RepairFlag repair) override;

void registerDrvOutput(const Realisation & info) override;

std::optional<const Realisation> queryRealisation(const DrvOutput &) override;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a std::vector<Realisation> since there could be multiple realisations for a particular derivation+output-name. Even if we don't really support multiple realisations at the moment, it's probably good to express the possibility in the interface. When we need a particular realisation for a dependency (when building) we can just pick the first one for now. But in the future we could have an algorithm that tries to pick realisations that minimise the amount of closure rewriting needed to get a consistent input closure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're okay with it, I'll rather just change the daemon protocol to allow passing a list as it's the only place in which we can't break backwards compat for free, and it makes more sense to me to make it clear that we can only return one realisation for the time being as long as there's nothing preventing us from changing that in the future

@@ -443,6 +443,23 @@ StorePath BinaryCacheStore::addTextToStore(const string & name, const string & s
})->path;
}

std::optional<const Realisation> BinaryCacheStore::queryRealisation(const DrvOutput & id)
{
auto outputInfoFilePath = realisationsPrefix + "/" + id.to_string() + ".doi";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does doi mean?

BTW this URL format means that there can be only one mapping per realisation, which is probably fine for a binary cache, but it does mean you can't take the union of two binary caches (e.g. a mirror of multiple upstream binary caches) - or at least not of their /realisations directory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doi stands for “derivation output info” (it's indeed quite obscure).

Yes, I agree on the uniqueness constraint that might be somehow too strong. But I think we'll have time to rework this schema − it's experimental anyways, so nothing is set in any way, and it's not really clear to me atm how we could have a sensible implementation if we drop it.

Copy link
Member Author

@thufschmitt thufschmitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edolstra Just gave a pass after your comments, I'll let you have a second look

@@ -443,6 +443,23 @@ StorePath BinaryCacheStore::addTextToStore(const string & name, const string & s
})->path;
}

std::optional<const Realisation> BinaryCacheStore::queryRealisation(const DrvOutput & id)
{
auto outputInfoFilePath = realisationsPrefix + "/" + id.to_string() + ".doi";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doi stands for “derivation output info” (it's indeed quite obscure).

Yes, I agree on the uniqueness constraint that might be somehow too strong. But I think we'll have time to rework this schema − it's experimental anyways, so nothing is set in any way, and it's not really clear to me atm how we could have a sensible implementation if we drop it.

@@ -99,6 +102,10 @@ public:
StorePath addTextToStore(const string & name, const string & s,
const StorePathSet & references, RepairFlag repair) override;

void registerDrvOutput(const Realisation & info) override;

std::optional<const Realisation> queryRealisation(const DrvOutput &) override;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're okay with it, I'll rather just change the daemon protocol to allow passing a list as it's the only place in which we can't break backwards compat for free, and it makes more sense to me to make it clear that we can only return one realisation for the time being as long as there's nothing preventing us from changing that in the future

@@ -21,7 +21,7 @@ namespace nix {
0.7. Version 2 was Nix 0.8 and 0.9. Version 3 is Nix 0.10.
Version 4 is Nix 0.11. Version 5 is Nix 0.12-0.16. Version 6 is
Nix 1.0. Version 7 is Nix 1.3. Version 10 is 2.0. */
const int nixSchemaVersion = 10;
const int nixSchemaVersion = 11;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope indeed

@Ericson2314
Copy link
Member

Ericson2314 commented Dec 10, 2020

as it's [queryRealization in the daemon protocol is] the only place in which we can't break backwards compat for free

How come? Can we change all implementations to assert the experimental feature before doing anything else?

@thufschmitt
Copy link
Member Author

as it's [queryRealization in the daemon protocol is] the only place in which we can't break backwards compat for free

How come? Can we change all implementations to assert the experimental feature before doing anything else?

Oh yes, you're right. Well, now it's implemented and the code is actually cleaner that way, so might as well keep it

@Ericson2314
Copy link
Member

Yeah I'm not against this new version (though I haven't closely looked), just think if we did #4344 we definitely would change it again.

For each known realisation, store:
- its output
- its output path

This comes with a set of needed changes:

- New `realisations` module declaring the types needed for describing
  these mappings
- New `Store::registerDrvOutput` method registering all the needed informations
  about a derivation output (also replaces `LocalStore::linkDeriverToPath`)
- new `Store::queryRealisation` method to retrieve the informations for a
  derivations

This introcudes some redundancy on the remote-store side between
`wopQueryDerivationOutputMap` and `wopQueryRealisation`.
However we might need to keep both (regardless of backwards compat)
because we sometimes need to get some infos for all the outputs of a
derivation (where `wopQueryDerivationOutputMap` is handy), but all the
stores can't implement it − because listing all the outputs of a
derivation isn't really possible for binary caches where the server
doesn't allow to list a directory.
Add a new table for tracking the derivation output mappings.

We used to hijack the `DerivationOutputs` table for that, but (despite its
name), it isn't a really good fit:

- Its entries depend on the drv being a valid path, making it play badly with
  garbage collection and preventing us to copy a drv output without copying
  the whole drv closure too;
- It dosen't guaranty that the output path exists;

By using a different table, we can experiment with a different schema better
suited for tracking the output mappings of CA derivations.
(incidentally, this also fixes #4138)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ca-derivations Derivations with content addressed outputs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The drvOutputId->OutputPath link is lost for CA derivations when the drv file is deleted
3 participants