staging workflow #26

vcunat · 2018-03-05T09:55:24Z

Define a new workflow for the staging branch that can better accomodate the current and future influx of changes in order to deliver mass-rebuilds faster to master. As part of this new workflow an additional branch, staging-next, shall be introduced.

More details in the RFC proposal itself now.

shlevy · 2018-03-05T11:59:50Z

@vcunat Before going into the specifics of how we can change things on the workflow side, which I agree are probably needed regardless, I do want to point out that there seems to be an issue with hydra that exacerbates this. For at least two 5+-day periods this month, there was a case where some build was blocked on "sending inputs" for many days and that for some reason resulted in the queue not being pulled from properly. It's possible with that issue fixed we won't have to worry about this quite yet.

vcunat · 2018-03-05T12:12:19Z

Yes, I know this. If one is unlucky and a "mass-rebuild step" gets stuck this way...

shlevy · 2018-03-05T12:28:41Z

So it seems like right now there are a bunch of different things we use staging for:

Validate that package X still builds with my change to package X itself
Validate that the problem motivating my change is fixed by my change (not applicable if this is just "bump the version because there's a new one")
Validate that my change satisfies basic sanity checks for my system
Validate that my change satisfies basic sanity checks for all systems
Validate my change satisfies a full suite of checks for all systems when integrated with the latest, and make binaries available.

This is over-engineered, but just to share the thought: I think in principle we could achieve guaranteed* eventual merging of correct code with a series of queues, possibly prioritized. At each stage, jobs are processed in the order they are evaluated for that stage. Each jobset only allowed one evaluation in the given queue at a time, if it needs a new commit to be tested it goes back to the back of the line at that queue. A jobset at stage N+1 must have as its base a commit that succeeded at stage N. If any commit at stage M causes a change to the output path of one of the builds of stage N < M, that change must go back to the earliest such N. The last integration stage is special: merge commits that change the output paths of earlier steps but don't break them are allowed, but regular commits aren't.

The expected workflow is that you'd start off of some commit in master, iterate at each stage fixing the failures without changing what already worked, and only at the last stage merging in the latest and doing fixups there. This guaranteed eventual merging falls down if every time we get to the "integration" stage a new change has been merged that fundamentally breaks all earlier stages. But this seems very unlikely to happen by accident. Maybe we can simply allow only one change in the final step at a time, and allow arbitrary iteration within that step.

peti · 2018-03-05T16:06:48Z

Note that staging is assigned a ridiculously low 0.01% of Hydra's scheduling shares, i.e. it's building pretty much only if nothing else is. Maybe we should re-consider that jobset's priority?

vcunat · 2018-03-05T16:16:31Z

The regular jobsets certainly should have a very large advantage against staging, IMO, but on the whole we might agree on some system how to set the shares. It seems rather ad-hoc ATM.

shlevy · 2018-03-22T11:56:19Z

@vcunat What do you think of this workflow:

In the normal case, most things can just go to staging and big changes (like a gcc bump) have to have their own release-small jobset pass first. At any time, a maintainer can branch off staging-stabilization, which then changes the workflow to: anything that fixes things broken on staging-stabilization goes directly there (or, if applicable, into master then merged into staging-stabilization) and is then merged into staging, any new mass-rebuild changes go into staging. Once staging-stabilization is ready, it is merged into master and we go back to the normal workflow. staging-stabilization can have a higher proportion of shares and, since only fixes can go into it, is much more likely to actually reach readiness in a timely manner than staging, without anyone having to wait to integrate their work in with others.

grahamc · 2018-05-29T16:14:37Z

@vcunat how about we get this RFC written? :)

vcunat · 2018-05-29T18:32:53Z

Well, I don't expect to really have time to "maintain" staging this year, with all other things I've "promised" to do, so I'm perhaps not a good person to act on this. I believe Shea's or mine proposal or some hybrid might work reasonably. Perhaps it's better to first start using something and see how it goes, and document the resulting workflow retrospectively? (Both of the changes really affect only Hydra's setup and workflow of "staging maintainer(s)" and not regular merges to staging.)

FRidh · 2018-05-29T18:40:57Z

I also like the idea of two branches:

the current staging that we're working on getting into master (i.e. stabilizing)
the branch where other mass-rebuilds are staged that go into the next cycle

For 2) I would like to see a Hydra job that corresponds to a small set, say release-small plus some additional packages such as ghc. Just as a check before we start rebuilding everything.

xeji · 2018-06-01T20:21:55Z

I like the idea of staging -> staging-stabilization -> master.
We could even take that a step further and make it the standard workflow for everything:

All PRs go to staging by default. staging gets high scheduling priority on Hydra.
Only fixes to stabilize staging-stabilization or urgent fixes to master with limited rebuilds go there directly.
staging-stabilization is (automatically?) merged to master (then freshly branched off staging) when a defined set of tests succeed and the number of Hydra failures is below some threshold. Pretty much like a channel today but with stricter requirements. So master always has whatever we define as "release quality".

This approach would make it much easier to create and stabilize a release. It could also evolve into a rolling-release model in the future.

Ekleog · 2018-06-01T23:05:17Z

I'm not sure about having all PRs to staging: this loses the current advantage of having non-mass-rebuilding changes merged fast into master and thus into NixOS unstable. In exchange it'd mean more effort in getting staging merged, though, so maybe it'd actually be a net win?

7c6f434c · 2018-06-02T07:48:57Z

@Ekleog Hm, good question how to have a fast track and a slow track without having fast track catch the slow track in eternal catching-up and re-testing…

vcunat · 2018-06-02T11:15:31Z

I think it's doable here. Master especially shouldn't directly get any "risky" changes; if breakage happens, it should be quickly fixed or reverted ("moved" to another branch). Mass-rebuild status is now auto-determined on PRs (reliably), and non-mass changes rarely break very many packages, so workflow with fixes and running nixos tests can typically be iterated much faster than on staging.

7c6f434c · 2018-06-02T11:21:22Z

Well, sometimes staging contains fixes for a package that has been updated on the master while the staging was building.

vcunat · 2018-06-02T11:21:58Z

I merged the second-order PR immediately to reduce the level of indirection, but:

Hydra shares

I think Hydra shares should be lower for staging* than for master; that way master's small updates are delivered fast. Maybe we could even have some hydra.nixos.org -wide policy for the shares and check intervals (with particular numbers); it always feels ad-hoc when I'm setting those, mostly copying from some "similar" jobset. Possibly via another RFC :-)

Ekleog · 2018-06-02T14:43:49Z

rfcs/0026-staging-workflow.md

+# Alternatives
+[alternatives]: #alternatives
+
+What other designs have been considered? What is the impact of not doing this?


I think this would need to be removed / reworded before merging. Something like this?

## Keep the statu quo As explained above, this leads to issues with `staging` never managing to stabilize due to the stream of mass-rebuilds. ## Merge all changes in `staging` Also merging non-mass-rebuilds changes in `staging` would mean a slower delivery of non-mass-rebuilding changes to `master`. Given there is now a way to know, before merging a PR, whether it is a mass rebuild, the benefit of having non-mass-rebuilding changes go to `staging` are limited. The most important one is duplicate work in case a package doesn't build on `staging` while it has already been fixed in `master`. This can be handled by regularly merging `master` in `staging`, as `master` rebuilds should be almost free compared to `staging` rebuilds.

Filled. I can't see any potential advantage of "Merge all changes in staging" compared to the current workflow or the one proposed in this RFC, but it seems to be essentially the "Single branch" alternative in RFC draft now, if I understand it right.

Ekleog · 2018-06-02T14:45:51Z

rfcs/0026-staging-workflow.md

+# Future work
+[future]: #future-work
+
+-


Same here, maybe “No future work is expected from this RFC”? Or something like “Adjust hydra intervals and build shares depending on the load that will be observed”?

Actually, maybe reevaluating the choice of “small fixes in master” vs. “small fixes in staging-next too” could make sense after some experimentation, depending on how frequently we manage to merge staging into master with this new scheme.

Now "future work" is filled, basically in sense "put this RFC into action".

Ekleog · 2018-06-02T14:48:19Z

rfcs/0026-staging-workflow.md

+start-date: (fill me in with today's date, YYYY-MM-DD)
+author: (name of the main author)
+co-authors: (find a buddy later to help our with the RFC)
+related-issues: (will contain links to implementation PRs)


I think this should be filled in.

xeji · 2018-06-02T15:16:18Z

rfcs/0026-staging-workflow.md

+
+
+The check interval of `staging` is reduced from 24 hours to 12 hours. This can
+be done because only stabilization fixes shall be submitted and thus fewer


This sentence seems to contradict the definition of the branches above. "only stabilization fixes" is true for staging-next, not staging.

I'm also slightly confused about what staging* branch is intended for pull requests vs stabilisation after reading this.

Changed to staging-next.

Confusion: I believe that any fixes-only changes should go to staging-next (counting only problems present in the branch already). I would hope that the current text makes it clear:

staging-next is branched from staging and only fixes to stabilize and security fixes shall be delivered to this branch. This branch shall be merged into master when deemed of sufficiently high quality.

but certainly feel free to suggest a clearer formulation.

The exact mechanism how they get there doesn't seem too important to me. Ideally they get committed to staging-next directly, but if they might be in staging already (accidentally) and get cherry-picked to staging-next...

xeji · 2018-06-02T15:19:40Z

rfcs/0026-staging-workflow.md

+[drawbacks]: #drawbacks
+
+A potential drawback of this new workflow is that the additional branch may be considered complicated and/or more difficult to work with.
+


However, most contributors would keep working as before: choose master or staging depending on the number of rebuilds.

Right, added.

xeji · 2018-06-02T15:21:51Z

rfcs/0026-staging-workflow.md

+- `master` is the main branch and all small deliveries shall go here;
+- `staging` is branched from `master` and mass-rebuilds and other large deliveries go to this branch;
+- `staging-next` is branched from `staging` and only fixes to stabilize and security fixes shall be delivered to this branch.
+


Clarify how changes end up in master. From the discussion, my understanding is that only staging-next (never staging directly) is eventually merged into master.
Do we want to set criteria for when staging-next is "good enough" for merging to master?

Close #2

7c6f434c · 2018-06-24T09:02:57Z

rfcs/0026-staging-workflow.md

+
+Reducing the size of the Hydra jobset would mean the iteration pace could be
+higher, but has the downside of testing fewer packages, and having fewer binary
+substitutes available.


Another option here is to reduce the channel-blocking Hydra jobset size, but have a 48-hourly cycle of trying to build the long tail. Still probably increases amount of things broken in the channel, but might be a middle ground. (I am not saying I prefer that compromise to the current proposal)

The channel-update criteria contain waiting for all builds to finish, but the thing is doable by having two jobsets on the branch. Personally I think the channel-blocking set should grow and not shrink, especially for the release branches.

I agree, but I also think it is good to have options enumerated and explicitly dismissed with a reason given.

Added a sentence describing the approach.

7c6f434c · 2018-06-24T09:04:08Z

I think you should remove the first line of the initial message now.

7c6f434c · 2018-06-24T09:07:28Z

rfcs/0026-staging-workflow.md

+
+# Drawbacks
+[drawbacks]: #drawbacks
+


Looking at the proposal, I am not completely sure if it will in fact reduce or increase the overall number of jobs for Hydra to build. It would increase usefuk information for unit of build time, but it might increase the load.

This risk should either be mentioned as a possible drawback or there should be a discussion why this risk is not likely.

I assume we will see how the load turns out in real life and adjust shares/intervals or even the whole workflow. People actively working on staging stabilization should also be able cancel jobs and force evaluations.

In any case, I would like to push towards more power in Hydra over the long term, as I believe it will help with mass rebuilds and human time is more precious than machine time.

Good point about Hydra access (this is a part of proposed workflow, right?), and I agree in principle about expanding Hydra checks.

I just want things mentioned explicitly.

I think such explicit discussion could also be a way to better understand the trade-offs relative to complicated alternatives: staging (release-small) -> staging-next (release-small) -> staging-next (release) (manually triggered, unless Hydra can have jobsets with preconditions).

Well, I was always doing cancels and evals on Hydra when working with staging. The right to manage Hydra jobsets doesn't seem more security-sensitive that push access to nixpkgs. Overall, this text so far hasn't mentioned people – who will be doing this staging stabilization, etc.

Complicated alternatives: another thing to consider is that Hydra seems likely to be over-powered on x86 linux (relative to other platforms), and even for other reasons it makes sense for most changes to stabilize them for x86_64-linux first/deeper before others.

Well, it doesn't mention people but it describes what tools these people are expected to use (and how). You do discuss possible evaluation frequencies.

And yes, I would find it nice if you mentioned possible using buildpower disbalance.

vcunat · 2018-06-24T09:10:20Z

@7c6f434c: which part do you mean?

7c6f434c · 2018-06-24T09:12:57Z

The text is not empty anymore.

vcunat · 2018-06-24T09:14:45Z

I see, I was just looking at the RFC text :-)

orivej

This seems ready for implementation.

oxij · 2018-10-05T19:57:34Z

Can't we rename `staging-next` to `testing`, `staging-stable` or whatever else after all? I'm sure everyone here, including me, got accustomed and stopped caring already, but I still think it is a very confusing naming scheme for new contributors.

xeji · 2018-10-05T20:58:16Z

I agree. staging-next is a confusing name. Initially I thought it meant the next stuff to go into staging...

staging-next fits the man gitworkflows docs nicely, where pu (staging) -> next (staging-next) -> master.

Why not simply call it next then?

asymmetric · 2018-11-11T18:24:00Z

Is there any agreement on how to proceed here? It's a bit unfortunate that we're stuck on a naming issue 😅

Let's throw a dice and let the universe decide.

LnL7 · 2018-11-14T19:57:45Z

This should just be merged. The proposal is in use and nitpicking about the name seems pointless to me.

zimbatm · 2018-11-15T14:03:46Z

Merging as it's implemented

jtojnar · 2018-11-27T01:52:55Z

Is the following diagram correct?

digraph {
    "small changes" [shape=none]
    "mass-rebuilds and other large changes" [shape=none]
    "critical security fixes" [shape=none]
    "broken staging-next fixes" [shape=none]

    "small changes" -> master
    "mass-rebuilds and other large changes" -> staging
    "critical security fixes" -> master
    "broken staging-next fixes" -> "staging-next"

    "staging-next" -> master [color="#E85EB0"] [label=stabilization] [fontcolor="#E85EB0"]
    "staging" -> "staging-next" [color="#E85EB0"] [label=stabilization] [fontcolor="#E85EB0"]

    master -> staging [color="#5F5EE8"] [label="any time"] [fontcolor="#5F5EE8"]
}

Open in Editor

Mic92 · 2018-11-27T10:43:45Z

I think we can append this graph to the RFC.

zimbatm · 2018-11-27T21:34:30Z

Has the RFC workflow been copied to nixpkgs yet? RFC are just here to gain consensus, the live documentation should live somewhere else.

LnL7 · 2018-11-28T19:41:45Z

@jtojnar I guess it's not really mentioned in the RFC but I think in general even critical security fixes should go to staging-next first.

This obviously depends on the size of the rebuild and current state of staging-next, but in my experience having a large queue on master is detrimental for the stability there and often results in breakages/issues that hold back channel updates.

Ekleog · 2018-11-28T23:53:42Z

It's actually in the RFC: (emphasis mine)

staging-next is branched from staging and only fixes to stabilize and security fixes shall be delivered to this branch.

(though granted I can't find a place where it explicitly states that security mass-rebuilds don't go in master)

jtojnar · 2018-11-29T00:09:04Z

I understood that as fixes to stabilize staging and fix vulnerabilities on staging.

That interpretation is consistent with the doctrine that critical security fixes should go directly to the branch that is affected by the vulnerabilities. Not sure if this doctrine still applies with this RFC accepted.

nixos-discourse · 2019-04-29T18:05:01Z

This pull request has been mentioned on Nix community. There might be relevant details there:

https://discourse.nixos.org/t/release-process-staging-branches/2799/2

nixos-discourse · 2019-04-29T18:05:10Z

This pull request has been mentioned on Nix community. There might be relevant details there:

https://discourse.nixos.org/t/release-process-staging-branches/2799/3

nixos-discourse · 2019-08-27T16:51:38Z

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/maintainers-needed-for-merging-staging-staging-next-branches/3833/3

nixos-discourse · 2019-09-23T18:09:00Z

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-shell-on-a-python-package-gives-a-pip-error/4137/4

jtojnar · 2020-12-05T11:18:30Z

I bring you the latest diagram:

Now the reverse flow (master → staging-next → staging) happens automatically every six hours thanks to NixOS/nixpkgs#105153.

digraph {
    "small changes" [shape=none]
    "mass-rebuilds and other large changes" [shape=none]
    "critical security fixes" [shape=none]
    "broken staging-next fixes" [shape=none]

    "small changes" -> master
    "mass-rebuilds and other large changes" -> staging
    "critical security fixes" -> master
    "broken staging-next fixes" -> "staging-next"

    "staging-next" -> master [color="#E85EB0"] [label="stabilization ends"] [fontcolor="#E85EB0"]
    "staging" -> "staging-next" [color="#E85EB0"] [label="stabilization starts"] [fontcolor="#E85EB0"]

    master -> "staging-next" -> staging [color="#5F5EE8"] [label="every six hours/any time"] [fontcolor="#5F5EE8"]
}

Open in Editor

Ericson2314 · 2020-12-05T20:54:12Z

^ This is great, should go in the Nipxkgs manual.

jtojnar · 2020-12-05T22:35:24Z

See NixOS/nixpkgs#105986 😉

FRidh · 2020-12-10T08:30:20Z

Commenting on @jtojnar NixOS/nixpkgs#106302

This is a security fix so it should go to master according to #26. Actually, that says staging-next so I am confused because I kept hearing master and even repeated that myself.

Yes I think we need to revise/clarify what is written there. The rfc indeed says staging-next, however it lacks a motivation.
Regarding master versus staging-next, there are pros and cons to both. I don't recall what I was thinking about this at the time; the biggest issue back then was the influx of changes on staging all the time, making stabilization impossible.

Critical security issues

I'd argue that critical issues should go to master, however if staging-next is as good as ready, that should merged right as well as well because that saves rebuilds. At that point it does not matter whether the fix goes to master or staging-next regarding the amount of rebuilds. It does matter regarding channel updates; during the rebuild there won't be any updates due to other smaller changes, but that is not very important. What is important is that nixos-small gets updated.

This obviously depends on the size of the rebuild and current state of staging-next, but in my experience having a large queue on master is detrimental for the stability there and often results in breakages/issues that hold back channel updates.

This I agree with. So this is a trade-off, and I think in case of critical security fixes the delay of a day or two due to the rebuilds is worth it.

One should however also note that the security fix is typically also backported. A rebuild of the stable branch has a higher priority and that causes the blocking of master for a period longer than a day or two.

Non-critical security fixes

These go through staging as usual, to reduce the amounts of rebuilds necessary.

* staging workflow: init empty RFC * 0026-staging-workflow: first draft * 0026-staging-workflow: second draft Close vcunat#2 * 0026-staging-workflow alternatives: more discussion

staging workflow: init empty RFC

0fe7191

dtzWill mentioned this pull request Mar 6, 2018

kernel-headers: remove deprecated and broken asm symlink fixups NixOS/nixpkgs#36366

Merged

8 tasks

orivej mentioned this pull request May 31, 2018

lz4: 131 -> 1.8.2 NixOS/nixpkgs#41293

Merged

8 tasks

0026-staging-workflow: first draft

728f66a

Ekleog approved these changes Jun 2, 2018

View reviewed changes

xeji reviewed Jun 2, 2018

View reviewed changes

0026-staging-workflow: second draft

7515313

Close #2

7c6f434c reviewed Jun 24, 2018

View reviewed changes

0026-staging-workflow alternatives: more discussion

4932519

orivej approved these changes Jul 4, 2018

View reviewed changes

xeji mentioned this pull request Oct 16, 2018

boehmgc: 7.6.8 -> 8.0.0 NixOS/nixpkgs#48560

Merged

vcunat mentioned this pull request Oct 29, 2018

apparmor: 2.12.0 -> 2.13.1 NixOS/nixpkgs#48550

Merged

zimbatm merged commit 56f5acd into NixOS:master Nov 15, 2018

7c6f434c mentioned this pull request Nov 23, 2018

[RFC 0036] Improving the RFC process #36

Merged

jtojnar mentioned this pull request May 10, 2019

sublimetext3 fails to run NixOS/nixpkgs#60807

Closed

vcunat mentioned this pull request Jun 14, 2019

gcc8 should be the default compiler NixOS/nixpkgs#63126

Closed

jtojnar mentioned this pull request Sep 29, 2019

RFC-0053: defining pull-request workflow #53

Closed

jtojnar mentioned this pull request Dec 9, 2020

gdk-pixbuf: 2.42.0 → 2.42.2 NixOS/nixpkgs#106302

Merged

10 tasks

vcunat deleted the staging branch April 14, 2021 08:03

mweinelt mentioned this pull request Mar 22, 2022

pytorch-bin: 1.10.0 -> 1.11.0 NixOS/nixpkgs#164712

Merged

21 tasks



		The check interval of `staging` is reduced from 24 hours to 12 hours. This can
		be done because only stabilization fixes shall be submitted and thus fewer

		[drawbacks]: #drawbacks

		A potential drawback of this new workflow is that the additional branch may be considered complicated and/or more difficult to work with.

staging workflow #26

staging workflow #26

Conversation

vcunat commented Mar 5, 2018 • edited

shlevy commented Mar 5, 2018

vcunat commented Mar 5, 2018

shlevy commented Mar 5, 2018

peti commented Mar 5, 2018

vcunat commented Mar 5, 2018 • edited

shlevy commented Mar 22, 2018

grahamc commented May 29, 2018

vcunat commented May 29, 2018

FRidh commented May 29, 2018 • edited

xeji commented Jun 1, 2018

Ekleog commented Jun 1, 2018

7c6f434c commented Jun 2, 2018

vcunat commented Jun 2, 2018

7c6f434c commented Jun 2, 2018

vcunat commented Jun 2, 2018 • edited

Hydra shares

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vcunat Jun 24, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

7c6f434c commented Jun 24, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vcunat commented Jun 24, 2018

7c6f434c commented Jun 24, 2018

vcunat commented Jun 24, 2018

orivej left a comment

Choose a reason for hiding this comment

oxij commented Oct 5, 2018 via email

xeji commented Oct 5, 2018

asymmetric commented Nov 11, 2018

LnL7 commented Nov 14, 2018

zimbatm commented Nov 15, 2018

jtojnar commented Nov 27, 2018

Mic92 commented Nov 27, 2018

zimbatm commented Nov 27, 2018

LnL7 commented Nov 28, 2018

Ekleog commented Nov 28, 2018

jtojnar commented Nov 29, 2018

nixos-discourse commented Apr 29, 2019

nixos-discourse commented Apr 29, 2019

nixos-discourse commented Aug 27, 2019

nixos-discourse commented Sep 23, 2019

jtojnar commented Dec 5, 2020 • edited

Ericson2314 commented Dec 5, 2020

jtojnar commented Dec 5, 2020

FRidh commented Dec 10, 2020 • edited

Critical security issues

Non-critical security fixes

vcunat commented Mar 5, 2018 •

edited

vcunat commented Mar 5, 2018 •

edited

FRidh commented May 29, 2018 •

edited

vcunat commented Jun 2, 2018 •

edited

vcunat Jun 24, 2018 •

edited

jtojnar commented Dec 5, 2020 •

edited

FRidh commented Dec 10, 2020 •

edited