Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

staging workflow #26

Merged
merged 4 commits into from Nov 15, 2018
Merged

staging workflow #26

merged 4 commits into from Nov 15, 2018

Conversation

vcunat
Copy link
Member

@vcunat vcunat commented Mar 5, 2018

Define a new workflow for the staging branch that can better accomodate the current and future influx of changes in order to deliver mass-rebuilds faster to master. As part of this new workflow an additional branch, staging-next, shall be introduced.

More details in the RFC proposal itself now.

@shlevy
Copy link
Member

shlevy commented Mar 5, 2018

@vcunat Before going into the specifics of how we can change things on the workflow side, which I agree are probably needed regardless, I do want to point out that there seems to be an issue with hydra that exacerbates this. For at least two 5+-day periods this month, there was a case where some build was blocked on "sending inputs" for many days and that for some reason resulted in the queue not being pulled from properly. It's possible with that issue fixed we won't have to worry about this quite yet.

@vcunat
Copy link
Member Author

vcunat commented Mar 5, 2018

Yes, I know this. If one is unlucky and a "mass-rebuild step" gets stuck this way...

@shlevy
Copy link
Member

shlevy commented Mar 5, 2018

So it seems like right now there are a bunch of different things we use staging for:

  1. Validate that package X still builds with my change to package X itself
  2. Validate that the problem motivating my change is fixed by my change (not applicable if this is just "bump the version because there's a new one")
  3. Validate that my change satisfies basic sanity checks for my system
  4. Validate that my change satisfies basic sanity checks for all systems
  5. Validate my change satisfies a full suite of checks for all systems when integrated with the latest, and make binaries available.

This is over-engineered, but just to share the thought: I think in principle we could achieve guaranteed* eventual merging of correct code with a series of queues, possibly prioritized. At each stage, jobs are processed in the order they are evaluated for that stage. Each jobset only allowed one evaluation in the given queue at a time, if it needs a new commit to be tested it goes back to the back of the line at that queue. A jobset at stage N+1 must have as its base a commit that succeeded at stage N. If any commit at stage M causes a change to the output path of one of the builds of stage N < M, that change must go back to the earliest such N. The last integration stage is special: merge commits that change the output paths of earlier steps but don't break them are allowed, but regular commits aren't.

The expected workflow is that you'd start off of some commit in master, iterate at each stage fixing the failures without changing what already worked, and only at the last stage merging in the latest and doing fixups there. This guaranteed eventual merging falls down if every time we get to the "integration" stage a new change has been merged that fundamentally breaks all earlier stages. But this seems very unlikely to happen by accident. Maybe we can simply allow only one change in the final step at a time, and allow arbitrary iteration within that step.

@peti
Copy link
Member

peti commented Mar 5, 2018

Note that staging is assigned a ridiculously low 0.01% of Hydra's scheduling shares, i.e. it's building pretty much only if nothing else is. Maybe we should re-consider that jobset's priority?

@vcunat
Copy link
Member Author

vcunat commented Mar 5, 2018

The regular jobsets certainly should have a very large advantage against staging, IMO, but on the whole we might agree on some system how to set the shares. It seems rather ad-hoc ATM.

@shlevy
Copy link
Member

shlevy commented Mar 22, 2018

@vcunat What do you think of this workflow:

In the normal case, most things can just go to staging and big changes (like a gcc bump) have to have their own release-small jobset pass first. At any time, a maintainer can branch off staging-stabilization, which then changes the workflow to: anything that fixes things broken on staging-stabilization goes directly there (or, if applicable, into master then merged into staging-stabilization) and is then merged into staging, any new mass-rebuild changes go into staging. Once staging-stabilization is ready, it is merged into master and we go back to the normal workflow. staging-stabilization can have a higher proportion of shares and, since only fixes can go into it, is much more likely to actually reach readiness in a timely manner than staging, without anyone having to wait to integrate their work in with others.

@grahamc
Copy link
Member

grahamc commented May 29, 2018

@vcunat how about we get this RFC written? :)

@vcunat
Copy link
Member Author

vcunat commented May 29, 2018

Well, I don't expect to really have time to "maintain" staging this year, with all other things I've "promised" to do, so I'm perhaps not a good person to act on this. I believe Shea's or mine proposal or some hybrid might work reasonably. Perhaps it's better to first start using something and see how it goes, and document the resulting workflow retrospectively? (Both of the changes really affect only Hydra's setup and workflow of "staging maintainer(s)" and not regular merges to staging.)

@FRidh
Copy link
Member

FRidh commented May 29, 2018

I also like the idea of two branches:

  1. the current staging that we're working on getting into master (i.e. stabilizing)
  2. the branch where other mass-rebuilds are staged that go into the next cycle

For 2) I would like to see a Hydra job that corresponds to a small set, say release-small plus some additional packages such as ghc. Just as a check before we start rebuilding everything.

@orivej orivej mentioned this pull request May 31, 2018
8 tasks
@xeji
Copy link

xeji commented Jun 1, 2018

I like the idea of staging -> staging-stabilization -> master.
We could even take that a step further and make it the standard workflow for everything:

  • All PRs go to staging by default. staging gets high scheduling priority on Hydra.
  • Only fixes to stabilize staging-stabilization or urgent fixes to master with limited rebuilds go there directly.
  • staging-stabilization is (automatically?) merged to master (then freshly branched off staging) when a defined set of tests succeed and the number of Hydra failures is below some threshold. Pretty much like a channel today but with stricter requirements. So master always has whatever we define as "release quality".

This approach would make it much easier to create and stabilize a release. It could also evolve into a rolling-release model in the future.

@Ekleog
Copy link
Member

Ekleog commented Jun 1, 2018

I'm not sure about having all PRs to staging: this loses the current advantage of having non-mass-rebuilding changes merged fast into master and thus into NixOS unstable. In exchange it'd mean more effort in getting staging merged, though, so maybe it'd actually be a net win?

@7c6f434c
Copy link
Member

7c6f434c commented Jun 2, 2018

@Ekleog Hm, good question how to have a fast track and a slow track without having fast track catch the slow track in eternal catching-up and re-testing…

@vcunat
Copy link
Member Author

vcunat commented Jun 2, 2018

I think it's doable here. Master especially shouldn't directly get any "risky" changes; if breakage happens, it should be quickly fixed or reverted ("moved" to another branch). Mass-rebuild status is now auto-determined on PRs (reliably), and non-mass changes rarely break very many packages, so workflow with fixes and running nixos tests can typically be iterated much faster than on staging.

@7c6f434c
Copy link
Member

7c6f434c commented Jun 2, 2018

Well, sometimes staging contains fixes for a package that has been updated on the master while the staging was building.

@vcunat
Copy link
Member Author

vcunat commented Jun 2, 2018

I merged the second-order PR immediately to reduce the level of indirection, but:

Hydra shares

I think Hydra shares should be lower for staging* than for master; that way master's small updates are delivered fast. Maybe we could even have some hydra.nixos.org -wide policy for the shares and check intervals (with particular numbers); it always feels ad-hoc when I'm setting those, mostly copying from some "similar" jobset. Possibly via another RFC :-)

# Alternatives
[alternatives]: #alternatives

What other designs have been considered? What is the impact of not doing this?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would need to be removed / reworded before merging. Something like this?

## Keep the statu quo

As explained above, this leads to issues with `staging` never managing
to stabilize due to the stream of mass-rebuilds.

## Merge all changes in `staging`

Also merging non-mass-rebuilds changes in `staging` would mean a
slower delivery of non-mass-rebuilding changes to `master`. Given
there is now a way to know, before merging a PR, whether it is a mass
rebuild, the benefit of having non-mass-rebuilding changes go to
`staging` are limited.

The most important one is duplicate work in case a package doesn't
build on `staging` while it has already been fixed in `master`. This can
be handled by regularly merging `master` in `staging`, as `master`
rebuilds should be almost free compared to `staging` rebuilds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filled. I can't see any potential advantage of "Merge all changes in staging" compared to the current workflow or the one proposed in this RFC, but it seems to be essentially the "Single branch" alternative in RFC draft now, if I understand it right.

# Future work
[future]: #future-work

-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, maybe “No future work is expected from this RFC”? Or something like “Adjust hydra intervals and build shares depending on the load that will be observed”?

Actually, maybe reevaluating the choice of “small fixes in master” vs. “small fixes in staging-next too” could make sense after some experimentation, depending on how frequently we manage to merge staging into master with this new scheme.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now "future work" is filled, basically in sense "put this RFC into action".

start-date: (fill me in with today's date, YYYY-MM-DD)
author: (name of the main author)
co-authors: (find a buddy later to help our with the RFC)
related-issues: (will contain links to implementation PRs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be filled in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.



The check interval of `staging` is reduced from 24 hours to 12 hours. This can
be done because only stabilization fixes shall be submitted and thus fewer
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence seems to contradict the definition of the branches above. "only stabilization fixes" is true for staging-next, not staging.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also slightly confused about what staging* branch is intended for pull requests vs stabilisation after reading this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to staging-next.

Confusion: I believe that any fixes-only changes should go to staging-next (counting only problems present in the branch already). I would hope that the current text makes it clear:

  • staging-next is branched from staging and only fixes to stabilize and security fixes shall be delivered to this branch. This branch shall be merged into master when deemed of sufficiently high quality.

but certainly feel free to suggest a clearer formulation.

The exact mechanism how they get there doesn't seem too important to me. Ideally they get committed to staging-next directly, but if they might be in staging already (accidentally) and get cherry-picked to staging-next...

[drawbacks]: #drawbacks

A potential drawback of this new workflow is that the additional branch may be considered complicated and/or more difficult to work with.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, most contributors would keep working as before: choose master or staging depending on the number of rebuilds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, added.

- `master` is the main branch and all small deliveries shall go here;
- `staging` is branched from `master` and mass-rebuilds and other large deliveries go to this branch;
- `staging-next` is branched from `staging` and only fixes to stabilize and security fixes shall be delivered to this branch.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify how changes end up in master. From the discussion, my understanding is that only staging-next (never staging directly) is eventually merged into master.
Do we want to set criteria for when staging-next is "good enough" for merging to master?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified.


Reducing the size of the Hydra jobset would mean the iteration pace could be
higher, but has the downside of testing fewer packages, and having fewer binary
substitutes available.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option here is to reduce the channel-blocking Hydra jobset size, but have a 48-hourly cycle of trying to build the long tail. Still probably increases amount of things broken in the channel, but might be a middle ground. (I am not saying I prefer that compromise to the current proposal)

Copy link
Member Author

@vcunat vcunat Jun 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The channel-update criteria contain waiting for all builds to finish, but the thing is doable by having two jobsets on the branch. Personally I think the channel-blocking set should grow and not shrink, especially for the release branches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I also think it is good to have options enumerated and explicitly dismissed with a reason given.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a sentence describing the approach.

@7c6f434c
Copy link
Member

I think you should remove the first line of the initial message now.


# Drawbacks
[drawbacks]: #drawbacks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the proposal, I am not completely sure if it will in fact reduce or increase the overall number of jobs for Hydra to build. It would increase usefuk information for unit of build time, but it might increase the load.

This risk should either be mentioned as a possible drawback or there should be a discussion why this risk is not likely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we will see how the load turns out in real life and adjust shares/intervals or even the whole workflow. People actively working on staging stabilization should also be able cancel jobs and force evaluations.

In any case, I would like to push towards more power in Hydra over the long term, as I believe it will help with mass rebuilds and human time is more precious than machine time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about Hydra access (this is a part of proposed workflow, right?), and I agree in principle about expanding Hydra checks.

I just want things mentioned explicitly.

I think such explicit discussion could also be a way to better understand the trade-offs relative to complicated alternatives: staging (release-small) -> staging-next (release-small) -> staging-next (release) (manually triggered, unless Hydra can have jobsets with preconditions).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I was always doing cancels and evals on Hydra when working with staging. The right to manage Hydra jobsets doesn't seem more security-sensitive that push access to nixpkgs. Overall, this text so far hasn't mentioned people – who will be doing this staging stabilization, etc.

Complicated alternatives: another thing to consider is that Hydra seems likely to be over-powered on x86 linux (relative to other platforms), and even for other reasons it makes sense for most changes to stabilize them for x86_64-linux first/deeper before others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it doesn't mention people but it describes what tools these people are expected to use (and how). You do discuss possible evaluation frequencies.

And yes, I would find it nice if you mentioned possible using buildpower disbalance.

@vcunat
Copy link
Member Author

vcunat commented Jun 24, 2018

@7c6f434c: which part do you mean?

@7c6f434c
Copy link
Member

The text is not empty anymore.

@vcunat
Copy link
Member Author

vcunat commented Jun 24, 2018

I see, I was just looking at the RFC text :-)

Copy link

@orivej orivej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems ready for implementation.

@oxij
Copy link
Member

oxij commented Oct 5, 2018 via email

@xeji
Copy link

xeji commented Oct 5, 2018

I agree. staging-next is a confusing name. Initially I thought it meant the next stuff to go into staging...

staging-next fits the man gitworkflows docs nicely, where pu (staging) -> next (staging-next) -> master.

Why not simply call it next then?

@asymmetric
Copy link
Contributor

Is there any agreement on how to proceed here? It's a bit unfortunate that we're stuck on a naming issue 😅

Let's throw a dice and let the universe decide.

@LnL7
Copy link
Member

LnL7 commented Nov 14, 2018

This should just be merged. The proposal is in use and nitpicking about the name seems pointless to me.

@zimbatm zimbatm merged commit 56f5acd into NixOS:master Nov 15, 2018
@zimbatm
Copy link
Member

zimbatm commented Nov 15, 2018

Merging as it's implemented

@jtojnar
Copy link

jtojnar commented Nov 27, 2018

Is the following diagram correct?

digraph {
    "small changes" [shape=none]
    "mass-rebuilds and other large changes" [shape=none]
    "critical security fixes" [shape=none]
    "broken staging-next fixes" [shape=none]

    "small changes" -> master
    "mass-rebuilds and other large changes" -> staging
    "critical security fixes" -> master
    "broken staging-next fixes" -> "staging-next"

    "staging-next" -> master [color="#E85EB0"] [label=stabilization] [fontcolor="#E85EB0"]
    "staging" -> "staging-next" [color="#E85EB0"] [label=stabilization] [fontcolor="#E85EB0"]

    master -> staging [color="#5F5EE8"] [label="any time"] [fontcolor="#5F5EE8"]
}

Open in Editor

@Mic92
Copy link
Member

Mic92 commented Nov 27, 2018

I think we can append this graph to the RFC.

@zimbatm
Copy link
Member

zimbatm commented Nov 27, 2018

Has the RFC workflow been copied to nixpkgs yet? RFC are just here to gain consensus, the live documentation should live somewhere else.

@LnL7
Copy link
Member

LnL7 commented Nov 28, 2018

@jtojnar I guess it's not really mentioned in the RFC but I think in general even critical security fixes should go to staging-next first.

This obviously depends on the size of the rebuild and current state of staging-next, but in my experience having a large queue on master is detrimental for the stability there and often results in breakages/issues that hold back channel updates.

@Ekleog
Copy link
Member

Ekleog commented Nov 28, 2018

It's actually in the RFC: (emphasis mine)

staging-next is branched from staging and only fixes to stabilize and security fixes shall be delivered to this branch.

(though granted I can't find a place where it explicitly states that security mass-rebuilds don't go in master)

@jtojnar
Copy link

jtojnar commented Nov 29, 2018

I understood that as fixes to stabilize staging and fix vulnerabilities on staging.

That interpretation is consistent with the doctrine that critical security fixes should go directly to the branch that is affected by the vulnerabilities. Not sure if this doctrine still applies with this RFC accepted.

@nixos-discourse
Copy link

This pull request has been mentioned on Nix community. There might be relevant details there:

https://discourse.nixos.org/t/release-process-staging-branches/2799/2

@nixos-discourse
Copy link

This pull request has been mentioned on Nix community. There might be relevant details there:

https://discourse.nixos.org/t/release-process-staging-branches/2799/3

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/maintainers-needed-for-merging-staging-staging-next-branches/3833/3

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-shell-on-a-python-package-gives-a-pip-error/4137/4

@jtojnar
Copy link

jtojnar commented Dec 5, 2020

I bring you the latest diagram:

Staging workflow

Now the reverse flow (master → staging-next → staging) happens automatically every six hours thanks to NixOS/nixpkgs#105153.

digraph {
    "small changes" [shape=none]
    "mass-rebuilds and other large changes" [shape=none]
    "critical security fixes" [shape=none]
    "broken staging-next fixes" [shape=none]

    "small changes" -> master
    "mass-rebuilds and other large changes" -> staging
    "critical security fixes" -> master
    "broken staging-next fixes" -> "staging-next"

    "staging-next" -> master [color="#E85EB0"] [label="stabilization ends"] [fontcolor="#E85EB0"]
    "staging" -> "staging-next" [color="#E85EB0"] [label="stabilization starts"] [fontcolor="#E85EB0"]

    master -> "staging-next" -> staging [color="#5F5EE8"] [label="every six hours/any time"] [fontcolor="#5F5EE8"]
}

Open in Editor

@Ericson2314
Copy link
Member

^ This is great, should go in the Nipxkgs manual.

@jtojnar
Copy link

jtojnar commented Dec 5, 2020

See NixOS/nixpkgs#105986 😉

@FRidh
Copy link
Member

FRidh commented Dec 10, 2020

Commenting on @jtojnar NixOS/nixpkgs#106302

This is a security fix so it should go to master according to #26. Actually, that says staging-next so I am confused because I kept hearing master and even repeated that myself.

Yes I think we need to revise/clarify what is written there. The rfc indeed says staging-next, however it lacks a motivation.
Regarding master versus staging-next, there are pros and cons to both. I don't recall what I was thinking about this at the time; the biggest issue back then was the influx of changes on staging all the time, making stabilization impossible.

Critical security issues

I'd argue that critical issues should go to master, however if staging-next is as good as ready, that should merged right as well as well because that saves rebuilds. At that point it does not matter whether the fix goes to master or staging-next regarding the amount of rebuilds. It does matter regarding channel updates; during the rebuild there won't be any updates due to other smaller changes, but that is not very important. What is important is that nixos-small gets updated.

This obviously depends on the size of the rebuild and current state of staging-next, but in my experience having a large queue on master is detrimental for the stability there and often results in breakages/issues that hold back channel updates.

This I agree with. So this is a trade-off, and I think in case of critical security fixes the delay of a day or two due to the rebuilds is worth it.

One should however also note that the security fix is typically also backported. A rebuild of the stable branch has a higher priority and that causes the blocking of master for a period longer than a day or two.

Non-critical security fixes

These go through staging as usual, to reduce the amounts of rebuilds necessary.

@vcunat vcunat deleted the staging branch April 14, 2021 08:03
KAction pushed a commit to KAction/rfcs that referenced this pull request Apr 13, 2024
* staging workflow: init empty RFC

* 0026-staging-workflow: first draft

* 0026-staging-workflow: second draft

Close vcunat#2

* 0026-staging-workflow alternatives: more discussion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet