Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] nixos/systemd-boot: boot counting and automatic fallback #84204

Closed

Conversation

danielfullmer
Copy link
Contributor

@danielfullmer danielfullmer commented Apr 3, 2020

Motivation for this change

This draft PR adds support for systemd-boot "automatic boot assessment".
For background information, please first read the following document: https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT/

Setting boot.loader.systemd-boot.counters.enable = true enables this optional functionality.
Since what a "successful boot" exactly means is somewhat context dependent, systemd provides boot-complete.target which serves as a synchronization point that an end user could order necessary services before. The boot entry will be marked as successfully booting only if this target succeeds.
For unattended boots, there are some other potentially useful options such as adding panic=1 to boot.kernelParams or setting FailureAction=reboot for certain necessary services. However, these are end-user concerns that are somewhat orthogonal to boot counting and automatic fallback, and they probably do not belong in this PR.

This is a draft PR, as there are a few issues that still need to be resolved:

  • The menu entries appear to be ordered incorrectly upstream: systemd-boot menu ordering and boot counting systemd/systemd#15256 (fixed upstream, should be in v246)
  • The systemd-boot test needs a writable boot disk (since boot counters are state included in the entry filenames). I have a hack enabling this, which should be cleaned up somehow.
  • Documentation is needed

(edit: the issue below has been resolved to my satisfaction)

Perhaps the biggest current issue which could use some feedback on is how we set the "default" for systemd-boot. Normally, systemd-boot-builder.py sets the default entry in loader.conf. The logic in systemd-boot means that this entry would always be selected--overriding anything having to do with boot counters. I've disabled setting default in this draft PR, and so systemd-boot should select the latest generation which does not have 0 tries left. However, this would break the ability for the user to switch back to a previous generation, which is clearly undesirable. Perhaps we need to further patch systemd-boot to use default only if it has remaining tries left, and otherwise select the latest generation with remaining tries left.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@grahamc
Copy link
Member

grahamc commented Apr 11, 2020

cc @adisbladis for interest

@flokli
Copy link
Contributor

flokli commented Apr 23, 2020

@danielfullmer with 245.3 merged into master, can you rebase this PR and see if it's still broken upstream?

@danielfullmer
Copy link
Contributor Author

danielfullmer commented Apr 23, 2020

@flokli Sure, just rebased. There haven't been any relavent changes to systemd-boot in the meantime.

Since I haven't received any response about the menu ordering here systemd/systemd#15256, I'll send a patch upstream to see if that sparks any discussion.

I also recently patched systemd-boot to ignore default entries which have no tries left. That allows me to remove the hack where I commented out the default setting in systemd-boot-builder.py. I'll see if I can get any upstream feedback on whether the original behavior was intended or not.

Just to be clear, the logic is now this: We select the default entry specified in loader.conf (which should be the current generation), unless this entry has zero tries left. In that case, we select the entry with the highest id (entry filename--basically, generation number) which has a nonzero number of tries left.

Upstream PR here: systemd/systemd#15577

@flokli
Copy link
Contributor

flokli commented Jun 14, 2020

I just prepared the bump to 245.6, which doesn't seem to include systemd/systemd#15577 - do we need to wait for 9eaac58683788ae1870760ad61d594c8ce7a43a8 to have landed in our systemd package?

We also might need to address #86422 (comment), so actual sd-boot fixes manifest on existing installations.

@danielfullmer
Copy link
Contributor Author

@flokli Yeah, it's probably best to wait for the next major systemd release before merging.

I actually have a little more free time now. So I'll see if I can take a look at #86422 again either today or tomorrow.

@danielfullmer
Copy link
Contributor Author

I've resolved the earlier issue with the default selection to my satisfaction. To review, the problem is that systemd will always select an entry matching default set in loader.conf, even if that entry didn't have any tries left. Since we currently set default to only match a single entry, it would never fallback to any other entry.

My fix was this: instead of setting (for example) default nixos-generation-15, we now set default nixos-generation-*, and rely on the systemd-boot's lexicographic ordering of the entry filenames to ensure we actually default to the proper entry. This way, if the default boot entry has no tries left, there could still be another entry matching nixos-generation-* which systemd-boot could try instead.

Since the proper default generation might not be the highest numbered generation, I've renamed the default entry to (for example) nixos-generation-default-15. This would be compared against other entries named (for example)nixos-generation-20. Since 'd' is lexicographically greater than '1' (or any other digit), nixos-generation-default-15 is ordered last, which is the one that systemd-boot chooses as the default selection. This does seem a little hackish, but I think this is the best approach. I will add comments to the source explaining this.

This approach has the additional benefit that non-default system profiles would be respected. If the user has run nixos-rebuild boot --profile-name test, and the latest test generation is not booting correctly, systemd-boot should only choose other test generations, not any other profile's generations.

@cole-h
Copy link
Member

cole-h commented Mar 22, 2021

Hi @danielfullmer,

I just wanted to swing by and say: thank you for your work on this! I tested this on my machine just now and it works flawlessly.

Is there anything I can do to help move this into a mergeable state?

EDIT: Removed from the Staging project because it looks like this is no longer a mass rebuild.

@cole-h cole-h removed this from WIP in Staging Mar 22, 2021
Copy link
Member

@cole-h cole-h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I just noticed (after having switched off of using this PR) -- I still have boot entries with counters attached, but I also have the same profiles without counters. e.g.:

$ ls /boot/loader/entries
[...]
.rwxr-xr-x 690 root 22 Mar 11:49 nixos-generation-70+0-1.conf
.rwxr-xr-x 580 root 23 Mar 13:29 nixos-generation-70.conf
.rwxr-xr-x 690 root 22 Mar 11:49 nixos-generation-71+0-1.conf
.rwxr-xr-x 580 root 23 Mar 13:29 nixos-generation-71.conf
.rwxr-xr-x 690 root 22 Mar 11:49 nixos-generation-72+0-1.conf
.rwxr-xr-x 580 root 23 Mar 13:29 nixos-generation-72.conf
.rwxr-xr-x 690 root 22 Mar 11:49 nixos-generation-73+0-3.conf
.rwxr-xr-x 580 root 23 Mar 13:29 nixos-generation-73.conf
.rwxr-xr-x 580 root 23 Mar 13:29 nixos-generation-74.conf
.rwxr-xr-x 580 root 23 Mar 13:29 nixos-generation-76.conf
.rwxr-xr-x 690 root 22 Mar 11:49 nixos-generation-default-74+3.conf

If boot counters are ever disabled and we see that there are profiles with a counter attached, what should be done? Nothing (and continue to regenerate the counted profiles on nixos-rebuild switch)? Skip it (e.g. don't regenerate that profile)?


generation_dir = os.readlink(system_dir(self.profile, self.generation))
tmp_path = self.path + ".tmp"
kernel_params = "systemConfig=%s init=%s/init " % (generation_dir, generation_dir)
Copy link
Member

@cole-h cole-h Mar 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of 5276ebb, systemConfig has been removed.

Suggested change
kernel_params = "systemConfig=%s init=%s/init " % (generation_dir, generation_dir)
kernel_params = "init=%s/init " % generation_dir

@stale
Copy link

stale bot commented Sep 21, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Sep 21, 2021
@RaitoBezarius
Copy link
Member

This still looks interesting, can we do something to help?

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Oct 17, 2021
@stale
Copy link

stale bot commented Apr 16, 2022

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Apr 16, 2022
@RaitoBezarius
Copy link
Member

Still interested, can we help you @danielfullmer ?

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Apr 16, 2022
@danielfullmer
Copy link
Contributor Author

Man this has been on my TODO list for so long, apologies for not taking it the last 20% and trying to get it merged. I now have a concrete need for this feature so I intend to work on this soon (on the order of weeks). If anyone gets to it before I do: I know we'll at least need to (1) Resolve any conflicts and ensure the feature still works (2) Figure out a solution to @cole-h 's issue of how to handle boot entries if we turn the feature off and on, and (3) Write some documentation/release notes for the feature. This is very likely too late to make it into 22.05, but it'd be nice to get it merged into master for the release after that.

@lovesegfault
Copy link
Member

@danielfullmer Gentlest of pings :)

@dasJ
Copy link
Member

dasJ commented Jul 26, 2022

Partly related, I have a half-baked implementation of this for grub at home. Main issue is that it's shitty to implement when we want proper counting and that I cannot get the test runner to do what I want

@domenkozar
Copy link
Member

As part of Cachix Deploy, I'd very much like to sponsor this work to get it fixed for everyone in NixOS itself. Let me know if someone is willing to get it done :)

@danielfullmer
Copy link
Contributor Author

@domenkozar I'm actually currently working on bringing this PR up-to-date with master right now. I know I've been meaning to work on this for a while now, and I don't want to prevent you from sponsoring someone else to work on this if needed. But if I don't have something within ~2 weeks, then take that as a sign that I might not get to it for a while longer.

@samueldr
Copy link
Member

samueldr commented Aug 9, 2022

@domenkozar I'm actually currently working on bringing this PR up-to-date with master right now. I know I've been meaning to work on this for a while now, and I don't want to prevent you from sponsoring someone else to work on this if needed. But if I don't have something within ~2 weeks, then take that as a sign that I might not get to it for a while longer.

Are you unable to get sponsored for this? Not sure if that's what @domenkozar could have had in mind.

@danielfullmer
Copy link
Contributor Author

Are you unable to get sponsored for this? Not sure if that's what @domenkozar could have had in mind.

It might be complicated for some unrelated reasons--but I'll reach out directly.

@RaitoBezarius
Copy link
Member

As part of Cachix Deploy, I'd very much like to sponsor this work to get it fixed for everyone in NixOS itself. Let me know if someone is willing to get it done :)

I'm interested into taking over for this and bringing this to completion if @danielfullmer cannot really anymore.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/where-did-you-get-stuck-in-the-nix-ecosystem-tell-me-your-story/31415/6

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/pi4-broken-on-23-11-uefi-zfs-root/36375/3

@infinisil infinisil added the significant Novel ideas, large API changes, notable refactorings, issues with RFC potential, etc. label Dec 3, 2023
@n8henrie
Copy link
Contributor

n8henrie commented Dec 8, 2023

@dasJ Could you share more about your grub implementation?

Would love to see this PR go forward!

@RaitoBezarius
Copy link
Member

Closing in favor of #273062 which is close to completion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: nixos 8.has: module (update) 10.rebuild-darwin: 0 10.rebuild-linux: 1-10 significant Novel ideas, large API changes, notable refactorings, issues with RFC potential, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet