Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grub installer: remove old kernels before copying new ones. #26165

Closed
wants to merge 1 commit into from

Conversation

abbradar
Copy link
Member

For this, first remember all files that need to be installed.
Then remove all files that are not in the list and copy new ones.

Motivation for this change

Fixes #23926.

Things done
  • Tested using sandboxing
    (nix.useSandbox on NixOS,
    or option build-use-sandbox in nix.conf
    on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • Linux
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Fits CONTRIBUTING.md.

Tested in two VMs -- with and without split /boot.

@mention-bot
Copy link

@abbradar, thanks for your PR! By analyzing the history of the files in this pull request, we identified @wkennington, @edolstra and @symphorien to be potential reviewers.

@abbradar
Copy link
Member Author

  1. They indeed can but because they are not linked by any profile there is no guarantee those configurations weren't garbage collected. I think we can assume they were -- it's not safe to rely on them;
  2. Same logic -- if the profile wasn't still garbage collected (i.e. if you just did a switch or boot) it won't be removed, and if it was all bets are off. Anyway I don't think you can end up in a situation when you don't have any kernel because boot profile won't be collected until before a reboot [citation needed].

@abbradar
Copy link
Member Author

abbradar commented May 28, 2017

I dug a little around my last statement -- seems it's half-false. It won't be garbage collected but its profile will be removed (collector won't touch it because of /run/booted-system symlink). Because install-grub.pl looks only at profiles however it will happily remove it. So you could be left with no kernels in /boot following this route:

  1. nixos-rebuild switch
  2. nix-collect-garbage -d
  3. nixos-rebuild switch, cut the power in the middle of install-grub.pl

A proper way to fix this would be to not remove kernels from /run/booted-system until the configuration file was replaced. I'll look at this (possibly this is also a problem for systemd-boot installer).

@abbradar
Copy link
Member Author

Ugh, there is still a problem left however. Suppose you'd do this:

  1. nixos-rebuild switch
  2. nix-collect-garbage -d
  3. nixos-rebuild switch (at this point grub.cfg will be replaced and there would be no entry for /run/booted-system)
  4. nix-collect-garbage -d
  5. nixos-rebuild switch, cut the power when old kernels were removed

On the third switch even though kernel from /run/booted-system would still exist there would be no GRUB entry for it., so effectively you'd end up without kernels.

The only way this one can be fixed IMO is by always adding a boot entry that says "Generation booted at $date" which points to current /run/booted-system when current-system != booted-system.

For this, first remember all files that need to be installed.
Then remove all files that are not in the list and copy new ones.
@abbradar
Copy link
Member Author

abbradar commented May 28, 2017

After more thought this was a false alarm: after nix-collect-garbage -d the user would still have current system kernel in their /boot after the first switch. It won't be touched on (3) so everything seems to be okay but I need someone to confirm my judgement.

@dezgeg
Copy link
Contributor

dezgeg commented May 28, 2017

The current form of install-grub.pl isn't safe to power cuts anyway, since there is no fsync() in sight to transactionally update any of the files.

But regardless of that, I think you still need to do either do the "Generation booted at $date" thing or some other flag file in /boot to figure out a kernel to keep from the old grub.cfg to solve e.g. this:

  1. Boot with kernelA.
  2. nix-collect-garbage -d
  3. Fill /boot with nixos-rebuild switch'ing with different kernels (kernelB1, kernelB2, ...)
  4. nix-collect-garbage -d
  5. nixos-rebuild switch with kernelC

Also there is this case to end up with only garbage-collected generations in /boot:

  1. Boot with configA
  2. nix-collect-garbage -d
  3. nixos-rebuild switch to configB
  4. nixos-rebuild test to configC
  5. nix-collect-garbage

@abbradar
Copy link
Member Author

abbradar commented May 28, 2017

@dezgeg I feel that the first case is the same that I was initially worried about, but in the end I think it's handled correctly because on final nixos-rebuild switch only kernels kernelB{1..N-1} will be removed. The last kernel won't be because when nix-collect-garbage runs B_n will be the current active profile so it won't be collected. So even if you cut the power in the middle of that you'd be left with working B_n.

If I understand correctly nixos-rebuild test doesn't create new profile symlink (only adds a GC root) so in the second case you'll have configB as the last profile configuration and it won't be GCed.

@abbradar
Copy link
Member Author

@dezgeg Sorry for poking: any comments on the last reply?

@dezgeg
Copy link
Contributor

dezgeg commented Jun 20, 2017

No problem... I think the second case is ok but the first one has a problem of nix-env -p "$profile" --set "$pathToConfig" being executed whether or not $pathToConfig/bin/switch-to-configuration boot succeeds. So if you have:

  • nixos-rebuild switch which runs out of space on /boot (but still mark the new system in the active profile)
  • nix-collect-garbage -d
  • nixos-rebuild switch which now as the very first step nukes all the kernels

@abbradar
Copy link
Member Author

@dezgeg But I assumed that nix-collect-garbage -d won't collect current booted derivation -- won't it? If my assumption is correct in the third step nixos-rebuild switch won't collect this one kernel.

@dezgeg
Copy link
Contributor

dezgeg commented Jun 20, 2017

Sure, nix won't garbage-collect it, but the GRUB generator doesn't care about kernels that aren't referenced by the system profile generations (and step 2 removed the generation).

@abbradar
Copy link
Member Author

Indeed, I've confirmed this problem in a VM. It currently exists in our systemd-boot installer. One solution would be to postpone removal of files from /run/booted-system until after you have swapped the configuration. I'll try to implement this for GRUB and systemd-boot later.

@flokli
Copy link
Contributor

flokli commented Nov 18, 2017

@abbradar Did you already have a chance to look at this? Just got bitten by it, too ;-)

@mmahut
Copy link
Member

mmahut commented Aug 11, 2019

Are there any updates on this pull request, please?

@abbradar
Copy link
Member Author

I didn't look at this for a long time. Let's close this for now.

@abbradar abbradar closed this Aug 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When /boot is full, system rebuilds fail
6 participants