Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NixOS/auto-upgrade: offer the possibility to define a reboot window during which the system may be automatically rebooted #77622

Merged
merged 1 commit into from Mar 17, 2022

Conversation

r-vdp
Copy link
Contributor

@r-vdp r-vdp commented Jan 13, 2020

Motivation for this change

Some systems should not be rebooted at just any time. If the upgrade process takes too long, for instance because of a slow internet connection, or if the upgrade service is ran during production hours, we want to allow to define a window outside of which a reboot will not be performed.
The system will then reboot on the next run of the upgrade service which finishes inside the reboot window.

E.g. we can run the update service twice per week, once during the night and once during the day, but reboots are only allowed during the night. By doing so, a system that is usually shut down during the night will still receive updates and systems that are turned on 24/7 can be rebooted outside of production hours.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@r-vdp
Copy link
Contributor Author

r-vdp commented Jan 13, 2020

@danbst, could you review this?

@r-vdp
Copy link
Contributor Author

r-vdp commented Jan 25, 2020

@danbst, thanks! Anything else needs to happen to have this merged?

@danbst
Copy link
Contributor

danbst commented Jan 27, 2020

@r-vdp sorry, I wait for more 👀 . But I don't quite know whom to ping for another review.
I'll redirect this to 20.03 release managers @NixOS/nixos-release-managers

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/what-are-your-goals-for-20-03/4773/14

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-already-reviewed/2617/92

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-20-03-feature-freeze/5655/29

@rnhmjoj
Copy link
Contributor

rnhmjoj commented Feb 8, 2020

I haven't fully tried it yet but it looks all right.
I wrote a NixOS test for this but unfortunately it doesn't fully work: since the VM has no internet access Nix can't rebuild the system. I hoped it could pick derivations from the host nix store but it doesn't really work. If somebody wants to try fixing it it's here: rnhmjoj@01d1284

Copy link
Member

@infinisil infinisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't show correct behavior when the time span goes through midnight. E.g. with lower = "23:00", upper = "02:00" and let's say it runs on Tuesday 01:00, then the condition makes it only reboot if "Tuesday 01:00 is before Tuesday 02:00 and after Tuesday 23:00", which won't be the case.

'';
${optionalString (cfg.rebootWindow != null) ''
elif [[ "''${current_time}" < "${cfg.rebootWindow.lower}" ]] || \
[[ "''${current_time}" > "${cfg.rebootWindow.upper}" ]]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still wrong. Example is lower = 12:00, upper = 6:00, current = 18:00 -> This is within the window, but the code doesn't do it then.

After a bit of thinking, I'd say the best way to correctly implement this is to: Determine if lower > upper, and if so, flip them around. Then check whether lower < cur < upper, and toggle this result if lower > upper from before. If the result is true, do the reboot, otherwise don't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@infinisil quite late, but I implemented this change just now.

@stale
Copy link

stale bot commented Jun 6, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 6, 2021
@r-vdp r-vdp force-pushed the nixos-upgrade-reboot-window branch from 854de96 to f6e06a8 Compare October 27, 2021 15:21
@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Oct 27, 2021
@r-vdp r-vdp force-pushed the nixos-upgrade-reboot-window branch from f6e06a8 to 687196c Compare October 27, 2021 15:32
@r-vdp r-vdp requested review from infinisil and removed request for worldofpeace October 28, 2021 16:05
@infinisil
Copy link
Member

Taking another look at this, unless I'm missing something, I think there's a big flaw with this: It's not useful, because the upgrade time is already fixed by the system.autoUpgrade.dates options. Assuming system.autoUpgrade.allowReboot is enabled:

  • If system.autoUpgrade.dates is within system.autoUpgrade.rebootWindow, then it always reboots at 4:40 if there was a kernel/initrd update.
  • Otherwise it never does, even if there are kernel/initrd updates, because it always happens at the same time.

As such, the rebootWindow option doesn't really do anything. Yes there's system.autoUpgrade.randomizedDelaySec, but that doesn't make it much better.

@r-vdp
Copy link
Contributor Author

r-vdp commented Dec 9, 2021

Taking another look at this, unless I'm missing something, I think there's a big flaw with this: It's not useful, because the upgrade time is already fixed by the system.autoUpgrade.dates options. Assuming system.autoUpgrade.allowReboot is enabled:

* If `system.autoUpgrade.dates` is within `system.autoUpgrade.rebootWindow`, then it always reboots at 4:40 if there was a kernel/initrd update.

* Otherwise it never does, even if there are kernel/initrd updates, because it always happens at the same time.

As such, the rebootWindow option doesn't really do anything. Yes there's system.autoUpgrade.randomizedDelaySec, but that doesn't make it much better.

@infinisil my use case is that we have many servers that are in very remote locations on often slow uplinks (sometimes satellite). As such, when an upgrade requires downloading a lot of new store paths, it may not finish before working hours start again, and so I want to be able to make sure that under no circumstances our servers reboot during working hours.

We also have servers that are turned off outside of working hours, due to power restrictions. For those, we run the upgrade service during working hours when the servers are turned on, but I don't want them to reboot at that time, they will activate the new generation the next morning when they are started again. As our servers are located in volatile contexts, we try to configure them as generically as possible so that they are operational no matter what the local staff decides is best in terms of uptime and power usage.

Does that make sense?

@infinisil
Copy link
Member

I see, that does make more sense. That does make me think of an alternative approach that could be a bit more deterministic:

  • Have a timer which nixos-rebuild build's the system every e.g. hour, and have it detect when there's a change (by comparing /nix/store hashes)
  • Depending on whether an activation and a reboot is needed:
    • If neither is needed, don't do anything
    • If only an activation is needed, and we're within system.autoUpgrade.activationWindow (a new option), do an activation, otherwise don't do anything
    • If only a reboot is needed, and we're within system.autoUpgrade.rebootWindow, do a reboot, otherwise don't do anything
    • If both an activation and a reboot is needed:
      • If we're not within activationWindow, don't do anything (we can't reboot either, since we need to have the current system be activated first)
      • Otherwise, do an activation
        • If we're also within rebootWindow, do a reboot after activation

This would mean that you can control when both activation and reboots can happen, and if the script makes sure to gcroot the system derivation when it first builds it (even if it can't activate/reboot yet), you can be sure that it will do the activation/reboot the next time it can, since everything is built already. How does that sound?

In any case, I don't think this idea conflicts with this PR and can be implemented later.

nixos/modules/tasks/auto-upgrade.nix Outdated Show resolved Hide resolved
@r-vdp
Copy link
Contributor Author

r-vdp commented Dec 9, 2021

I see, that does make more sense. That does make me think of an alternative approach that could be a bit more deterministic:

* Have a timer which `nixos-rebuild build`'s the system every e.g. hour, and have it detect when there's a change (by comparing `/nix/store` hashes)

* Depending on whether an activation and a reboot is needed:
  
  * If neither is needed, don't do anything
  * If only an activation is needed, and we're within `system.autoUpgrade.activationWindow` (a new option), do an activation, otherwise don't do anything
  * If only a reboot is needed, and we're within `system.autoUpgrade.rebootWindow`, do a reboot, otherwise don't do anything
  * If both an activation and a reboot is needed:
    
    * If we're not within `activationWindow`, don't do anything (we can't reboot either, since we need to have the current system be activated first)
    * Otherwise, do an activation
      
      * If we're also within `rebootWindow`, do a reboot after activation

This would mean that you can control when both activation and reboots can happen, and if the script makes sure to gcroot the system derivation when it first builds it (even if it can't activate/reboot yet), you can be sure that it will do the activation/reboot the next time it can, since everything is built already. How does that sound?

In any case, I don't think this idea conflicts with this PR and can be implemented later.

That's an interesting suggestion indeed!

There is one issue that I see with this approach though. On our servers we make the difference between a deployment of a new config from our git repository (eg addition of a new user or an updated ssh key, updated definition of a service, etc) which never requires a reboot, and an actual upgrade. Only during an upgrade do we we also update the nix channel and so do we potentially need to reboot after.

Changes to our git repo are under certain conditions deployed automatically to our servers. However, if we upgrade every hour and then wait until we're in the reboot window to reboot, then the nix channel can already have been updated when a config change is being deployed and so at that point, if a reboot was required, we cannot easily activate this new config without also switching the kernel, restarting services, etc.

I also don't think that there's an easy way to roll back the nix channel to the point from which the current generation was derived.

All of this would not necessarily be an issue, but I noticed in the past that activating a config with an updated kernel without rebooting can cause issues where we can for instance no longer exec into docker containers, which is quite annoying for our support staff. Besides the fact that restarting services during production hours can also cause issues.

So, at least for our use case, it's important that we only do an upgrade when we'll actually be able to reboot the server.

When the upgrade service exceeds the reboot window, we obviously also end up in the described situation, but this is assumed to be quite exceptional since such long running updates are not common, usually only on a new release.

Copy link
Member

@infinisil infinisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry haven't had a lot of time recently, but I should have some now. This is the only nit that I think should still be addressed, after that I'll merge it :)

nixos/modules/tasks/auto-upgrade.nix Outdated Show resolved Hide resolved
nixos/modules/tasks/auto-upgrade.nix Outdated Show resolved Hide resolved
…uring which the system may be automatically rebooted

Some systems should not be rebooted at just any time. If the upgrade process takes too long, for instance because of a
slow internet connection, or if the upgrade service is ran during production hours, we want to allow to define a window
outside of which a reboot will not be performed.
The system will then reboot on the next run of the upgrade service which finishes inside the reboot window.

E.g. we can run the update service twice per week, once during the night and once during the day, but reboots are only
allowed during the night. By doing so, a system that is usually shut down during the night will still receive updates
and systems that are turned on 24/7 can be rebooted outside of production hours.

Co-authored-by: Silvan Mosberger <github@infinisil.com>
@r-vdp r-vdp force-pushed the nixos-upgrade-reboot-window branch from 24798df to 39f3eb3 Compare March 9, 2022 07:18
@r-vdp
Copy link
Contributor Author

r-vdp commented Mar 9, 2022

@infinisil great, thanks!

I incorporated the proposed changes, squashed everything together, and rebased the PR on top of master.

@r-vdp r-vdp requested a review from infinisil March 17, 2022 07:28
Copy link
Member

@infinisil infinisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, let's merge this finally!

@infinisil infinisil merged commit 839b9b8 into NixOS:master Mar 17, 2022
@r-vdp r-vdp deleted the nixos-upgrade-reboot-window branch March 17, 2022 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet