New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nixos-containers: fix atomic restart #80169
base: master
Are you sure you want to change the base?
Conversation
Fixes NixOS#43652 Fixes NixOS#16753 Alternative fix for NixOS#39717 The fix in NixOS#76719 was not enough `--keep-unit` ties systemd-nspawn container to systemd unit. When unit is restarted (ie atomic operation "restart", not two separate "stop" and "start") systemd assumes that unit didn't disappear. Hence machine won't disappear. If unit is stopped (or stopped due to failure), associated machine is also stopped, so service can start fresh. Resource slice isn't changed though - it is machine.slice by default for nspawn containers.
if this will be merged, we can go on and cleanup the artifacts from old solutions - WINCH signal, mixed killmode, PID mgmt in imperative containers, restart as stop/start... |
cc @arianvp here. |
Previously, we were storing the leader pid in a runtime file and signalled SIGRTMIN+4 manually. In systemd 219, the `machinectl poweroff` command was introduced, which does that for us.
This seems fine. though im a bit confused by why it was originally introduced. I also don't understand why we need
From my understanding,
|
So I think, by merging this change, we can completely get rid of the This is the same issue we're fixing as documented here: systemd/systemd#2770 |
@danbst this looks good - can you do the above mentioned cleanups in this PR, too? |
We'll have to keep
|
@danbst poke ;-) |
@flokli I'm debugging tests, they fail on |
* machinectl terminate during container start was a hack, probably wrong one * use "systemctl restart" consistently for declarative and imperative containers * fix "nixos-container terminate" command. We have to notify systemd we want unit to be stopped, because otherwise it will be started back due to restart on failure policy. * add a bit docs to stop/terminate/destroy, to clarify which one does what * add missing "nixos-container restart" docs
Cleanups (from commit message):
@arianvp I was not able to remove Also, with this PR |
@GrahamcOfBorg test containers-imperative |
@@ -25,7 +25,7 @@ sub showHelp { | |||
nixos-container create <container-name> | |||
[--nixos-path <path>] | |||
[--system-path <path>] | |||
[--config <string>] | |||
[--config <string>] -- config without outer braces, for example 'services.nginx.enable = true;' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already documented in nixos/doc/manual/administration/imperative-containers.xml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's fine to duplicate it here. The hands-on --help
is too scarce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @danbst here. This is a common misunderstanding (which usually seems like a bug to an end-user), so mentioning this here as well is fine IMHO.
nixos-container restart <container-name> | ||
nixos-container stop <container-name> -- shutdown container cleanly, wait until stopped | ||
nixos-container terminate <container-name> -- halt container, like hard poweroff | ||
nixos-container destroy <container-name> -- terminate (halt) container and remove state data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should move into nixos/doc/manual/administration/imperative-containers.xml
(if not already there). Why were things reordered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, technically yes, but it's fine duplicating it here. No need to consult manual when stopping container.
Things were reordered to group stop/terminate/destroy commands. They do same, but with different nuances.
system("systemctl", "stop", "--no-block", "container\@$containerName"); | ||
system("machinectl", "terminate", $containerName) == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Urgh, this feels like a hack. Doesn't machinectl terminate
by itself make the systemd unit become stopped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is consequence of removing --keep-unit
, I haven't found a better way to stop unit with machine.
machinectl stop
does stop unit fine. It is because machine exits with normal result. But when it runs terminate
, it exits with failure result, and systemd restarts unit due to Restart=on-failure
policy.
I found a more proper fix (that cleanly shuts down the container with just
systemctl stop). We have a slight difference between the way upstream calls
systemd-nspawn. I have a patch ready and I'll test if it works tomorrow.
Wil also get rid of the WINCH hacks so that's good too.
This serves as a reminder to send that tomorrow when I have internet again
:)
…On Wed, Feb 26, 2020, 23:06 Danylo Hlynskyi ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pkgs/tools/virtualization/nixos-container/nixos-container.pl
<#80169 (comment)>:
> nixos-container status <container-name>
nixos-container update <container-name>
+ nixos-container restart <container-name>
+ nixos-container stop <container-name> -- shutdown container cleanly, wait until stopped
+ nixos-container terminate <container-name> -- halt container, like hard poweroff
+ nixos-container destroy <container-name> -- terminate (halt) container and remove state data
well, technically yes, but it's fine duplicating it here. No need to
consult manual when stopping container.
Things were reordered to group stop/terminate/destroy commands. They do
same, but with different nuances.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#80169?email_source=notifications&email_token=AAEZNI23LM57XO22RH7DDILRE3OR5A5CNFSM4KVWU6W2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCXCXNEQ#discussion_r384782585>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEZNIY7R2OBVCY2SP6GNNDRE3OR5ANCNFSM4KVWU6WQ>
.
|
@arianvp kind ping :) |
My 'fix' didn't really help unfortunately. I hit what I think is a bug in
systemd systemd/systemd#14961
(The fix was setting KillSignal=SIGRTMIN+3 with KeepUnit, such that
systemd-nspawn gracefully kills the underlying container on SIGTERM. This is the default for systemd-nspawn in --boot mode, but because we can't use boot mode in NixOS because the systemd binary in the container isn't in the expected place, you have to pass an additional --kill-signal).
Maybe it's something obvious that I'm doing wrong. I'll give it one more look Tuesday during Berlin NixOS meetup
…On Sun, Mar 1, 2020, 19:24 Danylo Hlynskyi ***@***.***> wrote:
@arianvp <https://github.com/arianvp> kind ping :)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#80169?email_source=notifications&email_token=AAEZNI43WFCPY22HCPMCFL3RFKR4DA5CNFSM4KVWU6W2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENNGVNY#issuecomment-593128119>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEZNIYHWP5UM3WDTRGKOUDRFKR4DANCNFSM4KVWU6WQ>
.
|
I started working on a draft for improved nixos-containers (using `networkd` and `.nspawn` units) after the networkd hackathon[1] which isn't published yet. Quite recently I realized that when changing a `.nspawn`-unit, the `switch-to-configuration.pl` doesn't activate those changes. This patch takes care of it with the following changes: * It's possible to declare whether to restart or reload such a unit. The restart option is the default. In that case the `systemd-nspawn@<machine-name>.service`[2]-unit will be restarted or reloaded. * By default, all `.nspawn`-units are part of the `machines.target`. * A VM-test covers all those cases including a custom reload-script to activate a new configuration in the machine. * I had to remove the `--keep-unit` flag on startup to fix the restart of the unit. This is a known issue[3]. It's also possible to use a reload to activate a new configuration inside a nspawn-machine with a config like this: ``` nix { pkgs, ... }: { systemd.nspawn.test-container.reloadOnChange = true; systemd.nspawn.test-container.restartOnChange = false; systemd.services."systemd-nspawn@test-container".serviceConfig.ExecReload = "${pkgs.writeScriptBin "activate" '' #! ${pkgs.runtimeShell} -xe systemd-run --machine test-container --pty --quiet -- /bin/sh --login -c \ '${containerCfg}/bin/switch-to-configuration test' ''}/bin/activate"; } ``` [1] https://discourse.nixos.org/t/networkd-sprint-2019-11-23-24-in-munich/4578 [2] https://github.com/systemd/systemd/blob/v243/units/systemd-nspawn@.service.in [3] NixOS#80169
I'm okay with this approach for now if we are adamant on getting it out for 20.03 If not, I would like to wait until we complete the 245 release, and see if the issues I encountered in the linked systemd issue goes away. If we keep using |
Ping @danbst; this is still an issue in master, and it would be nice to have this fixed there at least. |
I marked this as stale due to inactivity. → More info |
nixos containers: remove --keep-unit (revert 810680b)
Fixes #43652
Fixes #16753
Alternative fix for #39717
The fix in #76719 was not enough
cc @erikarvstedt @peterhoeg @flokli @edolstra