Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd: Don't order systemd-user-sessions.service after network.target #61098

Closed

Conversation

gloaming
Copy link
Contributor

@gloaming gloaming commented May 7, 2019

Fix delayed startups waiting for DHCP, see #60900

The startup time issue seems much more annoying to me than the original SSH session issue, so I think we should merge this now. If others disagree, I guess we wait until we have a proper fix for the latter.

Motivation for this change

Improve boot times for local NixOS users.

Things done

I tested by overriding the package in configuration.nix:

systemd.package = pkgs.callPackage /home/craig/nixpkgs/pkgs/os-specific/linux/systemd/default.nix {};
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@worldofpeace
Copy link
Contributor

Instead of patching the upstream unit which causes an unecessarry mass rebuild, we can override it within the systemd nixos module

systemd.services.systemd-user-sessions.restartIfChanged = false; # Restart kills all active sessions.

@dtzWill
Copy link
Member

dtzWill commented May 8, 2019

Madness, all madness! 😆 .

This all seems like we've made things a bit of a mess? I apologize for any misunderstanding, but here's what it looks like:

  • wtf @ making everyone wait for network because other folks can't be bothered to use network-online.target if they need connectivity before running. (is this an accurate summary of the situation?)
  • Not having network disappear out from user services.. seems like the right/appropriate behavior. Please don't casually change that? Is the alternative ordering "remove network interfaces, THEN try to stop user services"?? Because if that's correct that seems rather hostile and likely problematic. At that point we might as well just make step one "cut power to the entire rack" and only afterwards signal user services... ;).
  • I don't understand why "waiting for dhcpcd to fork" is somehow a significant delay (but apparently it is, and I'm sure it has reasons that make sense <3), but reading our thread on the issue it appears what happened was someone said "my cloud-init use case would be improved as a result of this change" so the change was made but as a result (unexpectedly) a number of users report giant delays in day-to-day use of their actual machine.
    This does not seem like we made the right trade-off here, right? Ideally all use cases can be accommodated but I must say.... "why does your laptop take 30 seconds to boot, with a black screen with no indication of progress" => "because someone wanted to use cloud-init and apparently that means we have to intentionally degrade things for all" --- not at all hating on cloud-init but it seems poorly motivated. More importantly, IMO, there seems no technical justification for imposing this "cost" across all users-- other solutions or even just adding an option?

I suppose I'm inclined to say we just revert the other change and.... add a dependency on network-online,target from cloud-init? (err surely it's not that simple? sorry)


Thanks for working on this, hopefully we can get this sorted soon. See also #60954 , I'm working on related issues as well.

Copy link
Contributor

@flokli flokli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not in favour of increasing our amount of patches on top of systemd.

Reading up on #60900 and #44524 (comment):

It seems adding dhcpcd.service to network.target is wrong, it should be requiredBy=network-online.target (only when dhcpcd is enabled globally), and cloud-init.service should set Wants/After=network-online.target. systemd provides (three) network-specific targets, and we should use them:

Units that strictly require a configured network connection should pull in network-online.target (via a Wants= type dependency) and order themselves after it. This target unit is intended to pull in a service that delays further execution until the network is sufficiently set up. What precisely this requires is left to the implementation of the network managing service.

As described in #44524 (comment), dhcpcd's fork behaviour might become a bit unpredictable in dualstacked configurations. But I hope this won't become too much of an headache, given it only affects defaults (and most people configure their networks somehow), and we'd like to switch over to networkd anyways, which allows much more fine-grade control on when to reach network-online.target.

@gloaming
Copy link
Contributor Author

gloaming commented May 8, 2019

Instead of patching the upstream unit which causes an unecessarry mass rebuild, we can override it within the systemd nixos module

Unfortunately, this doesn't appear to be possible for dependency configuration. All nix-specified unit options are placed in a drop-in file:

$ systemctl status systemd-user-sessions.service
● systemd-user-sessions.service - Permit User Sessions
   Loaded: loaded (/nix/store/nih5bga6fwphwm1la0wy2pxnv2wng675-systemd-239.20190219/example/systemd/system/systemd-user-sessions.service; enabled; vendor preset: enabled)
  Drop-In: /nix/store/mfmr9a17jg7j9bkrj8xwg4zlf29v2mgy-system-units/systemd-user-sessions.service.d
           └─overrides.conf

But drop-ins are unable to remove dependencies, only add them, as explained in the small print at the end of systemd.unit(5):

Note that for drop-in files, if one wants to remove entries from a setting that is parsed as a list (and is not a dependency), such as AssertPathExists= (or e.g. ExecStart= in service units), one needs to first clear the list before re-adding all entries except the one that is to be removed. Dependencies (After=, etc.) cannot be reset to an empty list, so dependencies can only be added in drop-ins. If you want to remove dependencies, you have to override the entire unit.

I suppose we ought to find a better way to deal with this.
Your comment did make me think to try overriding the derivation:

    systemd.package =
      pkgs.systemd.overrideAttrs (oldAttrs:
       { patches = [ /etc/nixos/user-sessions-not-after-network.patch ]; });

but while it works in configuration.nix, it fails silently in boot/systemd.nix, seemingly ignoring it completely:

$ sudo nixos-rebuild switch -I ~/nixpkgs
[sudo] password for craig: 
building Nix...
building the system configuration...
activating the configuration...
setting up /etc...
reloading user units for craig...
setting up tmpfiles
$ systemctl list-dependencies systemd-user-sessions.service --after | grep network.target
● ├─network.target

Is that expected? It seems strange that there's no error at all.

@gloaming
Copy link
Contributor Author

gloaming commented May 8, 2019

Madness, all madness!

I gazed long into the bug, and the bug also gazed into me :)

I apologize for any misunderstanding

That's quite alright, I'm just glad anyone is interested!

wtf @ making everyone wait for network because other folks can't be bothered to use network-online.target if they need connectivity before running. (is this an accurate summary of the situation?)

Re-reading #44524, I think this is pretty much it, actually.
I don't think the concern about IPv4/IPv6 is warranted, since with the current behaviour, services may be relying on network.target, which will already activate as soon as one of the two is ready. Nothing breaks here if we change to network-online.target in both cases.

Not having network disappear out from user services.. seems like the right/appropriate behavior. Please don't casually change that? Is the alternative ordering "remove network interfaces, THEN try to stop user services"?? Because if that's correct that seems rather hostile and likely problematic. At that point we might as well just make step one "cut power to the entire rack" and only afterwards signal user services... ;).

Haha! If I understand correctly (certainly this is how it's documented - see the manpage), systemd-user-sessions.service does not actually have anything much to do with sessions in the systemd sense, or with user services at all, but it merely needs to run to allow user logins.

This is why I referred to the upstream change as a hack - it only happens to work for SSH because disabling logins with PAM will forcibly close the inbound SSH sessions. Instead, systemd should have a principled way to terminate those sessions either 1) when sshd.service is stopped or 2) before network-online.target is unreached.

It's true, I think, that killing user logins will kill any services running under users who are not set to linger. However, systemd-user-sessions.service is ordered before the user's scope, and I am pretty sure that systemd must terminate the user's processes and services before closing their scope!

$ systemctl list-dependencies systemd-user-sessions.service --before
systemd-user-sessions.service
● ├─getty@tty1.service
● ├─session-1.scope
● ├─systemd-ask-password-wall.service
● ├─user@1000.service

I don't understand why "waiting for dhcpcd to fork" is somehow a significant delay

It waits to acquire a lease, so that depends how fast your DHCP server is ;) it may take several seconds even if nothing is wrong. By default, dhcpcd is more cautious than networkd. More importantly, if you have no network connection, then you must wait for dhcpcd to timeout!

add a dependency on network-online,target from cloud-init?

I think this particular fix is already in place.

Thanks for working on this, hopefully we can get this sorted soon. See also #60954 , I'm working on related issues as well.

Cheers mate, keep fighting the good fight!

@gloaming
Copy link
Contributor Author

gloaming commented May 8, 2019

I'm really not in favour of increasing our amount of patches on top of systemd.

Hah, yes, I see what you mean. Though unless I'm missing something, the existing ones are Debian's patches, not ours. Running patchPhase against systemd/master, it appears that most of them are outdated, so should this be a concern going forward?

(Actually, I don't understand this at all. If we are maintaining our own branches of systemd, why are we pulling patches from other people??)

Reading up on #60900 and #44524 (comment):

It seems adding dhcpcd.service to network.target is wrong, it should be requiredBy=network-online.target

Agreed, for sure! However...

I would like to clarify that this change is not merely a workaround for a bad interaction with dhcpcd. The upstream config is a bug, regardless of whether it takes 30 seconds or 30 milliseconds to reach network.target, because user logins shouldn't have to wait for network adapters. It's semantically wrong, and I intend to file an issue upstream also :)

@flokli
Copy link
Contributor

flokli commented May 8, 2019

I'm really not in favour of increasing our amount of patches on top of systemd.

Hah, yes, I see what you mean. Though unless I'm missing something, the existing ones are Debian's patches, not ours. Running patchPhase against systemd/master, it appears that most of them are outdated, so should this be a concern going forward?

(Actually, I don't understand this at all. If we are maintaining our own branches of systemd, why are we pulling patches from other people??)

These branches add some NixOS-specific things on top of systemd, but while doing so, also accidentially break systemd-tmpfiles - see #56265 (comment) for that discussion

Reading up on #60900 and #44524 (comment):
It seems adding dhcpcd.service to network.target is wrong, it should be requiredBy=network-online.target

Agreed, for sure! However...

I would like to clarify that this change is not merely a workaround for a bad interaction with dhcpcd. The upstream config is a bug, regardless of whether it takes 30 seconds or 30 milliseconds to reach network.target, because user logins shouldn't have to wait for network adapters. It's semantically wrong, and I intend to file an issue upstream also :)

Yes, please engage upstream in the discussion, that's where these things should be solved :-)

@flokli
Copy link
Contributor

flokli commented Jun 15, 2019

@gloaming can this be rebased and rechecked across current master (contains systemd 242)? Did you check back with upstream on how this should be done?

Revert upstream commit 8c85680
This was causing delayed logins at boot waiting on dhcpcd.
Fixes NixOS#60900
@gloaming gloaming force-pushed the user-sessions-not-after-network branch from 6961847 to 3b0a8c6 Compare June 18, 2019 23:48
@gloaming
Copy link
Contributor Author

Rebased and tested locally using systemd.package again. I got hit by #31540, but that appears to be unrelated to this change. Otherwise everything seems OK!

@flokli I haven't contacted upstream yet, I'll try to get that done in the next couple of days.

I realised why the overrideDerivation in boot/systemd.nix didn't work - I was doing -I ~/nixpkgs instead of -I nixpkgs=~/nixpkgs. Oops! I'll look into it again. Would it make sense to add some foolproofing around that, i.e. warn if the added path is a nixpkgs tree?

@gloaming
Copy link
Contributor Author

I filed a PR at NixOS/systemd#28, which seems more logical than adding a patch in Nixpkgs.

@flokli
Copy link
Contributor

flokli commented Jun 20, 2019

@gloaming we should contact upstream (github.com/systemd/systemd) first, before adding custom patches on our fork of systemd.

@gloaming
Copy link
Contributor Author

Upstream issue filed at systemd/systemd#12853

@flokli
Copy link
Contributor

flokli commented Jul 9, 2019

@gloaming reading into nixos/modules/services/networking/dhcpcd.nix, I think the right way to fix this would be changing dhcpcd.service to be before = ["network-online.target"], not ["network.target"].

We fixed the cloud-init problem descibed at 48f7778 in 57840db by setting wants = ["network-online.target"], so network.target (and systemd-user-sessions too) should be reached quite quickly.

Can you check if changing dhcpcd.service works for you?

cc @vincentbernat

@gloaming
Copy link
Contributor Author

gloaming commented Jul 10, 2019

@flokli Yes, we should do that for sure. I still think upstream should fix the other ordering, but once we fix dhcpcd.service it won't really matter.

I've retested that ordering against this morning's release-19.03 and it looks good. Shall I file a new PR and close this one?

@flokli
Copy link
Contributor

flokli commented Jul 10, 2019 via email

@gloaming
Copy link
Contributor Author

Closed in favour of #64621.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants