nixos/systemd.nix: don’t require online for multi-user.target #86273

Merged

matthewbauer
Member

@matthewbauer matthewbauer commented Apr 29, 2020

Not all systems need to be online to boot up. So, don’t pull
network-online.target into multi-user.target. Services that need
online network can still require it.

This decreases my boot time from ~9s to ~5s.

Motivation for this change
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

Not all systems need to be online to boot up. So, don’t pull
network-online.target into multi-user.target. Services that need
online network can still require it.

This increases my boot time from ~9s to ~5s.
@worldofpeace
Contributor

This increases my boot time from ~9s to ~5s.

I think you need to reword this to decreases.

I wonder how many services will get a changed behavior (undesirable) from online not being required for multi-user.target.

@worldofpeace
Contributor

Confirmed my boot was faster with systemd-analyze.

@peterhoeg
Member

I wonder how many services will get a changed behavior (undesirable) from online not being required for multi-user.target.

Some might, but then those services have to be fixed with an explicit dependency. @matthewbauer is doing the right thing here.

@Ma27 Ma27 requested review from flokli and andir April 29, 2020 17:28
Member

@lovesegfault lovesegfault left a comment

Decreased my boot time too

@peterhoeg peterhoeg merged commit 0ae7a68 into NixOS:master Apr 30, 2020
@matthewbauer
Member Author

matthewbauer commented May 1, 2020

It looks like geoclue2 may have an issue with this:

Failed to query location: Error resolving “location.services.mozilla.com”: Name or service not known

I think this can be fixed by a wants = [ "network.target" ] in geoclue-agent.

@peterhoeg
Member

multi-user.target should pull in network.target so it shouldn't be necessary. This fix is actually wrong. It should instead have been: systemd.targets.network.wantedBy = [ "multi-user.target" ];

@flokli
Contributor

flokli commented May 1, 2020

multi-user.target should pull in network.target so it shouldn't be necessary. This fix is actually wrong. It should instead have been: systemd.targets.network.wantedBy = [ "multi-user.target" ];

I don't think so. geoclue2 strictly requires the network to be up, as it stumbles over its feet when doing a DNS lookup on startup. The network-online.target is what should be used in these cases, via Wants=network-online.target - please refer to https://www.freedesktop.org/software/systemd/man/systemd.special.html#network-online.target.

If this is added to geoclue2.service, it'll pull in network-online.target, which will pull in network.target, and if geoclue2.service is part of multi-user.target, multi-user.target will depend on the network-related targets transitively.
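
As a rough illustration of the approach described above, a NixOS module could declare the dependency on the consuming service itself. This is only a sketch: `my-network-client` and its `ExecStart` command are hypothetical placeholders, not the real geoclue unit. Pairing `after` with `wants` follows the systemd documentation linked above, which asks such units to both pull in `network-online.target` and order themselves after it.

```nix
{ pkgs, ... }:
{
  # Hypothetical service that needs a configured network at startup.
  systemd.services.my-network-client = {
    description = "Example service that performs a network request on start";
    wantedBy = [ "multi-user.target" ];
    # Pull network-online.target into the boot transaction ...
    wants = [ "network-online.target" ];
    # ... and wait until it has been reached before starting.
    after = [ "network-online.target" ];
    serviceConfig = {
      Type = "oneshot";
      # Placeholder command; any program that needs DNS or routing would do.
      ExecStart = "${pkgs.curl}/bin/curl --fail https://example.org";
    };
  };
}
```

With this in place, `multi-user.target` reaches the network targets only transitively, through the service that actually needs them.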

We shouldn't manually add network*.target units to multi-user.target, so a NAK on this one.

@andir
Member

andir commented May 1, 2020

This broke a bazillion tests and potentially even more services.
I would like to ask for a bit more caution when doing these kinds of changes and not just looking at the seconds it takes to boot your local system. We can probably get that to zero by just removing all of your services.

@flokli and I will start looking into that now.

@worldofpeace
Contributor

@andir I was thinking the same thing with

I wonder how many services will get a changed behavior (undesirable) from online not being required for multi-user.target.

TBH, I think reverts are more peaceful on master and then recombine the changes.

@flokli
Contributor

flokli commented May 1, 2020

Reverted in 15d761a:

    Revert "nixos/systemd.nix: don’t require online for multi-user.target"
    
    This reverts commit 764c8203b833176d546395a5c1adf193a9ca73f8.
    
    While this is desirable in principle, some of our modules and services
    that fail during service startup if no network is available don't
    currently properly set Wants=network-online.target.
    
    If nothing pulls in this target anymore, systemd won't try to reach it.
    
    We have many VM tests waiting for `network-online.target`, and after
    764c8203b833176d546395a5c1adf193a9ca73f8 fail with the following error
    message:
    
    ```
    error: unit "network-online.target" is inactive and there are no pending jobs
    ```
    
    Most likely, test scripts shouldn't wait for `network-online.target` in
    the first place (as `network-online.target` says nothing about whether a
    service has been started), but instead, the script should wait for the
    network ports of the corresponding service to be open.
    
    Let's revert this for now, and re-apply in a draft PR, fixing the tests
    before merging it back in.
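
The suggestion in the revert message (wait for the service and its port rather than for `network-online.target`) might look roughly like the fragment below. It is a sketch, not taken from any actual test in nixpkgs: `nginx.service` and port 80 are hypothetical stand-ins, and `machine` is assumed to be the name of the test node. `wait_for_unit` and `wait_for_open_port` are methods of the NixOS Python test driver.

```nix
{
  # Fragment of a NixOS VM test definition; only the test script is shown.
  testScript = ''
    # Wait for the service under test instead of network-online.target.
    machine.wait_for_unit("nginx.service")
    # Then wait until the service actually accepts connections.
    machine.wait_for_open_port(80)
  '';
}
```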

@peterhoeg
Member

We shouldn't manually add network*.target units to multi-user.target, so a NAK on this one.

Yep, you're right. I checked upstream and they don't pull it in either.

@peterhoeg
Member

not just looking at the seconds it takes to boot your local system

IMHO, this is not what this is about. The main issue as I see it, is that while we can use upstream's units, the [Install] section is not used and therefore has to be specified in nixos, so over time we are slowly drifting away from what upstream recommends. In this case, there isn't one (geoclue is dbus activated) but the general principle still stands.

@flokli
Contributor

flokli commented May 5, 2020

yeah, I agree it's not ideal, but I also don't see a nicer way to solve this currently.

@gloaming
Contributor

gloaming commented Aug 6, 2020

I'm completely baffled. Of course multi-user.target should want networking; that means simply that to have a basic, usable system (not necessarily a graphical one), the network should be brought online. That's a sane default.

If some systems (or nixos tests) desire a different configuration, they should override it.
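
A per-system override of that kind could be sketched as follows. This assumes the default is expressed in NixOS as `systemd.targets.network-online.wantedBy = [ "multi-user.target" ];`, which is an assumption about the module rather than something quoted in this thread; if the default is wired up differently, a different option would need overriding.

```nix
{ lib, ... }:
{
  # Assumption: the NixOS default adds multi-user.target to this list.
  # mkForce replaces every other definition of the option, so nothing
  # pulls in network-online.target unless a service explicitly wants it.
  systemd.targets.network-online.wantedBy = lib.mkForce [ ];
}
```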

The main issue as I see it, is that while we can use upstream's units, the [Install] section is not used and therefore has to be specified in nixos, so over time we are slowly drifting away from what upstream recommends.

I would be astonished if systemd upstream are recommending that we distribute to our users an operating system distribution that doesn't enable networking by default.

Searching their unit files, I can't find network-online.target, and the docs suggest it's not on by default (huh), but multi-user.target does transitively want network.target: https://github.com/systemd/systemd/blob/cabc1c6d7adae658a2966a4b02a6faabb803e92b/units/systemd-networkd.service.in

The question of whether it should want network.target or network-online.target is, as far as I can tell, a matter of style, since services required to bring up the network should* be wanted by network.target, and any service that orders after network-online.target should also want network-online.target. (If they don't they should be fixed!)

*Well, it looks like that's how their setup works. dhcpcd needs to order after network.target because it needs to wait for wireless. So it wouldn't work for us; that is to say, we aren't using systemd-networkd by default, so the unit ordering they ship is moot.

This decreases my boot time from ~9s to ~5s.

How is everyone measuring this? The output of systemd-analyze? If so, all that's telling you is that it takes less time to reach a smaller set of requirements. Those requirements aren't necessarily making you wait. If your stopwatch is telling you that there's less wall time from pressing the power button to getting your desktop, that's what counts.

The important question is: what services are ordered after network-online.target?
If your desktop session is not here, then your desktop session is not waiting for it.

$ systemctl list-dependencies network-online.target --before 
network-online.target
● ├─noip.service
● ├─multi-user.target
● │ ├─graphical.target
● │ │ └─shutdown.target
● │ └─shutdown.target
● └─shutdown.target

@matthewbauer
Member Author

matthewbauer commented Aug 6, 2020

@gloaming On most systems, you shouldn't have to connect to a network just to reach the login screen. You still need to have the network interfaces up (network.target) though, but that's not what this PR did (network-online.target). If multi-user.target automatically implies network-online.target, what's the point of having network-online.target? From systemd docs:

network-online.target
Note the distinction between this unit and network.target. This unit is an active unit (i.e. pulled in by the consumer rather than the provider of this functionality) and pulls in a service which possibly adds substantial delays to further execution. In contrast, network.target is a passive unit (i.e. pulled in by the provider of the functionality, rather than the consumer) that usually does not delay execution much. Usually, network.target is part of the boot of most systems, while network-online.target is not, except when at least one unit requires it. Also see Running Services After the Network is up for more information.

https://www.freedesktop.org/software/systemd/man/systemd.special.html#network-online.target

Of course, you might still have to connect to the network later on, but that shouldn't block starting X11 / Wayland. 4s on my system might not seem like a lot, but on some worse network cards (like found on rpi0w), or worse connections, this can easily be 10+.

Note that this PR was reverted, since it broke tests which incorrectly assume multi-user.target = network-online.target.

@gloaming
Contributor

gloaming commented Aug 6, 2020

Yeah, it was reverted but there's another PR to bring it back (#86484) so I wanted to raise my concerns in the main discussion here.

Indeed you shouldn't have to wait. But I haven't yet seen any evidence that we are. Setting multi-user.target Wants= network-online.target does not in itself cause the login screen (or anything else) to wait for the network. If you are measuring a real delay here (again, with a clock, not with systemd's numbers) we need to figure out why.

The output of systemd-analyze blame isn't diagnostic here. It shows the complete time taken to reach default.target, but the system can be up and usable minutes before that. Reaching default.target means that all desired services are running, not just those that are required to use the machine.

You want to find the time taken to reach the unit you actually care about, which is the login screen. You could try looking at the output of systemd-analyze critical-chain display-manager.service and systemd-analyze critical-chain systemd-user-sessions.service. That might be more informative.


I think you might be a bit mixed up with requirements and ordering dependencies? The two are completely orthogonal. If you have even a single service that wants network-online, and you want that service in your default.target, then whether multi-user.target wants network-online or not doesn't make any difference. Either a unit is wanted or it isn't. It doesn't matter which units want it or why.

Moreover, a dependency cannot cause a delay; only an ordering can cause a delay. network-online cannot delay X11 unless X11 is ordered after it. On my system that is not the case. If it is on yours, please post the output of the commands I've suggested so we can see why.
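
To make the distinction concrete, here is a small sketch with two hypothetical services (the names and the no-op command are placeholders). The first only declares a requirement, the second only an ordering.

```nix
{ pkgs, ... }:
{
  # Requirement only: network-online.target gets pulled into the boot
  # transaction, but this service may start in parallel with it.
  systemd.services.example-wants-only = {
    wantedBy = [ "multi-user.target" ];
    wants = [ "network-online.target" ];
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${pkgs.coreutils}/bin/true";
    };
  };

  # Ordering only: this service waits for network-online.target if
  # something else pulled the target in, but does not pull it in itself.
  systemd.services.example-after-only = {
    wantedBy = [ "multi-user.target" ];
    after = [ "network-online.target" ];
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${pkgs.coreutils}/bin/true";
    };
  };
}
```

Only the second service can be delayed by the network coming up, and only when some other unit wants `network-online.target`.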

this can easily be 10+.

What can? What are you measuring?


Also: It's fine if people want to change the dependency on network-online to a dependency on network, if they like that it gives happy placebo numbers in systemd-analyze blame. Just don't remove it altogether.

@matthewbauer
Member Author

Maybe it was just a placebo! I can't get an actual boot up time improvement from this change. I had been using critical-chain (which defaults to default.target, which is definitely not the thing to look at) + approximate timings, and thought it had a noticeable effect. But now, it seems like it's just changing when "multi-user.target" finishes - not SDDM. I may have had some messed up plymouth settings back then that interfered in some way (it might have had after = [ "multi-user.target" ] in plymouth-quit.target).

There may still be an argument for making multi-user.target happen earlier. We don't have many After = multi-user.target in NixOS, but it may be used in some service files out there.

@gloaming
Contributor

gloaming commented Aug 8, 2020

Cool :)

Hmm, I can't think of any reason any service should order after multi-user.target... However, I can think of a reason for making it happen earlier: so that the boot time numbers don't confuse people 😁

I guess it depends on the use case - if you're running a server, you might well be interested in the "boot time" until the server is reachable from the outside. But it's not what most people would think of as boot time.

So in conclusion - we should probably have multi-user.target after network.target rather than network-online.target, but we need to be careful before making the change.
