Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/boot: rename network interfaces already in stage 1 #39329

Closed
wants to merge 9 commits into from

Conversation

xeji
Copy link
Contributor

@xeji xeji commented Apr 22, 2018

Motivation for this change

Fix #39069. When networking.usePredictableInterfaceNames = true, rename network interfaces already in stage 1 boot to avoid a race condition between interface renaming and configuration.
This also ensures consistent interface names between stage 1 (initrd) and fully booted system.


###### Things done

<!-- Please check what applies. Note that these are not hard requirements but merely serve as information for reviewers. -->

- [x] Tested using sandboxing ([nix.useSandbox](http://nixos.org/nixos/manual/options.html#opt-nix.useSandbox) on NixOS, or option `build-use-sandbox` in [`nix.conf`](http://nixos.org/nix/manual/#sec-conf-file) on non-NixOS)
- Built on platform(s)
   - [x] NixOS
   - [ ] macOS
   - [ ] other Linux distributions
- [x] Tested via one or more NixOS test(s) if existing and applicable for the change (look inside [nixos/tests](https://github.com/NixOS/nixpkgs/blob/master/nixos/tests))
- [ ] Tested compilation of all pkgs that depend on this change using `nix-shell -p nox --run "nox-review wip"`
- [ ] Tested execution of all binary files (usually in `./result/bin/`)
- [ ] Fits [CONTRIBUTING.md](https://github.com/NixOS/nixpkgs/blob/master/.github/CONTRIBUTING.md).

---

@oxij
Copy link
Member

oxij commented Apr 22, 2018

Nice. LGTM in principle, but in practice enabling this will break a lot of configs. E.g. I use initrd networking it a bunch of NixOS configs and applying this would break those configs silently until the next reboot, which would fail to boot.

I think NixOS at the very least needs configVersion (like stateVersion) to merge this safely. And it just so happens that ATM I'm trying to push configVersion to NixOS, but it depends on #39048 (and the next step is https://github.com/oxij/nixpkgs/commits/nixos/warn-missing-stateversion, and only then the configVersion patchset).

@xeji
Copy link
Contributor Author

xeji commented Apr 22, 2018 via email

@oxij
Copy link
Member

oxij commented Apr 22, 2018 via email

@xeji
Copy link
Contributor Author

xeji commented Apr 22, 2018 via email

@xeji xeji changed the title nixos/boot: rename network interfaces already in stage 1 [WIP] nixos/boot: rename network interfaces already in stage 1 Apr 22, 2018
@xeji
Copy link
Contributor Author

xeji commented Apr 22, 2018

Non-breaking interim fix in #39340. Marked this as [WIP] until we can safely get it into 18.09.

@globin
Copy link
Member

globin commented Aug 25, 2018

@xeji any updates on this?

also cc release managers @vcunat @samueldr, since this is the real fix to workarounds for predictable interface naming now not only for networkd but also when using dhcpcd (see #45421)
and cc @fpletz who is working on improving networking in general

@xeji
Copy link
Contributor Author

xeji commented Aug 25, 2018

I would like to get this ready now - shouldn't take long, mainly adapt to systemd 239 and test.

Before proceeding we still need to address concerns about breaking existing configs, see discussion above. In the worst case, servers using initrd networking could become unreachable after update.

I see two options:

  1. accept this breakage and document it in the release notes
  2. make this change conditional on stateVersion - safer but leads to uglier code. (@oxij mentioned a configVersion mechanism but I don't think that has been implemented yet)

How should be proceed here?

@samueldr samueldr added this to the 18.09 milestone Aug 25, 2018
@samueldr
Copy link
Member

In both cases we need to document this in the release notes. The choice is now:

  1. Accept this breakage
  2. Make conditional on stateVersion

As this realistically could break boot, without resort (hey, it's been a week of uptime, I nix-collect-garbage -d already my older boot options), I think gating behind stateVersion would be better.

At least the change isn't too invasive, and if one would like to keep stateVersion < 18.09, they could realistically apply the same change locally if the predictable naming is desired; ${config.boot.initrd.extraUdevRulesCommands} even looks like a workable place to put the fix locally.

@xeji
Copy link
Contributor Author

xeji commented Aug 25, 2018

Made the change conditional on stateVersion and added documentation. First tests look ok, but needs more testing.

@xeji
Copy link
Contributor Author

xeji commented Aug 25, 2018

We could go a step further and dump our old non-standard 80-net-setup-link.rules in favor of the standard systemd udev rule, which uses an udev builtin to rename interfaces to predictable names based on the net.ifnames kernel parameter. Then usePredictableInterfaceNames becomes essentially an alias for that kernel parameter. But this will make it difficult to support the previous behavior for older configs, so I think we should do that maybe in 19.03 when everyone has had a chance to update their configs.

for changed behavior when stateVersion >= 18.09
- test new implementation with system.stateVersion = 18.09
- enable initrd networking to see interface names in logs
not necessary if stateVersion >= 18.09
as devices are already renamed in stage 1
if networking.usePredictableInterfaceNames=true
and system.stateVersion >= 18.09
because now interfaces are already renamed in stage 1
@GrahamcOfBorg
Copy link

Timed out, unknown build status on aarch64-linux (full log)

Attempted: tests.initrd-network-ssh.predictable

Partial log (click to expand)

client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
building of '/nix/store/081zagf5asipv62a26vwjylhjykzr4wz-vm-test-run-initrd-network-sshpredictable.drv' timed out after 3600 seconds
error: build of '/nix/store/081zagf5asipv62a26vwjylhjykzr4wz-vm-test-run-initrd-network-sshpredictable.drv' failed

@samueldr
Copy link
Member

samueldr commented Aug 27, 2018

Running on the aarch64 community box I am seeing issues, and they seem consistent.

Partial output of nix-build nixos/tests/initrd-network-ssh/default.nix -A predictable
server#
server# <<< NixOS Stage 1 >>>
server#
server# loading module virtio_balloon...
server# modprobe: FATAL: Module virtio_balloon not found in directory /lib/modules/4.14.66
server# loading module virtio_console...
server# modprobe: FATAL: Module virtio_console not found in directory /lib/modules/4.14.66
server# loading module virtio_rng...
server# [    1.824348] random: fast init done
server# [    1.828758] random: crng init done
server# loading module dm_mod...
server# [    1.850868] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: dm-devel@redhat.com
server# loading module af_packet...
server# modprobe: FATAL: Module af_packet not found in directory /lib/modules/4.14.66
server# running udev...
server# [    1.886406] systemd-udevd[141]: starting version 239
server# [    1.898005] usb 1-1: new high-speed USB device number 2 using ehci-pci
server# [    2.033234] virtio_net virtio0 enp0s1: renamed from eth0
server# [    2.075432] input: QEMU QEMU USB Keyboard as /devices/platform/3f000000.pcie/pci0000:00/0000:00:08.0/usb1/1-1/1-1:1.0/0003:0627:0001.0001/input/input1
server# [    2.086893] virtio_net virtio1 enp0s2: renamed from eth1
server# [    2.162322] hid-generic 0003:0627:0001.0001: input: USB HID v1.11 Keyboard [QEMU QEMU USB Keyboard] on usb-0000:00:08.0-1/input0
server# kbd_mode: KDSKBMODE: Inappropriate ioctl for device
server# %Gipconfig: ens4: SIOCGIFINDEX: No such device
server# ipconfig: no devices to configure
server# bringing up network interface enp0s1...
server# bringing up network interface enp0s2...
server# bringing up network interface lo...
server# [    2.306066] usb 1-2: new high-speed USB device number 3 using ehci-pci
server# acquiring IP address via DHCP...
server# udhcpc: SIOCGIFINDEX: No such device
server# [    2.473022] input: QEMU QEMU USB Tablet as /devices/platform/3f000000.pcie/pci0000:00/0000:00:08.0/usb1/1-2/1-2:1.0/0003:0627:0001.0002/input/input2
server# [    2.489920] hid-generic 0003:0627:0001.0002: input: USB HID v0.01 Mouse [QEMU QEMU USB Tablet] on usb-0000:00:08.0-2/input0
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
client: exit status 1
client: running command: ping -c 1 server
error: interrupted by the user

-A unpredictable is fine though.


I got a "complete" output here, though unsurprisingly it matches with ofborg's output (and IIRC they do run on the same hardware). So it's not strictly because of ofborg, but either the configuration of the specific machine or some random luck factor?

@xeji
Copy link
Contributor Author

xeji commented Aug 27, 2018

I'm speculating, but this may be caused by older versions of NixOS/nix/qemu on different builders.

I will try to make the predictable-interface-names test more robust by grepping for a pattern instead of a hardcoded interface name. That should work around this issue and fix it on aarch64.

The initrd-networking-ssh.predictable test cannot be fixed this way because the expression must hardcode the interface name to start initrd networking. It will always fail on builders where that name is different. But that's a separate issue regarding uniform setup of our CI infrastructure and/or reproducible results of our qemu-based test framework - out of scope of this PR.

@edolstra
Copy link
Member

This should not depend on stateVersion because that's not what stateVersion is for. stateVersion is for cases where there is mutable state on disk (like PostgreSQL databases) that cannot be migrated easily. It does not mean "change option defaults to whatever was the case in NixOS version ".

@xeji
Copy link
Contributor Author

xeji commented Aug 27, 2018

@edolstra then what mechanism do you suggest for preventing breakage of existing configs?

@samueldr
Copy link
Member

samueldr commented Sep 2, 2018

@edolstra other than stateVersion what can be used now to prevent breakage from this change?

Though, maybe in the end there are no risks? Would the system collect the currently booted configuration? There should be references still existing keeping it alive, right? Still annoying to have a broken boot though.

@xeji
Copy link
Contributor Author

xeji commented Sep 2, 2018

This isn't ready yet, the tests still needs a little fine tuning (see above discussion). I suspended working on it until there's consensus on stateVersion or another approach.

@mmahut
Copy link
Member

mmahut commented Aug 7, 2019

Any updates on this pull request, please?

@matthewbauer matthewbauer modified the milestones: 19.03, 20.03 Aug 28, 2019
@emilazy
Copy link
Member

emilazy commented Sep 16, 2019

The nixos/modules/services/hardware/80-net-setup-link.rules file that this copies to initrd has been deleted as of 1f03f6f; what's the correct way forward to have this behaviour in initrd now that it has apparently been obsoleted? The commit message claims that the .rules files are now obsolete and all the renaming is done in C in udev, but my initrd seems to still use eth0 and then the device renaming fails after boot:

Sep 16 14:04:39 patchouli systemd-udevd[1680]: eth0: Failed to rename network interface 2 from 'eth0' to 'enp4s0': Device or resource busy

cc @xeji @andir (author of the commit) @erikarvstedt (author of similar PR #47664)

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/predictable-network-interface-names-in-initrd/4055/1

arianvp added a commit to arianvp/nixpkgs that referenced this pull request Sep 17, 2019
Otherwise we run into issues that udev cannot rename the interface in
stage-2 because the device is busy.

See :
* NixOS#39329
* https://discourse.nixos.org/t/predictable-network-interface-names-in-initrd/4055
* NixOS@1f03f6f
wizeman pushed a commit to wizeman/nixpkgs that referenced this pull request Oct 11, 2019
Otherwise we run into issues that udev cannot rename the interface in
stage-2 because the device is busy.

See :
* NixOS#39329
* https://discourse.nixos.org/t/predictable-network-interface-names-in-initrd/4055
* NixOS@1f03f6f
@disassembler disassembler modified the milestones: 20.03, 20.09 Feb 5, 2020
@flokli
Copy link
Contributor

flokli commented Mar 27, 2020

I think this fix is included in #79532. @xeji, @oxij can you take a look?

@stale
Copy link

stale bot commented Sep 23, 2020

Hello, I'm a bot and I thank you in the name of the community for your contributions.

Nixpkgs is a busy repository, and unfortunately sometimes PRs get left behind for too long. Nevertheless, we'd like to help committers reach the PRs that are still important. This PR has had no activity for 180 days, and so I marked it as stale, but you can rest assured it will never be closed by a non-human.

If this is still important to you and you'd like to remove the stale label, we ask that you leave a comment. Your comment can be as simple as "still important to me". But there's a bit more you can do:

If you received an approval by an unprivileged maintainer and you are just waiting for a merge, you can @ mention someone with merge permissions and ask them to help. You might be able to find someone relevant by using Git blame on the relevant files, or via GitHub's web interface. You can see if someone's a member of the nixpkgs-committers team, by hovering with the mouse over their username on the web interface, or by searching them directly on the list.

If your PR wasn't reviewed at all, it might help to find someone who's perhaps a user of the package or module you are changing, or alternatively, ask once more for a review by the maintainer of the package/module this is about. If you don't know any, you can use Git blame on the relevant files, or GitHub's web interface to find someone who touched the relevant files in the past.

If your PR has had reviews and nevertheless got stale, make sure you've responded to all of the reviewer's requests / questions. Usually when PR authors show responsibility and dedication, reviewers (privileged or not) show dedication as well. If you've pushed a change, it's possible the reviewer wasn't notified about your push via email, so you can always officially request them for a review, or just @ mention them and say you've addressed their comments.

Lastly, you can always ask for help at our Discourse Forum, or more specifically, at this thread or at #nixos' IRC channel.

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Sep 23, 2020
@flokli
Copy link
Contributor

flokli commented Sep 23, 2020

No reaction after #39329 (comment), and this is probably fixed, so closing.

@flokli flokli closed this Sep 23, 2020
@xeji xeji deleted the fix-39069 branch February 17, 2022 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nixos/networkd: fails non-deterministically with predictable interface names