Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/network-interfaces-scripted: fix a container networking bug #47252

Merged
merged 1 commit into from Oct 11, 2018

Conversation

xeji
Copy link
Contributor

@xeji xeji commented Sep 23, 2018

Motivation for this change

Fixes #47210: When a bridge interface was reconfigured, running containers using this bridge lost connectivity: restarting network-addresses-brN.service triggered a restart of network-setup.service via a partOf relationship introduced in 07e0c0e. This in turn restarted brN-netdev.service.

The bridge was thus destroyed and recreated with the same name but a new interface id, causing attached veth interfaces to lose their connection.

This change removes the partOf relationship between network-setup.service and network-addresses-brN.service for all bridges so they can be reconfigured without being destroyed and recreated.

I don't see any negative side effects atm, but would kindly ask the the authors of of 07e0c0e (which introduced the bug while fixing NixOS/nixops#640) to have a look: cc @nh2, @domenkozar, @fpletz, @aszlig, @basvandijk

There's also an (uglier) alternative solution without changing the partOf relationships: modify the start/stop scripts to skip deleting and recreating a bridge while there's still a veth interface attached.

Things done

NixOS tests:

  • containers-restart_networking now passes (which failed before due to the bug)
  • all other container tests still pass
  • all networking tests still pass

cc @srhb

When a bridge interface was reconfigured, running containers using
this bridge lost connectivity: restarting network-addresses-brN.service
triggered a restart of network-setup.service via a "partOf" relationship
introduced in 07e0c0e.
This in turn restarted brN-netdev.service.
The bridge was thus destroyed and recreated with the same name but a new
interface id, causing attached veth interfaces to lose their connection.

This change removes the "partOf" relationship between
network-setup.service and network-addresses-brN.service for all bridges.
@xeji
Copy link
Contributor Author

xeji commented Sep 23, 2018

affects #45960, please backport to 18.09.

@nh2
Copy link
Contributor

nh2 commented Sep 23, 2018

I also found that despite our apparent fix, the original problem we intended to fix reappeared: https://github.com/NixOS/nixops/issues/833

We don't know yet why it is.

@xeji
Copy link
Contributor Author

xeji commented Sep 23, 2018

If that's the case, maybe we could just revert 07e0c0e ?

@nh2
Copy link
Contributor

nh2 commented Sep 23, 2018

Well, it did fix some of the occurrences (we verified that), but apparently not all of them.

@nh2
Copy link
Contributor

nh2 commented Sep 23, 2018

Perhaps we'll figure it out for good at NixCon? That'd be awesome.

@xeji
Copy link
Contributor Author

xeji commented Sep 23, 2018

Let's do that 😄

@srhb
Copy link
Contributor

srhb commented Sep 23, 2018

Your analysis sounds and looks very correct, thanks a lot for that @xeji ! I will also happily wait for some more feedback from the original authors. :)

@domenkozar
Copy link
Member

Huh, containers test was failing for one year :)

I admit I don't see the full picture of networking dependencies and what should trigger a restart given changed configuration, but exclusion of bridges does sound like a step forward.

@xeji
Copy link
Contributor Author

xeji commented Oct 11, 2018

I'll merge this now as there were no objections. It's probably not the best possible solution but works for now. We should try to refactor and simplify the networking dependencies eventually.

An alternative would be to just let systemd-networkd handle container networking (it's designed to work well with systemd-nspawn containers out of the box) and drop the scripted setup for containers.

@xeji xeji merged commit e7f67f9 into NixOS:master Oct 11, 2018
@xeji xeji deleted the p/fix-47210 branch October 11, 2018 12:55
@xeji
Copy link
Contributor Author

xeji commented Oct 12, 2018

backported to 18.09: 589d270

@arianvp
Copy link
Member

arianvp commented Nov 11, 2018

@xeji This commit actually causes bridges not to appear after nixos-rebuild switch if they have no preconfigured interfaces that they bridge (which is the case when doing container networking). I think it should be reverted and we need to think of another way to fix this. #50208

@flokli
Copy link
Contributor

flokli commented Jan 2, 2019

Actually, #50208 was a duplicate of #42828, it seems this broke a lot in bridge networking. How should we move forward with this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

containers-restart_networking test fails
7 participants