Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wireguard: 0.0.20190406 -> 0.0.20190531 and Change peers without tearing down the interface, handle DNS failures better #62325

Merged
merged 3 commits into from May 31, 2019

Conversation

grahamc
Copy link
Member

@grahamc grahamc commented May 31, 2019

Motivation for this change

In #30459 @sjau reported a problem with starting up a system with WireGuard where peers used DNS. On boot, the system would start the WireGuard interface, and attempt to add a peer by hostname. This would fail because the system wasn't able to resolve domains yet.

In #61971 sjau fixed this bug by having the wg startup unit restart until it worked, but this caused @zx2c4 great heartache.

Relatedly, having the entire interface and all peers disappear because one peer failed to resolve is not a nice or acceptable failure mode. This is related to another problem: changing a single peer causes the interface to go away and be restarted from scratch. This isn't necessary, either :)

  • Commit 1 changes the unit to create a systemd unit per peer
  • Commit 2 updates wireguard from 0.0.20190406 -> 0.0.20190531 which adds a flag to fix the original bug report -- by letting resolution retry forever
  • Commit 3 takes advantage of that flag

cc @Mic92

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

Before, changing any peers caused the entire WireGuard interface to
be torn down and rebuilt. By configuring each peer in a separate
service we're able to only restart the affected peers.

Adding each peer individually also means individual peer
configurations can fail, but the overall interface and all other peers
will still be added.

A WireGuard peer's internal identifier is its public key. This means
it is the only reliable identifier to use for the systemd service.
@@ -265,25 +316,8 @@ let
wg set ${name} private-key ${privKey} ${
optionalString (values.listenPort != null) " listen-port ${toString values.listenPort}"}

${concatMapStringsSep "\n" (peer:
assert (peer.presharedKeyFile == null) || (peer.presharedKey == null); # at most one of the two must be set
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assert has been moved down in to the assertions, which provides a reasonable error message than "assertion failed" :P

@grahamc
Copy link
Member Author

grahamc commented May 31, 2019

@GrahamcOfBorg build wireguard-tools
@GrahamcOfBorg test wireguard
@GrahamcOfBorg test wireguard-generated

Copy link
Contributor

@flokli flokli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Also switched over my system to this PR.

@flokli flokli merged commit b3dd39c into NixOS:master May 31, 2019
@grahamc grahamc deleted the wireguard-master branch May 31, 2019 22:31
@flokli
Copy link
Contributor

flokli commented May 31, 2019

We probably either want to revert ced7cfc in 19.03, or backport this PR to 19.03. I guess doing the former makes more sense.

@Mic92
Copy link
Member

Mic92 commented Jun 1, 2019

I think backporting all wireguard snapshots makes more sense. This is what I used to do in the past.

@flokli
Copy link
Contributor

flokli commented Jun 1, 2019 via email

@zx2c4
Copy link
Contributor

zx2c4 commented Jun 1, 2019

Sorry guys, you might want to bump again to 0.0.20190601. 0531 has some issues on Intel Atom.

grahamc added a commit to grahamc/nixpkgs that referenced this pull request Jun 1, 2019
@grahamc grahamc mentioned this pull request Jun 1, 2019
10 tasks
Copy link
Member

@Ma27 Ma27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my late response, just had a deeper look at this and IMHO this is way better now, so thanks a lot @grahamc @zx2c4!

let
keyToUnitName = replaceChars
[ "/" "-" " " "+" "=" ]
[ "-" "\\x2d" "\\x20" "\\x2b" "\\x3d" ];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you didn't use utils.escapeSystemdPath here?

@sjau
Copy link

sjau commented Jun 3, 2019

Unfortunately that seems to prevent to start client interfaces properly. Having it applied, it did properly start the server interface, but several client interfaces do show up as started as well but there's no ping/connection possible.

wg show does show it as up.
ip addr show does list the interface
But when pinging the wg server through that wg client interface results in nothing and also ssh doesn't work.

interface: wg_hb
  public key: [....]
  private key: (hidden)
  listening port: 38581
7: wg_hb: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 10.100.200.2/32 scope global wg_hb
       valid_lft forever preferred_lft forever
● wireguard-wg_hb.service - WireGuard Tunnel - wg_hb
   Loaded: loaded (/nix/store/lamj7yglpzp1h83yx7yxb5nnvzq4qivh-unit-wireguard-wg_hb.service/wireguard-wg_hb.service; enabled; vendor preset: enabled)
   Active: active (exited) since Mon 2019-06-03 06:39:16 CEST; 12min ago
  Process: 4243 ExecStart=/nix/store/4gjzpinpamy8zs0xdbnmdafhvd7hrfpm-unit-script-wireguard-wg_hb-start (code=exited, status=0/SUCCESS)
 Main PID: 4243 (code=exited, status=0/SUCCESS)

Jun 03 06:39:13 servi systemd[1]: Starting WireGuard Tunnel - wg_hb...
Jun 03 06:39:16 servi systemd[1]: Started WireGuard Tunnel - wg_hb.
ping 10.100.200.7 
PING 10.100.200.7 (10.100.200.7) 56(84) bytes of data. ^C               
--- 10.100.200.7 ping statistics ---                                                                                         
10 packets transmitted, 0 received, 100% packet loss, time 9227ms

And on the server side:

interface: wg_hb
  public key: [...]
  private key: (hidden)
  listening port: 51820

[...]

peer: [...]
  endpoint: [...]:51115
  allowed ips: 10.100.200.2/32
  latest handshake: 1 hour, 11 minutes, 53 seconds ago
  transfer: 39.62 MiB received, 2.60 GiB sent

That was before I run nixos-rebuild boot --upgrade with Unstable Small channel and then did reboot.

Edit:

It seems the problem is with the generated unit file. If it's a wg client, there doesn't seem any reference to the wg server in it:

/nix/store/4gjzpinpamy8zs0xdbnmdafhvd7hrfpm-unit-script-wireguard-wg_hb-start

#! /nix/store/x0mlaj4z4ciycaycfwc36l1932mwywfj-bash-4.4-p23/bin/bash -e
modprobe wireguard



ip link add dev wg_hb type wireguard

ip address add 10.100.200.2/32 dev wg_hb

wg set wg_hb private-key /nix/store/83pnx2jmkdx6y0qy28c2glzhbypcgkjf-wg-key 

ip link set up dev wg_hb

On a different machine, that isn't upgraded yet to latest code, the generated unit file looks like this:

#! /nix/store/x0mlaj4z4ciycaycfwc36l1932mwywfj-bash-4.4-p23/bin/bash -e
modprobe wireguard



ip link add dev wg_home type wireguard

ip address add 10.200.0.2/32 dev wg_home

wg set wg_home private-key /nix/store/8z5f4b0b9rdwjzzga1i0015m0ni7a1xx-wg-key 

wg set wg_home peer [...] endpoint [....]:51820 persistent-keepalive 25 allowed-ips 10.200.0.0/24

ip link set up dev wg_home

ip route replace 10.200.0.0/24 dev wg_home table main

So, the difference is that wg set ... peer .... and ìp route .... are missing.

@sjau
Copy link

sjau commented Jun 3, 2019

It actually seems the two commands wg set ... peer .... and ip route ... are created but in a peer unit file and that unit file isn't properly started. See:

[08:56] <clever> have you tried restarting the peer unit?
[08:56] <hyper_ch2> systemctl restart wireguard-wg_ons --> same result
[08:57] <clever> the peer unit, not that one
[08:57] <hyper_ch2> I have no idea how to do that
[08:57] <clever> systemctl restart wireguard-wg_ons-peer-enDIyyZaperJVAk-GxsTEHx-VlxCEZ9PM1uCvoO-Km8\x3d.service
[08:57] <clever> didnt copy the whole thing, oops
[08:58] <hyper_ch2> Failed to restart wireguard-wg_ons-peer-enDIyyZaperJVAk-GxsTEHx-VlxCEZ9PM1uCvoO-Km8x3d.service: Unit wireguard-wg_ons-peer-enDIyyZaperJVAk-GxsTEHx-VlxCEZ9PM1uCvoO-Km8x3d.service not found.
[08:58] <hyper_ch2> the backslash in there is really, really annoying
[08:58] <clever> try \\
[08:58] <clever> and i agree
[08:59] <hyper_ch2> clever: now it works
[...]
[08:59] <hyper_ch2> still, after reboot, it doesn't work anymore
[09:00] <clever> hyper_ch2: sounds like the peer units just need more wantedby flags, which would be a bug
[...]
[09:08] <hyper_ch2> clever: so, what's the solution?
[09:09] <clever> hyper_ch2: the peer units likely need to be improved, to auto-run on bootup

@sjau
Copy link

sjau commented Jun 3, 2019

It seem the peer unit files are properly created - but not called by the current unit files:

cd /etc/systemd/sytem
ls -l *wireguard*

results in:

wireguard-wg_hb-peer-quHqlmRZb2J\x2bg2V5GgOXqERaVLYjx5vAUMgYLIgqo2U\x3d.service
wireguard-wg_hb.service
wireguard-wg_home-peer-aSzrILnaPQJ5lIMekIvGi6tsTQqgluazvhcjcwFQsFg\x3d.service
wireguard-wg_home-peer-b7m83\x2bqXz1LAGPVRrWeJ0c0AQYEKRb4VL67w1OOliVw\x3d.service
wireguard-wg_home-peer-CXgaSD\x2b4zhH2aHwCxMsAeCQqkWsz\x2bf95m10bWeBhNA8\x3d.service
wireguard-wg_home-peer-fUKlG8Rb0eHSUJYVbkHtrWuX7\x2b6XM0u\x2b5NLmygCcVBA\x3d.service
wireguard-wg_home-peer--LaoVQ2sjnk66d2RQiOn1ImMBxcMDnz7o1Virq2hLRk\x3d.service
wireguard-wg_home-peer-lcYd1Q2yX-UHW08fqI-AWJLK-wUzj5meJ0q3F8zlQiE\x3d.service
wireguard-wg_home-peer-pRuFE4CYInp7VHQre66I18brnJTKye\x2bYDXS8iQF-zDc\x3d.service
wireguard-wg_home.service
wireguard-wg_jl-peer-pRuFE4CYInp7VHQre66I18brnJTKye\x2bYDXS8iQF-zDc\x3d.service
wireguard-wg_jl.service
wireguard-wg_ons-peer-enDIyyZaperJVAk-GxsTEHx-VlxCEZ9PM1uCvoO-Km8\x3d.service
wireguard-wg_ons.service

Looking at wireguard-wg_hb.service it returns:

[Unit]
After=network.target network-online.target
Description=WireGuard Tunnel - wg_hb
Requires=network-online.target

[Service]
Environment="DEVICE=wg_hb"
Environment="LOCALE_ARCHIVE=/nix/store/wzrv1xk177p4zazzv26288c936375c23-glibc-locales-2.27/lib/locale/locale-archive"
Environment="PATH=/nix/store/yi7zbr0fs6622j56iwfi4446cg7k2f5j-kmod-26/bin:/nix/store/awszgbq1w4fqkg4slkl95544sg20v50v-iproute2-5.1.0/bin:/nix/store/a45vrxj0rybi9pagdx914j4wdpvpz480-wireguard-tools-0.0.20190601/bin:/nix/store/bv1lw6a2kw0mn2y3lxhi43180idx6sp9-coreutils-8.31/bin:/nix/store/s1n4vl1f3in3nacalrc3xam0vyzpsfvs-findutils-4.6.0/bin:/nix/store/7d9bi31h40hky30f5scqx7r6wn311ain-gnugrep-3.3/bin:/nix/store/qg4qbkbca7qapfzpa8p991yjf944fc3w-gnused-4.7/bin:/nix/store/v5zabkifl190b69lp92wsssn8zdp8370-systemd-239.20190219/bin:/nix/store/yi7zbr0fs6622j56iwfi4446cg7k2f5j-kmod-26/sbin:/nix/store/awszgbq1w4fqkg4slkl95544sg20v50v-iproute2-5.1.0/sbin:/nix/store/a45vrxj0rybi9pagdx914j4wdpvpz480-wireguard-tools-0.0.20190601/sbin:/nix/store/bv1lw6a2kw0mn2y3lxhi43180idx6sp9-coreutils-8.31/sbin:/nix/store/s1n4vl1f3in3nacalrc3xam0vyzpsfvs-findutils-4.6.0/sbin:/nix/store/7d9bi31h40hky30f5scqx7r6wn311ain-gnugrep-3.3/sbin:/nix/store/qg4qbkbca7qapfzpa8p991yjf944fc3w-gnused-4.7/sbin:/nix/store/v5zabkifl190b69lp92wsssn8zdp8370-systemd-239.20190219/sbin"
Environment="TZDIR=/nix/store/62wblqsm4pfs5yqsi92g82lwmspgyprh-tzdata-2019a/share/zoneinfo"



ExecStart=/nix/store/4gjzpinpamy8zs0xdbnmdafhvd7hrfpm-unit-script-wireguard-wg_hb-start 
ExecStopPost=/nix/store/2a0jyj2zkgr7gllr63mmxd44n08v5034-unit-script-wireguard-wg_hb-post-stop
RemainAfterExit=true
Type=oneshot

Looking at the ExecStart file /nix/store/4gjzpinpamy8zs0xdbnmdafhvd7hrfpm-unit-script-wireguard-wg_hb-start, we'll get this output:

#! /nix/store/x0mlaj4z4ciycaycfwc36l1932mwywfj-bash-4.4-p23/bin/bash -e
modprobe wireguard



ip link add dev wg_hb type wireguard

ip address add 10.100.200.2/32 dev wg_hb

wg set wg_hb private-key /nix/store/83pnx2jmkdx6y0qy28c2glzhbypcgkjf-wg-key 

ip link set up dev wg_hb

but as said before, two instructions for the peer (in this case the server peer) are missing. However, in /etc/systemd/system we also have the peer unit file for the wg_hb interface: wireguard-wg_hb-peer-quHqlmRZb2J\x2bg2V5GgOXqERaVLYjx5vAUMgYLIgqo2U\x3d.service with this content:

[Unit]
After=wireguard-wg_hb.service
Description=WireGuard Peer - wg_hb - [...]
Requires=wireguard-wg_hb.service

[Service]
Environment="DEVICE=wg_hb"
Environment="LOCALE_ARCHIVE=/nix/store/wzrv1xk177p4zazzv26288c936375c23-glibc-locales-2.27/lib/locale/locale-archive"
Environment="PATH=/nix/store/awszgbq1w4fqkg4slkl95544sg20v50v-iproute2-5.1.0/bin:/nix/store/a45vrxj0rybi9pagdx914j4wdpvpz480-wireguard-tools-0.0.20190601/bin:/nix/store/bv1lw6a2kw0mn2y3lxhi43180idx6sp9-coreutils-8.31/bin:/nix/store/s1n4vl1f3in3nacalrc3xam0vyzpsfvs-findutils-4.6.0/bin:/nix/store/7d9bi31h40hky30f5scqx7r6wn311ain-gnugrep-3.3/bin:/nix/store/qg4qbkbca7qapfzpa8p991yjf944fc3w-gnused-4.7/bin:/nix/store/v5zabkifl190b69lp92wsssn8zdp8370-systemd-239.20190219/bin:/nix/store/awszgbq1w4fqkg4slkl95544sg20v50v-iproute2-5.1.0/sbin:/nix/store/a45vrxj0rybi9pagdx914j4wdpvpz480-wireguard-tools-0.0.20190601/sbin:/nix/store/bv1lw6a2kw0mn2y3lxhi43180idx6sp9-coreutils-8.31/sbin:/nix/store/s1n4vl1f3in3nacalrc3xam0vyzpsfvs-findutils-4.6.0/sbin:/nix/store/7d9bi31h40hky30f5scqx7r6wn311ain-gnugrep-3.3/sbin:/nix/store/qg4qbkbca7qapfzpa8p991yjf944fc3w-gnused-4.7/sbin:/nix/store/v5zabkifl190b69lp92wsssn8zdp8370-systemd-239.20190219/sbin"
Environment="TZDIR=/nix/store/62wblqsm4pfs5yqsi92g82lwmspgyprh-tzdata-2019a/share/zoneinfo"
Environment="WG_ENDPOINT_RESOLUTION_RETRIES=infinity"



ExecStart=/nix/store/nsg799adgzjnahhzch1a3k4r4mp9kkb1-unit-script-wireguard-wg_hb-peer-quHqlmRZb2J--x2bg2V5GgOXqERaVLYjx5vAUMgYLIgqo2U--x3d-start 
ExecStopPost=/nix/store/1pjhjl6xbf40vxkjs4zkligavmrqz0xp-unit-script-wireguard-wg_hb-peer-quHqlmRZb2J--x2bg2V5GgOXqERaVLYjx5vAUMgYLIgqo2U--x3d-post-stop
RemainAfterExit=true
Type=oneshot

Looking there at the ExecStart file /nix/store/nsg799adgzjnahhzch1a3k4r4mp9kkb1-unit-script-wireguard-wg_hb-peer-quHqlmRZb2J--x2bg2V5GgOXqERaVLYjx5vAUMgYLIgqo2U--x3d-start it contains:

#! /nix/store/x0mlaj4z4ciycaycfwc36l1932mwywfj-bash-4.4-p23/bin/bash -e
wg set wg_hb peer [...] endpoint [...]:51820 persistent-keepalive 25 allowed-ips 10.100.200.0/24
ip route replace 10.100.200.0/24 dev wg_hb table main

So that's the missing part. When running those two commands, the tunnel now works.

But no idea where the peer unit file also should be called from

@grahamc
Copy link
Member Author

grahamc commented Jun 3, 2019

I can't reproduce your bug report, @sjau. Please ping me on IRC when you're able to debug further.

@grahamc
Copy link
Member Author

grahamc commented Jun 3, 2019

FWIW here is the test case I wrote to try and reproduce:

import ../make-test.nix ({ lib, pkgs, ...} : let
  globalPeers = [
    {
      idx = 1;
      private = "QHWxUiMMnCNdYlUUowXOhwA9dqFj0mPW+hlrfL8+33U=";
      public = "vcLLUIzLkTuv60vV7+nMsWUDSh+R+g+YR2Jn+Kjlc1g=";
    }
    {
      idx = 2;
      private = "yEhxYq3jxZMXMBCofjGNG+GomUmrwVUHvWvuFRWC2HU=";
      public = "AMYt7rajfDZfupU8+m/uQ8C+1R++zBrI+Z8wuQylXio=";
    }
    {
      idx = 3;
      private = "mDXJoAJmFaERpaP6Wvll/FqoGTxZA0CgnL68dFs4XGM=";
      public = "UWBRfv+VsUOLIi5iD++Zurl4wx+S+A6r39flcI3jyhA=";
    }
    {
      idx = 4;
      private = "gDINget094ZZNd2AV1lFxi3wOxbrtG74oZ/7FOEY30Q=";
      public = "/aqhS+qrcRMQWdCQjnnc++hggRqU+4s26EEVlnsFcTA=";
    }
    {
      idx = 5;
      private = "SBil1s3P9Uam6zJTdm/GOjgozaoVPha1ThS49OGbO0E=";
      public = "+l2Q+6Ap7y2MavVR1SR+qMQQUSp4shwNeFjcodhr+lw=";
    }
  ];

  peersFor = self: lib.remove self globalPeers;
in {
  name = "wireguard-interesting-keys";
  meta = with pkgs.stdenv.lib.maintainers; {
    maintainers = [ ma27 grahamc ];
  };

  nodes = lib.listToAttrs (map (self: {
    name = "peer${toString self.idx}";
    value = {
      networking.useDHCP = false;
      networking.interfaces.eth1 = {
        ipv4.addresses = lib.singleton {
          address = "192.168.0.${toString self.idx}";
          prefixLength = 24;
        };
      };

      networking.wireguard.interfaces.wg0 = {
        ips = [ "10.10.10.${toString self.idx}/24" ];
        listenPort = 12345;
        privateKey = self.private;
        peers = map (peer: {
          allowedIPs = [ "10.10.10.${toString peer.idx}/32" ];
          endpoint = "192.168.0.${toString peer.idx}:12345";
          publicKey = peer.public;
          persistentKeepalive = 1;
        }) (peersFor self);
      };
    };
  }) globalPeers);

  testScript = ''
    startAll;

    # Wait for wireguard-wg0 to be up
    ${lib.concatMapStringsSep "\n" (self: "$peer${toString self.idx}->waitForUnit(\"wireguard-wg0\");") globalPeers}

    ${lib.concatMapStringsSep "\n" (self: "print $peer${toString self.idx}->succeed(\"wg\");") globalPeers}

    # Ping all the peers
    ${lib.concatStringsSep "\n"
    (lib.flatten (map
      (self: map
        (peer: "$peer${toString self.idx}->succeed(\"ping -c1 10.10.10.${toString peer.idx}\");")
        (peersFor self)
      )
      globalPeers
    ))}
  '';
})

My suspicion was your system wasn't working because you had keys started with a /, however this test case does as well.

@sjau
Copy link

sjau commented Jun 20, 2019

On latest nixos unstable it still fails to properly start the wg peers.

And use domain names instead of IP addresses and it will very likely fail for you also becuase of missing dns resolution during boot up. Your test case just won't cover real bootup missing dns.

Simple solution would probably be to change the wg peer type from oneshot to simple with retry on fail. At least when this was the case before, everything worked fine.

@flokli
Copy link
Contributor

flokli commented Jun 25, 2019

@sjau can you open an issue with some more detail on how this fails (like log output of the failing units)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nixos/wireguard: should use WG_ENDPOINT_RESOLUTION_RETRIES
8 participants