wireguard: restart on failure\nAs a oneshot service, if the startup f… #61971

sjau · 2019-05-23T20:28:54Z

Motivation for this change

As a oneshot service, if the startup failed it would never be attempted again. This is problematic when peers' addresses require DNS, because DNS may not be reliably available at the time wireguard starts. Converting this to a simple service with Restart and RestartSec directives allows the start to be retried, but at the cost of losing the oneshot semantics.

See issue: #30459

Things done

Ma27 · 2019-05-25T10:09:53Z

@GrahamcOfBorg test wireguard

As a oneshot service, if the startup failed it would never be attempted again. This is problematic when peer's addresses require DNS. DNS may not be reliably available at the time wireguard starts. Converting this to a simple service with Restart and RestartAfter directives allows the service to be reattempted, but at the cost of losing the oneshot semantics. Signed-off-by: Maximilian Bosch <maximilian@mbosch.me>

Ma27 · 2019-05-25T14:34:58Z

Tested locally and reformatted the commit message 👍

Ma27 · 2019-05-25T14:37:05Z

@sjau thanks!

grahamc · 2019-05-25T14:37:20Z

I think this should probably be backported, what do you think @Ma27?

Ma27 · 2019-05-25T15:48:55Z

You're absolutely right. Tested on top of release-19.03 and backported as ced7cfc.

zx2c4 · 2019-05-26T11:10:35Z

This seems like the wrong approach to me. Shouldn't we instead put the right notions into wg(8) so that hacks like this aren't needed?

zx2c4 · 2019-05-26T11:10:50Z

CC @Mic92

grahamc · 2019-05-28T19:39:28Z

@zx2c4 what would you suggest wg(8) do instead?

grahamc · 2019-05-28T19:40:14Z

It is possible this unit will need to be retried many times for an unknowable amount of time before it succeeds in resolving.

zx2c4 · 2019-05-28T19:42:34Z

Well, right now we do this:

        for (unsigned int timeout = 1000000;;) {
                ret = getaddrinfo(begin, end, &hints, &resolved);
                if (!ret)
                        break;
                timeout = timeout * 3 / 2;
                /* The set of return codes that are "permanent failures". All other possibilities are potentially transient.
                 *
                 * This is according to https://sourceware.org/glibc/wiki/NameResolver which states:
                 *      "From the perspective of the application that calls getaddrinfo() it perhaps
                 *       doesn't matter that much since EAI_FAIL, EAI_NONAME and EAI_NODATA are all
                 *       permanent failure codes and the causes are all permanent failures in the
                 *       sense that there is no point in retrying later."
                 *
                 * So this is what we do, except FreeBSD removed EAI_NODATA some time ago, so that's conditional.
                 */
                if (ret == EAI_NONAME || ret == EAI_FAIL ||
                        #ifdef EAI_NODATA
                                ret == EAI_NODATA ||
                        #endif
                                timeout >= 90000000) {
                        free(mutable);
                        fprintf(stderr, "%s: `%s'\n", ret == EAI_SYSTEM ? strerror(errno) : gai_strerror(ret), value);
                        return false;
                }
                fprintf(stderr, "%s: `%s'. Trying again in %.2f seconds...\n", ret == EAI_SYSTEM ? strerror(errno) : gai_strerror(ret), value, timeout / 1000000.0);
                usleep(timeout);
        }

Do you have any logs that indicate conclusively this is inadequate somehow? I haven't seen any high quality debugging or analysis before this infected bandaid of a patch went in. Let's start with the bad behavior you're seeing. Then we can devise a solution.
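
To put numbers on the loop quoted above, here is a minimal sketch that replays only its timing arithmetic (no DNS calls are made), assuming every lookup fails with a transient error. The names `retry_schedule` and `struct retry_budget` are invented for this illustration and do not exist in wg(8).

```c
#include <stdio.h>

/* Hypothetical helper, not part of wg(8): replays the timing arithmetic of
 * the quoted loop, assuming every getaddrinfo() call fails transiently. */
struct retry_budget {
	unsigned int attempts;        /* lookups made before giving up */
	unsigned int sleeps;          /* pauses taken between lookups */
	unsigned long long total_us;  /* total time spent sleeping, in microseconds */
};

struct retry_budget retry_schedule(void)
{
	struct retry_budget b = { 0, 0, 0 };

	for (unsigned int timeout = 1000000;;) {
		++b.attempts;               /* the lookup that just failed */
		timeout = timeout * 3 / 2;  /* same growth factor as the quoted loop */
		if (timeout >= 90000000)    /* same give-up threshold as the quoted loop */
			break;
		++b.sleeps;
		b.total_us += timeout;      /* stands in for usleep(timeout) */
	}
	return b;
}

/* Convenience printer for the computed budget. */
void print_retry_budget(void)
{
	struct retry_budget b = retry_schedule();

	printf("%u attempts, %u sleeps, %.1f seconds asleep\n",
	       b.attempts, b.sleeps, b.total_us / 1000000.0);
}
```

Under those assumptions the loop makes 12 attempts separated by 11 sleeps and spends about 256 seconds sleeping (a little over four minutes, not counting time spent inside getaddrinfo itself) before it reports failure. An outage that lasts longer than that is the case in which a oneshot unit would fail and stay failed.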

grahamc · 2019-05-28T19:46:30Z

Logs, no, but scenario yes. Imagine a case where you're using wireguard on a hard to reach system with unreliable power and unreliable internet. The power and internet fails. The power comes on, but the internet remains failed for a long time. Eventually the internet comes up. The current behavior in NixOS will allow the tunnel to come back up shortly after.

zx2c4 · 2019-05-28T19:49:34Z

Gotcha, so you'd like the DNS retry to be infinite?

grahamc · 2019-05-28T19:55:42Z

I don't know if that is the right behavior for wg(8), but somewhere -- yes -- there should be an effectively infinite retry.

As an aside, due to the limitations of how wireguard implements DNS resolution (only doing the lookup once, not obeying TTLs, not supporting multiple A / AAAA records), I wonder if it would be better for wireguard to not do it at all. As an end user, I was surprised by these details. However, that is almost certainly not on topic for this thread.

zx2c4 · 2019-05-28T20:54:40Z

You stopped responding on IRC mid-discussion following your latest message, so in case that gets lost, I just coded this up: https://git.zx2c4.com/WireGuard/commit/?id=2f53ee1b81915072b55769f1a61a52b392c11daa

Let me know if that helps the situation here.

sjau · 2019-05-29T02:48:58Z

@zx2c4

You might not remember, but we've tried to debug this problem for a few hours together in #wireguard on freenode.

Mic92 · 2019-05-30T12:57:47Z

nixos/modules/services/networking/wireguard.nix

-          Type = "oneshot";
+          Type = "simple";
+          Restart = "on-failure";
+          RestartSec = "5s";


I think there was a restart limit in systemd or something, which could lead to the service not being restarted indefinitely.

nh2 · 2021-05-16

For completeness, yes, you need

unitConfig = {
  StartLimitIntervalSec = 0; # ensure Restart= is always honoured
};

Mic92 · 2019-05-30T12:59:17Z

The intention of this PR now matches systemd-networkd's behavior for wireguard, which is to retry endpoint resolution indefinitely.

zx2c4 · 2019-05-30T14:30:00Z

No it doesn't. This makes systemd restart the service for any failure.

The correct way to approach this is to keep it as a oneshot, but set the environment variable WG_ENDPOINT_RESOLUTION_RETRIES=infinity on the next snapshot.
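
As an illustration of how a retry limit taken from an environment variable could be interpreted, here is a hypothetical sketch. It is not the code from the linked commit. It assumes the value is either the literal `infinity` or a non-negative decimal integer, and the fallback of 15 is an arbitrary placeholder; `parse_retry_limit` and `may_retry` are names invented for this example.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define RETRIES_INFINITE ULONG_MAX  /* sentinel meaning "never give up" */
#define RETRIES_FALLBACK 15UL       /* placeholder default, an assumption */

/* Hypothetical helper: turn a string such as the value of
 * WG_ENDPOINT_RESOLUTION_RETRIES into a retry limit. */
unsigned long parse_retry_limit(const char *value)
{
	char *end;
	unsigned long n;

	if (!value || !*value)
		return RETRIES_FALLBACK;
	if (!strcmp(value, "infinity"))
		return RETRIES_INFINITE;
	/* Reject signs and leading whitespace, which strtoul would accept. */
	if (*value < '0' || *value > '9')
		return RETRIES_FALLBACK;
	errno = 0;
	n = strtoul(value, &end, 10);
	if (errno || *end != '\0')
		return RETRIES_FALLBACK;
	return n;
}

/* Nonzero while another lookup is allowed after `failures` failed lookups. */
int may_retry(unsigned long failures, unsigned long limit)
{
	return limit == RETRIES_INFINITE || failures <= limit;
}
```

A caller would use it as `parse_retry_limit(getenv("WG_ENDPOINT_RESOLUTION_RETRIES"))`. The point of handling this inside the resolver loop is that only name resolution is retried without bound, while any other failure still fails the oneshot unit.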

zx2c4 · 2019-05-30T14:30:56Z

> You might not remember, but we've tried to debug this problem for a few hours together in #wireguard on freenode.

I'm pretty sure the resolution of whatever discussion we had was not anything resembling this PR, which should be reverted.

sjau · 2019-05-30T16:03:40Z

> I'm pretty sure the resolution of whatever discussion we had was not anything resembling this PR, which should be reverted.

That's right. There was no solution at all when we discussed that, so the solution until a few days ago was:

https://github.com/sjau/nix-expressions/blob/master/wgStartFix.nix

zx2c4 · 2019-05-30T16:11:31Z

Well, now something certainly wrong has been committed. Please take ownership of the issue you've created and move forward with a proper fix.

sjau · 2019-06-13T06:03:45Z

@zx2c4

Just out of curiosity, why did you ban me from #wireguard on freenode?

ofborg bot added 6.topic: nixos 8.has: module (update) 10.rebuild-darwin: 0 10.rebuild-linux: 0 labels May 23, 2019

Ma27 force-pushed the wg_client_start branch from 152ed29 to 1bff53c Compare May 25, 2019 14:34

Ma27 merged commit 5fa9351 into NixOS:master May 25, 2019

sjau deleted the wg_client_start branch May 25, 2019 20:32

sjau mentioned this pull request May 26, 2019

Wireguard - waiting for DNS before trying to start the interface #30459

Closed

Mic92 reviewed May 30, 2019

Mic92 mentioned this pull request May 31, 2019

nixos/wireguard: should use WG_ENDPOINT_RESOLUTION_RETRIES #62288

Closed

grahamc mentioned this pull request May 31, 2019

wireguard: 0.0.20190406 -> 0.0.20190531 and Change peers without tearing down the interface, handle DNS failures better #62325

Merged

10 tasks

Shados mentioned this pull request Jul 24, 2019

Wireguard doesn't bring up peers #63869

Open
