Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wireguard: restart on failure\nAs a oneshot service, if the startup f… #61971

Merged
merged 1 commit into from May 25, 2019

Conversation

sjau
Copy link

@sjau sjau commented May 23, 2019

Motivation for this change

As a oneshot service, if the startup failed it would never be attempted again. This is problematic when peer's addresses require DNS. DNS may not be reliably available at the time wireguard starts. Converting this to a simple service with Restart and RestartAfter directives allows the service to be reattempted, but at the cost of losing the oneshot semantics.

See issue: #30459

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@Ma27
Copy link
Member

Ma27 commented May 25, 2019

@GrahamcOfBorg test wireguard

As a oneshot service, if the startup failed it would never be attempted again.
This is problematic when peer's addresses require DNS. DNS may not be reliably available at
the time wireguard starts. Converting this to a simple service with Restart
and RestartAfter directives allows the service to be reattempted, but at
the cost of losing the oneshot semantics.

Signed-off-by: Maximilian Bosch <maximilian@mbosch.me>
@Ma27
Copy link
Member

Ma27 commented May 25, 2019

Tested locally and reformatted the commit message 👍

@Ma27 Ma27 merged commit 5fa9351 into NixOS:master May 25, 2019
@Ma27
Copy link
Member

Ma27 commented May 25, 2019

@sjau thanks!

@grahamc
Copy link
Member

grahamc commented May 25, 2019

I think this should probably be backported, what do you think @Ma27?

@Ma27
Copy link
Member

Ma27 commented May 25, 2019

You're absolutely right. Tested on top of release-19.03 and backported as ced7cfc.

@zx2c4
Copy link
Contributor

zx2c4 commented May 26, 2019

This seems like the wrong approach to me. Shouldn't we instead put the right notions into wg(8) so that hacks like this aren't needed?

@zx2c4
Copy link
Contributor

zx2c4 commented May 26, 2019

CC @Mic92

@grahamc
Copy link
Member

grahamc commented May 28, 2019

@zx2c4 what would you suggest wg(8) do instead?

@grahamc
Copy link
Member

grahamc commented May 28, 2019

It is possible this unit will need to be retried many times for an unknowable amount of time before it succeeds in resolving.

@zx2c4
Copy link
Contributor

zx2c4 commented May 28, 2019

Well, right now we do this:

        for (unsigned int timeout = 1000000;;) {
                ret = getaddrinfo(begin, end, &hints, &resolved);
                if (!ret)
                        break;
                timeout = timeout * 3 / 2;
                /* The set of return codes that are "permanent failures". All other possibilities are potentially transient.
                 *
                 * This is according to https://sourceware.org/glibc/wiki/NameResolver which states:
                 *      "From the perspective of the application that calls getaddrinfo() it perhaps
                 *       doesn't matter that much since EAI_FAIL, EAI_NONAME and EAI_NODATA are all
                 *       permanent failure codes and the causes are all permanent failures in the
                 *       sense that there is no point in retrying later."
                 *
                 * So this is what we do, except FreeBSD removed EAI_NODATA some time ago, so that's conditional.
                 */
                if (ret == EAI_NONAME || ret == EAI_FAIL ||
                        #ifdef EAI_NODATA
                                ret == EAI_NODATA ||
                        #endif
                                timeout >= 90000000) {
                        free(mutable);
                        fprintf(stderr, "%s: `%s'\n", ret == EAI_SYSTEM ? strerror(errno) : gai_strerror(ret), value);
                        return false;
                }
                fprintf(stderr, "%s: `%s'. Trying again in %.2f seconds...\n", ret == EAI_SYSTEM ? strerror(errno) : gai_strerror(ret), value, timeout / 1000000.0);
                usleep(timeout);
        }

Do you have any logs that indicate conclusively this is inadequate somehow? I haven't seen any high quality debugging or analysis before this infected bandaid of a patch went in. Let's start with the bad behavior you're seeing. Then we can divise a solution.

@grahamc
Copy link
Member

grahamc commented May 28, 2019

Logs, no, but scenario yes. Imagine a case where you're using wireguard on a hard to reach system with unreliable power and unreliable internet. The power and internet fails. The power comes on, but the internet remains failed for a long time. Eventually the internet comes up. The current behavior in NixOS will allow the tunnel to come back up shortly after.

@zx2c4
Copy link
Contributor

zx2c4 commented May 28, 2019

Gotcha, so you'd like the DNS retry to be infinite?

@grahamc
Copy link
Member

grahamc commented May 28, 2019

I'm don't know if that is the right behavior for wg(8), but somewhere -- yes -- there should be an effectively infinite retry.

As an aside, due to the limitations of how wireguard implements DNS resolution (only doing the lookup once, not obeying TTLs, not supporting multiple A / AAAA records), I wonder if it would be better for wireguard to not do it at all. As an end user, I was surprised by these details. However, that is almost certainly not on topic for this thread.

@zx2c4
Copy link
Contributor

zx2c4 commented May 28, 2019

You stopped responding on IRC mid-discussion following your latest message, so in case that gets lost, I just coded this up: https://git.zx2c4.com/WireGuard/commit/?id=2f53ee1b81915072b55769f1a61a52b392c11daa

Let me know if that helps the situation here.

@sjau
Copy link
Author

sjau commented May 29, 2019

@zx2c4

You might not remember, but we've tried to debug this problem for a few hours together in #wireguard on freenode.

Type = "oneshot";
Type = "simple";
Restart = "on-failure";
RestartSec = "5s";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was a restart limit in systemd or something, which could lead to the service not being restarted infinite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, yes, you need

      unitConfig = {
        StartLimitIntervalSec = 0; # ensure Restart= is always honoured
      };

@Mic92
Copy link
Member

Mic92 commented May 30, 2019

The intention of this PR now matches systemd-networkd's behavior for wireguard regarding trying to resolve infinite.

@zx2c4
Copy link
Contributor

zx2c4 commented May 30, 2019

No it doesn't. This makes systemd restart the service for any failure.

The correct way to approach this is to keep it as a oneshot, but set the environment variable WG_ENDPOINT_RESOLUTION_RETRIES=infinity on the next snapshot.

@zx2c4
Copy link
Contributor

zx2c4 commented May 30, 2019

You might not remember, but we've tried to debug this problem for a few hours together in #wireguard on freenode.

I'm pretty sure the resolution of whatever discussion we had was not anything resembling this PR, which should be reverted.

@sjau
Copy link
Author

sjau commented May 30, 2019

I'm pretty sure the resolution of whatever discussion we had was not anything resembling this PR, which should be reverted.

That's right. There was no solution at all when we discussed that, so the solution til a few days ago was:

https://github.com/sjau/nix-expressions/blob/master/wgStartFix.nix

@zx2c4
Copy link
Contributor

zx2c4 commented May 30, 2019

Well, now something certainly wrong has been committed. Please take ownership of the issue you've created and move forward with a proper fix.

@sjau
Copy link
Author

sjau commented Jun 13, 2019

@zx2c4

Just out of curiosity, why did you ban me from #wireguard on freenode?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants