New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wireguard: restart on failure\nAs a oneshot service, if the startup f… #61971
Conversation
@GrahamcOfBorg test wireguard |
As a oneshot service, if the startup failed it would never be attempted again. This is problematic when peer's addresses require DNS. DNS may not be reliably available at the time wireguard starts. Converting this to a simple service with Restart and RestartAfter directives allows the service to be reattempted, but at the cost of losing the oneshot semantics. Signed-off-by: Maximilian Bosch <maximilian@mbosch.me>
Tested locally and reformatted the commit message 👍 |
@sjau thanks! |
I think this should probably be backported, what do you think @Ma27? |
You're absolutely right. Tested on top of |
This seems like the wrong approach to me. Shouldn't we instead put the right notions into wg(8) so that hacks like this aren't needed? |
CC @Mic92 |
@zx2c4 what would you suggest wg(8) do instead? |
It is possible this unit will need to be retried many times for an unknowable amount of time before it succeeds in resolving. |
Well, right now we do this: for (unsigned int timeout = 1000000;;) {
ret = getaddrinfo(begin, end, &hints, &resolved);
if (!ret)
break;
timeout = timeout * 3 / 2;
/* The set of return codes that are "permanent failures". All other possibilities are potentially transient.
*
* This is according to https://sourceware.org/glibc/wiki/NameResolver which states:
* "From the perspective of the application that calls getaddrinfo() it perhaps
* doesn't matter that much since EAI_FAIL, EAI_NONAME and EAI_NODATA are all
* permanent failure codes and the causes are all permanent failures in the
* sense that there is no point in retrying later."
*
* So this is what we do, except FreeBSD removed EAI_NODATA some time ago, so that's conditional.
*/
if (ret == EAI_NONAME || ret == EAI_FAIL ||
#ifdef EAI_NODATA
ret == EAI_NODATA ||
#endif
timeout >= 90000000) {
free(mutable);
fprintf(stderr, "%s: `%s'\n", ret == EAI_SYSTEM ? strerror(errno) : gai_strerror(ret), value);
return false;
}
fprintf(stderr, "%s: `%s'. Trying again in %.2f seconds...\n", ret == EAI_SYSTEM ? strerror(errno) : gai_strerror(ret), value, timeout / 1000000.0);
usleep(timeout);
} Do you have any logs that indicate conclusively this is inadequate somehow? I haven't seen any high quality debugging or analysis before this infected bandaid of a patch went in. Let's start with the bad behavior you're seeing. Then we can divise a solution. |
Logs, no, but scenario yes. Imagine a case where you're using wireguard on a hard to reach system with unreliable power and unreliable internet. The power and internet fails. The power comes on, but the internet remains failed for a long time. Eventually the internet comes up. The current behavior in NixOS will allow the tunnel to come back up shortly after. |
Gotcha, so you'd like the DNS retry to be infinite? |
I'm don't know if that is the right behavior for wg(8), but somewhere -- yes -- there should be an effectively infinite retry. As an aside, due to the limitations of how wireguard implements DNS resolution (only doing the lookup once, not obeying TTLs, not supporting multiple A / AAAA records), I wonder if it would be better for wireguard to not do it at all. As an end user, I was surprised by these details. However, that is almost certainly not on topic for this thread. |
You stopped responding on IRC mid-discussion following your latest message, so in case that gets lost, I just coded this up: https://git.zx2c4.com/WireGuard/commit/?id=2f53ee1b81915072b55769f1a61a52b392c11daa Let me know if that helps the situation here. |
You might not remember, but we've tried to debug this problem for a few hours together in #wireguard on freenode. |
Type = "oneshot"; | ||
Type = "simple"; | ||
Restart = "on-failure"; | ||
RestartSec = "5s"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there was a restart limit in systemd or something, which could lead to the service not being restarted infinite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness, yes, you need
unitConfig = {
StartLimitIntervalSec = 0; # ensure Restart= is always honoured
};
The intention of this PR now matches systemd-networkd's behavior for wireguard regarding trying to resolve infinite. |
No it doesn't. This makes systemd restart the service for any failure. The correct way to approach this is to keep it as a oneshot, but set the environment variable |
I'm pretty sure the resolution of whatever discussion we had was not anything resembling this PR, which should be reverted. |
That's right. There was no solution at all when we discussed that, so the solution til a few days ago was: https://github.com/sjau/nix-expressions/blob/master/wgStartFix.nix |
Well, now something certainly wrong has been committed. Please take ownership of the issue you've created and move forward with a proper fix. |
Just out of curiosity, why did you ban me from #wireguard on freenode? |
Motivation for this change
As a oneshot service, if the startup failed it would never be attempted again. This is problematic when peer's addresses require DNS. DNS may not be reliably available at the time wireguard starts. Converting this to a simple service with Restart and RestartAfter directives allows the service to be reattempted, but at the cost of losing the oneshot semantics.
See issue: #30459
Things done
sandbox
innix.conf
on non-NixOS)nix-shell -p nix-review --run "nix-review wip"
./result/bin/
)nix path-info -S
before and after)