nixos/security/acme: order after nss-lookup.target #99901

Merged
merged 1 commit into from Oct 6, 2020

andir
Member

@andir andir commented Oct 6, 2020

Motivation for this change

This should hopefully solve races with DNS servers (such as unbound)
during the activation of a new generation. Previously, unbound could
still be unavailable when the acme script ran, so the script would fail.
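
For readers on a release without this change, the ordering can be sketched as a local override. This is a minimal sketch and not the diff of this PR: the certificate name `example.org` is a placeholder, and the `acme-<cert>` unit naming is inferred from the log excerpt further down in this thread.

```nix
{ ... }:
{
  # Hypothetical certificate name; replace it with your own.
  systemd.services."acme-example.org" = {
    # Order the unit after nss-lookup.target so it starts only once
    # host name lookups are expected to work. List options are merged
    # by the module system, so this adds to any existing `after` entries.
    after = [ "nss-lookup.target" ];
  };
}
```

Note that `nss-lookup.target` is a passive target: ordering after it only helps if the local DNS service orders itself before the target and pulls it in.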

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)

@andir andir requested a review from m1cr0man October 6, 2020 21:03
@andir
Member Author

andir commented Oct 6, 2020

This would also be a good candidate for a backport to 20.09 (cc @NixOS/nixos-release-managers @Nixos/acme)

@worldofpeace
Contributor

I approve of a backport of this

@mweinelt
Member

mweinelt commented Oct 20, 2020

Still regularly seeing errors like the one below on 20.09, even with this change.

After=network.target network-online.target acme-fixperms.service nss-lookup.target acme-selfsigned-helios.lossy.network.service
Oct 20 21:05:04 helios acme-helios.lossy.network-start[9096]: 2020/10/20 21:05:04 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: device or resource busy

@andir
Member Author

andir commented Oct 20, 2020

Yeah, I saw them as well. This might be related to the DNS servers being reported as "ready" before they actually are?

This is likely not a complete fix, but IMHO it is still a change worth carrying, since it states the ordering explicitly.

@arianvp
Member

arianvp commented Oct 20, 2020

Having a socket-activated DNS server like resolved should help here, as systemd will buffer the requests whilst the server is down or restarting.

@mweinelt
Member

Then we need to rebuild unbound with --enable-systemd, since it apparently does support socket activation. https://github.com/NLnetLabs/unbound/blob/master/contrib/unbound.socket.in
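
A rebuild along those lines could be sketched as an overlay. This is an untested sketch under assumptions: only the `--enable-systemd` flag comes from the comment above; the extra build inputs (`pkg-config`, `systemd`) are my assumption about what the configure script needs, and newer nixpkgs revisions may already enable this flag.

```nix
{ ... }:
{
  nixpkgs.overlays = [
    (final: prev: {
      # Rebuild unbound with systemd support (sketch, not the nixpkgs change).
      unbound = prev.unbound.overrideAttrs (old: {
        configureFlags = (old.configureFlags or [ ]) ++ [ "--enable-systemd" ];
        # Assumption: libsystemd is located through pkg-config at configure time.
        # Older nixpkgs revisions may name this attribute `pkgconfig`.
        nativeBuildInputs = (old.nativeBuildInputs or [ ]) ++ [ prev.pkg-config ];
        buildInputs = (old.buildInputs or [ ]) ++ [ prev.systemd ];
      });
    })
  ];
}
```

The rebuilt package alone does not give socket activation; a matching socket unit, such as the one linked above, would still have to be wired up.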

@andir
Member Author

andir commented Oct 20, 2020

> Having a socket-activated DNS server like resolved should help here, as systemd will buffer the requests whilst the server is down or restarting.

Unfortunately I do not consider resolved production ready in any sense. Also (in my case) unbound does proper systemd readiness indication and thus the startup ordering should be fine.

I have a WIP branch for rewriting the unbound expression in nixpkgs as currently it is just cr*p.

@andir
Member Author

andir commented Oct 20, 2020

Previously mentioned unbound rework that might help with this situation: https://github.com/andir/nixpkgs/tree/unbound-systemd
