Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix letsencrypt #60219

Merged
merged 3 commits into from Aug 29, 2019
Merged

Fix letsencrypt #60219

merged 3 commits into from Aug 29, 2019

Conversation

arianvp
Copy link
Member

@arianvp arianvp commented Apr 25, 2019

Motivation for this change

Fixes #60180

Futhermore refactors the module a bit. I'm trying to get rid of PermissionStartOnly

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@arianvp
Copy link
Member Author

arianvp commented Jun 30, 2019

I am planning to deprecate the activationDelay option as its implementation is a bit confusing, and it can better be replaced with systemd dependencies instead of waiting for an arbitrary time. (e.g. what happens if the time passes, but whatever script you run during the delay didn't finish??). It is currently implemented by calling systemctl start in a postRun which requires PermissisionStartOnly which I want to get rid of because it is deprecated

I think a good replacement instead would be to create a acme-wait-dns-published.service yourself ,, similar to network-wait-online.service which busyloops until a condition is met and then do:

# acme-wait-dns-published.service
RequiredBy=acme-${cert}.service
After=acme-${cert}.service

This way, the delay isn't an arbitrary number, but mandated by the script that you use to publish to DANE/TLSA.
and it greatly simplifies the letsencrypt activation logic, which is now hard to follow.

See e70d293#commitcomment-34131909 for context

@arianvp
Copy link
Member Author

arianvp commented Jun 30, 2019

CC @pngwjpgh

@grahamc
Copy link
Member

grahamc commented Jun 30, 2019

SGTM. Any fixed "delay" value is inherently wrong.

@arianvp
Copy link
Member Author

arianvp commented Jun 30, 2019

One big problem:

You can't waitForUnit on oneshot services, as their state goes from inactive => activating => inactive. This makes the tests currently fail.

If anybody has an idea on how to test for oneshots firing, please let me know

@grahamc
Copy link
Member

grahamc commented Jun 30, 2019 via email

@arianvp
Copy link
Member Author

arianvp commented Jun 30, 2019 via email

@arianvp arianvp marked this pull request as ready for review July 1, 2019 07:24
@arianvp
Copy link
Member Author

arianvp commented Jul 1, 2019

@grahamc I've fixed the test by letting each oneshot activate a target (Which remains active after running) to use as a probe to test for unit activation. It's a bit hacky, but it works!

@arianvp
Copy link
Member Author

arianvp commented Jul 1, 2019

@GrahamcOfBorg test acme

@arianvp
Copy link
Member Author

arianvp commented Jul 1, 2019

A few problems remain (that were already present before I started coding):

  1. certificates add dependencies to nginx and lighttpd even if they're not used by said services.
    Say you create a certificate, and an nginx vhost, but that vhost doesn't have enableACME. then nginx will still wait for certificates to refresh even if those aren't used by nginx at all
  2. lighttpd seems to have no direct ACME enable support whatsoever so it seems weird to make lighttpd.service pull in all these certs even if it doesn't use them

Number 2. was introduced by @bjornfor in 7a0e958

Would you mind if I remove this coupling altogether, and document instead that people
can do:

lighttpd.wants  =  [ "acme-${cert}.service acme-selfsigned-${cert}.service" ];
lighttpd.after = ["acme-selfsigned-${cert}.service"];

whenever you want to depend on certs being generated on startup?

For nginx we can probably do this automatically when a vhost has enableACME = true; but lets make sure it only pulls in the certs for that specific vhost instead of all certs

This will also fix #62958

@bjornfor
Copy link
Contributor

bjornfor commented Jul 1, 2019

@arianvp: I don't mind. I only added that line because nginx had it, and it felt like the natural thing to do at that time.

@arianvp
Copy link
Member Author

arianvp commented Jul 13, 2019

This is ready for review. I can refactor even more but lets keep that for a new PR or a new NixOS Module altogether. At least the bugs that were present before are fixed now.

@arianvp
Copy link
Member Author

arianvp commented Jul 13, 2019

I think we should do the big refactor to DNS challenges and lego in a separate PR and maybe even in a separate NixOS module. it's not worth keeping backwards compatibility with what we have currently I think. It might hamper us coming up with better designs. @mweinelt would you perhaps want to take a look at this PR?

nixos/modules/rename.nix Outdated Show resolved Hide resolved
@arianvp arianvp force-pushed the fix-letsencrypt branch 2 times, most recently from 910c6cb to 89af1ae Compare August 25, 2019 11:48
Copy link
Contributor

@flokli flokli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rebase this on latest master?

nixos/doc/manual/release-notes/rl-1909.xml Show resolved Hide resolved
When subsequent certificates would be added, they would
not wake up nginx correctly due to target units only being triggered
once. We now added more fine-grained systemd dependencies to make sure
nginx always is aware of new certificates and doesn't restart too early
resulting in a crash.

Furthermore, the acme module has been refactored. Mostly to get
rid of the deprecated PermissionStartOnly systemd options which were
deprecated. Below is a summary of changes made.

* Use SERVICE_RESULT to determine status
This was added in systemd v232. we don't have to keep track
of the EXITCODE ourselves anymore.

* Add regression test for requesting mutliple domains

* Deprecate 'directory' option
We now use systemd's StateDirectory option to manage
create and permissions of the acme state directory.

* The webroot is created using a systemd.tmpfiles.rules rule
instead of the preStart script.

* Depend on certs directly

By getting rid of the target units, we make sure ordering
is correct in the case that you add new certs after already
having deployed some.

Reason it broke before:  acme-certificates.target would
be in active state, and if you then add a new cert, it
would still be active and hence nginx would restart
without even requesting a new cert. Not good!  We
make the dependencies more fine-grained now. this should fix that

* Remove activationDelay option

It complicated the code a lot, and is rather arbitrary. What if
your activation script takes more than activationDelay seconds?

Instead, one should use systemd dependencies to make sure some
action happens before setting the certificate live.

e.g. If you want to wait until your cert is published in DNS DANE /
TLSA, you could create a unit that blocks until it appears in DNS:

```
RequiredBy=acme-${cert}.service
After=acme-${cert}.service
ExecStart=publish-wait-for-dns-script
```
arianvp and others added 2 commits August 29, 2019 16:13
Co-Authored-By: Florian Klink <flokli@flokli.de>
@flokli flokli merged commit 604b7c1 into NixOS:master Aug 29, 2019
@flokli
Copy link
Contributor

flokli commented Aug 29, 2019

Thanks!

@arianvp arianvp deleted the fix-letsencrypt branch August 29, 2019 14:39
ikervagyok added a commit to ikervagyok/nixpkgs that referenced this pull request Aug 29, 2019
@Shados
Copy link
Member

Shados commented Sep 2, 2019

FWIW, I was actually using security.acme.directory, so removing that has given me some grief >.>.

@arianvp
Copy link
Member Author

arianvp commented Sep 2, 2019

Feel free to come up with an alternative. We could use systemd.tmpfiles for example to manage the directory instead of StateDirectory. What is important is that we have the behaviour without depending on the deprecated PermissionStartOnly behaviour of systemd. however I don't personally have time to work on this at least the coming two weeks. But I'm happy to review the PR that reintroduces that option. I would poke @aanderse for advice if he has time.

@Shados
Copy link
Member

Shados commented Sep 2, 2019

Not sure tmpfiles is the right approach either... the usage of it in the refactored module is also somehow not working right for me -- the activation script snippet for it is failing with "Detected unsafe path transition" =/. I'll take a look into this tonight.

@arianvp
Copy link
Member Author

arianvp commented Sep 2, 2019

This is error log is generated because of:

        /* Returns true if the transition from a to b is safe, i.e. that we never transition from unprivileged to
         * privileged files or directories. Why bother? So that unprivileged code can't symlink to privileged files
         * making us believe we read something safe even though it isn't safe in the specific context we open it in. */

https://github.com/systemd/systemd/blob/6fd79cca68a170c88e87b5d95cd7a44953df94e2/src/basic/fs-util.c#L663
*/

Note that it's just a warning that is logged. not something that makes fatally fail. But still; it's something worth having a look at.

Poking @aanderse

I still think that tmpfiles is the right approach here. Note that tmpfiles is a bit of a deceptive name (because of historic reasons) but is the adviced tool to use if StateDirectory= does not suffice to set up state directories for a service (e.g. when you want to allow state to live outside of /var/lib) (See #56265 for context)

@aanderse
Copy link
Member

aanderse commented Sep 2, 2019

  1. Entirely out of curiosity: what directory are you using, and will symlinks not solve your problem?
  2. If tmpfiles.d is not explicitly told who ownsa directory it will assign root as the owner. This includes parent directories of the directory you specify. Under certain permission schemes tmpfiles.d decides this is not safe (the right choice) and throws the error/warning. See nixos/deluge: fix directory creation errors #67888 for concrete example.

I'm happy to hear the what and why of your custom directory to change my opinion, but as it currently stands if symlinks would do the job I continue to agree with the decision to remove the directory option.

@Shados
Copy link
Member

Shados commented Sep 2, 2019

@arianvp Hmm, yeah, I see why my setup is causing the warning now. My config predates services.nginx.virtualHosts.<name>.enableACME existing, so the link between the acme webroot and the nginx one was manually defined, and the ownership of that portion of the directory tree ended up being a little odd. I'll change it to use the newer stuff, which is nicer, so that's good.

@aanderse Sure, I can detail the reasoning a bit: essentially, I keep any service-related data that is worth backing up* under a directory tree rooted at /srv, which simplifies administration in a few ways (e.g. I don't have to back up different service directories per-server, I can configure filesystem/disk-related stuff for all relevant service data on a system in a single place).

/var/lib doesn't qualify for this by because it is just generically for service/application state, not specifically for stuff I consider valuable, meaning it inevitably ends up containing a bunch of random stuff I don't care about and don't want consuming backup space. The semantics don't match up.

Could I use a symlink? Often. But then I have more problems:

  • If I decide to manually create the symlink, I now have to remember to do that if I am ever moving or replicating this configuration elsewhere.
  • If I decide to write a service or activation script or something else to ensure the symlink exists, I then have to test that, ensure it has the appropriate dependencies, and try to cover any edge cases that might come up.
  • I can always symlink from /srv/<something> to elsewhere, but I can't always do the reverse -- the application in question may not like being given a symlink in its path, for one reason or another, so instead I'd have to use a bind mount (which then has its own potential issues).

Pointing the service in question directly to the directory I actually want it to use is always going to be the better option.

*: By "worth backing up", I mean "not trivially reproduceable". LE certs fall under this classification because acquiring them is rate-limited, so having to recreate them en masse would potentially be problematic. Combine that with HTTP Strict Transport Security and you could end up with effectively inaccessible domains for a while.

@aanderse
Copy link
Member

aanderse commented Sep 2, 2019

@Shados to address your first two points: if you use systemd.tmpfiles.rules to create the symlink and those problems disappear. I see your third point as something worth serious consideration, but on a service by service basis.

I'm not against having an option for specifying the state directory when there is a justification for such an option (like an application that can't deal with symlinks as you mentioned), but it would be nice if there was justification, as opposed to preference. But then again... this is just my opinion, and I don't make the rules.

@Shados
Copy link
Member

Shados commented Sep 4, 2019

@aanderse I got around to taking the symlink approach with systemd.tmpfiles.rules. It works, but it is deceptively complicated, and it quickly became apparent that this is really not what systemd-tmpfiles is designed for. Here's a bit of a post-mortem, so to speak:

Initially, I went with just a directory entry for /srv/acme (with the same user and owner that I have the acme services running as), and a symlink entry for /var/lib/acme -> /srv/acme, but this caused an issue.

Namely, I was back to getting the "unsafe path transition" warnings. And while this is just a warning for systemd-tmpfiles, and it still creates the path(s), it is an error for NixOS because the non-0 return code causes the activation script snippet to fail, which means the system profile activation does not continue.

The cause of the warning was: systemd-tmpfiles will ensure that leading directories are implicitly created if needed, but they will be owned by root with an access mode of 0755. As a result, there was an unsafe path transition going from /srv/acme to /srv/acme/acme-challenge, because the latter is implicitly created as root on the way to making /var/lib/acme/acme-challenge/.well-known/acme-challenge (through the symlink).

man tmpfiles.d says In order to create [the leading directories] with different modes or ownership make sure to add appropriate d lines, so I wrote a Nix helper to map a portion of a path to a list of largely identical systemd.tmpfiles.rules directory creation lines. I do note an issue with this approach currently:
If you have two separate entries creating the same directory with different permissions or ownership, you'll get a warning on stderr to that effect during activation, but it appears the return code is still 0, so it is non-obvious.
We could solve that in NixOS by making systemd.tmpfiles.rules a structured attrsOf tmpfileRuleType instead of a listOf str, and I would strongly suggest this be done if the recommendation going forward is to handle service/etc. directory creation using systemd.tmpfiles.rules. It would also then be possible to catch some potential "unsafe path transitions" at eval-time.

All up: I do think that using a tool designed to declaratively manipulate a filesystem is the right approach to solving this kind of issue in NixOS, but that using systemd-tmpfiles as-is may not be. It ended up taking about as much testing and handling of edge cases as I expected a hand-written, imperative approach would have.

dtzWill pushed a commit to dtzWill/nixpkgs that referenced this pull request Sep 12, 2019
* nixos/acme: Fix ordering of cert requests

When subsequent certificates would be added, they would
not wake up nginx correctly due to target units only being triggered
once. We now added more fine-grained systemd dependencies to make sure
nginx always is aware of new certificates and doesn't restart too early
resulting in a crash.

Furthermore, the acme module has been refactored. Mostly to get
rid of the deprecated PermissionStartOnly systemd options which were
deprecated. Below is a summary of changes made.

* Use SERVICE_RESULT to determine status
This was added in systemd v232. we don't have to keep track
of the EXITCODE ourselves anymore.

* Add regression test for requesting mutliple domains

* Deprecate 'directory' option
We now use systemd's StateDirectory option to manage
create and permissions of the acme state directory.

* The webroot is created using a systemd.tmpfiles.rules rule
instead of the preStart script.

* Depend on certs directly

By getting rid of the target units, we make sure ordering
is correct in the case that you add new certs after already
having deployed some.

Reason it broke before:  acme-certificates.target would
be in active state, and if you then add a new cert, it
would still be active and hence nginx would restart
without even requesting a new cert. Not good!  We
make the dependencies more fine-grained now. this should fix that

* Remove activationDelay option

It complicated the code a lot, and is rather arbitrary. What if
your activation script takes more than activationDelay seconds?

Instead, one should use systemd dependencies to make sure some
action happens before setting the certificate live.

e.g. If you want to wait until your cert is published in DNS DANE /
TLSA, you could create a unit that blocks until it appears in DNS:

```
RequiredBy=acme-${cert}.service
After=acme-${cert}.service
ExecStart=publish-wait-for-dns-script
```

(cherry picked from commit 604b7c1)
dtzWill pushed a commit to dtzWill/nixpkgs that referenced this pull request Sep 12, 2019
* nixos/acme: Fix ordering of cert requests

When subsequent certificates would be added, they would
not wake up nginx correctly due to target units only being triggered
once. We now added more fine-grained systemd dependencies to make sure
nginx always is aware of new certificates and doesn't restart too early
resulting in a crash.

Furthermore, the acme module has been refactored. Mostly to get
rid of the deprecated PermissionStartOnly systemd options which were
deprecated. Below is a summary of changes made.

* Use SERVICE_RESULT to determine status
This was added in systemd v232. we don't have to keep track
of the EXITCODE ourselves anymore.

* Add regression test for requesting mutliple domains

* Deprecate 'directory' option
We now use systemd's StateDirectory option to manage
create and permissions of the acme state directory.

* The webroot is created using a systemd.tmpfiles.rules rule
instead of the preStart script.

* Depend on certs directly

By getting rid of the target units, we make sure ordering
is correct in the case that you add new certs after already
having deployed some.

Reason it broke before:  acme-certificates.target would
be in active state, and if you then add a new cert, it
would still be active and hence nginx would restart
without even requesting a new cert. Not good!  We
make the dependencies more fine-grained now. this should fix that

* Remove activationDelay option

It complicated the code a lot, and is rather arbitrary. What if
your activation script takes more than activationDelay seconds?

Instead, one should use systemd dependencies to make sure some
action happens before setting the certificate live.

e.g. If you want to wait until your cert is published in DNS DANE /
TLSA, you could create a unit that blocks until it appears in DNS:

```
RequiredBy=acme-${cert}.service
After=acme-${cert}.service
ExecStart=publish-wait-for-dns-script
```

(cherry picked from commit 604b7c1)
dpausp pushed a commit to flyingcircusio/nixpkgs that referenced this pull request Sep 24, 2019
* nixos/acme: Fix ordering of cert requests

When subsequent certificates would be added, they would
not wake up nginx correctly due to target units only being triggered
once. We now added more fine-grained systemd dependencies to make sure
nginx always is aware of new certificates and doesn't restart too early
resulting in a crash.

Furthermore, the acme module has been refactored. Mostly to get
rid of the deprecated PermissionStartOnly systemd options which were
deprecated. Below is a summary of changes made.

* Use SERVICE_RESULT to determine status
This was added in systemd v232. we don't have to keep track
of the EXITCODE ourselves anymore.

* Add regression test for requesting mutliple domains

* Deprecate 'directory' option
We now use systemd's StateDirectory option to manage
create and permissions of the acme state directory.

* The webroot is created using a systemd.tmpfiles.rules rule
instead of the preStart script.

* Depend on certs directly

By getting rid of the target units, we make sure ordering
is correct in the case that you add new certs after already
having deployed some.

Reason it broke before:  acme-certificates.target would
be in active state, and if you then add a new cert, it
would still be active and hence nginx would restart
without even requesting a new cert. Not good!  We
make the dependencies more fine-grained now. this should fix that

* Remove activationDelay option

It complicated the code a lot, and is rather arbitrary. What if
your activation script takes more than activationDelay seconds?

Instead, one should use systemd dependencies to make sure some
action happens before setting the certificate live.

e.g. If you want to wait until your cert is published in DNS DANE /
TLSA, you could create a unit that blocks until it appears in DNS:

```
RequiredBy=acme-${cert}.service
After=acme-${cert}.service
ExecStart=publish-wait-for-dns-script
```

FCIO:

* cherry-picked from nixpkgs-unstable ba447fa
* upstream PR NixOS#60219
* included in 19.09
dpausp added a commit to flyingcircusio/nixpkgs that referenced this pull request Sep 24, 2019
FCIO:

This is a patch for the upstream PR NixOS#60219.
It improved the situation but there were still failures because the
challenge was sent to the webserver before it was accepting requests for
the new domain. Maybe using another LE client would fix this, too.
Should be discussed with upstream.
@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/please-update-to-latest-19-09-if-using-letsencrypt/4595/3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants