Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructure acme module #91121

Merged
merged 6 commits into from Sep 6, 2020
Merged

Restructure acme module #91121

merged 6 commits into from Sep 6, 2020

Conversation

m1cr0man
Copy link
Contributor

@m1cr0man m1cr0man commented Jun 19, 2020

Motivation for this change

Closes #86184, closes #62958, closes #18440, closes #89502
Triggered by #84633

Rewriting the acme module since there was too much stuff carried over from when simp-le was used and the requirements have changed dramatically for lego.

  • Use an acme user and group, allow group override only
  • Use hashes to determine when certs actually need to regenerate
  • Avoid running lego more than necessary
  • Harden permissions
  • Support "systemctl clean" for cert regeneration
  • Support reuse of keys between some configuration changes
  • A new acme-finished-${cert}.target which can be relied on as an indicator that cert renewal is completed.
  • Reload services for apache and nginx which are triggered when certs are renewed (this replaces the old postRun hooks which required root).
    • Note I removed RemainAfterExit from the nginx-config-reload service. I don't see this causing any issues.
    • I actually did a bit of syncing between nginx and apache modules. I'm an apache user myself.
  • Stopped using full.pem in sslCertificateFile for both web servers. It's not correct, it should be fullchain, and the key should be loaded separately.

TODO:

  • Solve for the fact that root currently owns all certs Added a migration service to fix permissions
  • Check validMinDays in the start script against cert mtime Bad idea because selfsigned certs might be there
  • Add a note about multiple account creation and emails Done
  • Migrate extraDomains to a list Done
  • Deprecate user option Done, along with activationDelay
  • Set up selfsigned certs, consider using minica Done. Way nicer than the old method. Split into 2 services to avoid race conditions.
  • Update tests
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@m1cr0man
Copy link
Contributor Author

ping @NixOS/acme

@m1cr0man
Copy link
Contributor Author

m1cr0man commented Jul 9, 2020

Everything is working now. Hash-based cert folders works really nicely for triggering cert refreshes when configuration is changed. Before I take the PR out of WIP I want to write the new tests.

I would also like to know if anyone with more experience can tidy up the submodule (security.acme.certs.) changed + removed option assertions. I couldn't find a way to make use of functions in lib.modules so I had to partially reimplement them and I'm personally not happy with it.

@m1cr0man
Copy link
Contributor Author

Tests have now been rewritten. I incorporated some practical cert tests utilising openssl, and actually fixed a couple of my own bugs in the process :)

Left to do would be update the docs holistically, in particular adding instructions for complete cert renewal with systemctl clean. Beyond that, I would like feedback on whether I should do the following:

  • In tests, drop common/client and common/server (but keep snakeoil-certs.nix). They are only used in the acme testing.
  • Merge the dnsserver node into the acme node, thus reducing the number of nodes needed for the test to 3.

I'm opening this PR for review now, since it's in a production ready form and I have made use of it myself already.

@m1cr0man m1cr0man marked this pull request as ready for review July 15, 2020 00:14
@m1cr0man
Copy link
Contributor Author

I have updated the PR description with some more change summaries.

nixos/modules/security/acme.nix Show resolved Hide resolved
certToConfig = cert: data: let
acmeServer = if data.server != null then data.server else cfg.server;
useDns = data.dnsProvider != null;
keyName = builtins.replaceStrings ["*"] ["_"] data.domain;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we slugify this only here, and not for cert?

Is this there to make bash happy? Or is this some standard used by other clients as well? If it's only the former, I'd rather see our shell script being able to handle * (as we put it in double quotes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a "feature" of minica and lego. They replace the star with an underscore, so in order to reference the output files I need to do this myself too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An end user will never see this either, we rename all the certs in /var/lib/acme/ to something more easily referenced. I might add a test which covers having a star in the cert variable to ensure that the scripts always work in that scenario.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Please add a amsll comment to it next to keyName, so this is understandable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a bit of testing, I discovered that if ${cert} contains a * it breaks things in multiple places, most crucially it broke the compilation of the systemd units. I have added an assertion recommending users to use the domain argument as we do in the tests.

@m1cr0man
Copy link
Contributor Author

I thought of a few more cases where things may go wrong, and added tests to cover them. In particular, the web server reload services were depending on the target - which stays alive, meaning that the renewal timer wouldn't be triggering a reload and old certs would stay on the web servers.

I encountered some problems ensuring that the reload took place without accidently triggering it as part of the test. The sync
commands I added ended up being essential and I'm not sure why, it seems like either node.succeed ends too early or there's an oddity of the vm's filesystem I'm not aware of.

@m1cr0man
Copy link
Contributor Author

I encountered an issue today in a production environment with NixOS where the certs for my mail server (dovecot and postfix) expired even though they were up to date. Obviously, this was because there was no trigger to reload them. It got me thinking, is a reload service on a per-dependency basis the right solution? For most users, simply specifying a postRun command was sufficient. Maybe we add support for running the postRun scripts as root in a separate service rather than as acme:${group} in the renewal service. Even better, we could add a reloadOnRenew array of services to reload, that way we can append it from various other modules (nginx, httpd, etc) when one becomes dependent on certs.

nixos/modules/misc/ids.nix Outdated Show resolved Hide resolved
nixos/tests/acme.nix Outdated Show resolved Hide resolved
nixos/tests/acme.nix Outdated Show resolved Hide resolved
nixos/tests/acme.nix Outdated Show resolved Hide resolved
@m1cr0man
Copy link
Contributor Author

Thanks for the review @Mic92 . If I could get someone else to give it a test run, and maybe if someone could help with the commented out assertions (aka assertions on submodules) this should be good for a merge.

nixos/modules/misc/ids.nix Outdated Show resolved Hide resolved
- Allow for key reuse when domains are the only thing that
  were changed.
- Fixed systemd service failure when preliminarySelfsigned
  was set to false
Copy link
Member

@mweinelt mweinelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been using several iterations of this locally and I feel confident that this is a big step forward. Thanks @m1cr0man!

@flokli flokli merged commit d704694 into NixOS:master Sep 6, 2020
20.09 Blockers automation moved this from To Do to Done Sep 6, 2020
@Ma27
Copy link
Member

Ma27 commented Sep 14, 2020

While switching a few of my systems to 20.09-alpha I experienced some weird behavior with this module:

apparently my network crashed during a deploy (most likely not related to this change) and then I got the following error in my journald for each acme-<domain>.service:

Sep 12 17:58:47 horst acme-<domain>-start[10440]: 2020/09/12 17:58:47 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: device or resource busy

Whenever I performed a retry with systemctl restart acme-<domain>.service (even after a reboot) I got the following error:

Sep 12 17:59:54 horst systemd[1]: Starting Renew ACME certificate for <domain>...
Sep 12 17:59:55 horst acme-<domain>-start[1043]: 2020/09/12 17:59:55 [INFO] [<domain>] acme: Obtaining bundled SAN certificate
Sep 12 17:59:56 horst acme-<domain>-start[1043]: 2020/09/12 17:59:56 Could not obtain certificates:
Sep 12 17:59:56 horst acme-<domain>-start[1043]:         acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: JWS verification error, url:
Sep 12 17:59:56 horst systemd[1]: acme-<domain>.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:59:56 horst systemd[1]: acme-<domain>.service: Failed with result 'exit-code'.
Sep 12 17:59:56 horst systemd[1]: Failed to start Renew ACME certificate for <domain>.

This could be fixed by (1) erasing the existing account with mv /var/lib/acme/.lego/accounts/<hash> /var/lib/acme/.lego/accounts/<hash>.bak and (2) recreating the tmpfiles with systemd-tmpfiles --create. With this change, my LE account got recreated and since then my ACME cert renewal works fine.

Tbh I don't know the code enough to understand what exactly went wrong which is why I'm posting this observation here. We may want to document this in our manual (or catch this issue if possible in our code), but not sure if any of those suggestions is a good idea.

@m1cr0man
Copy link
Contributor Author

That definitely looks like a bug in Lego rather than in our script, given the accounts data was corrupt/broken. All the script/service does is create symlinks and copy files. It might be worth posting that in an issue over on go-acme/lego.

@Ma27
Copy link
Member

Ma27 commented Sep 15, 2020

It's a known problem it seems: go-acme/lego#1006

@mohe2015
Copy link
Contributor

mohe2015 commented Oct 3, 2020

@m1cr0man I forgot to update the dns records first so it failed to enable ssl for three subdomains (letsencrypt). After I fixed that I got "Error creating new order :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/"

There is a Failed Validation limit of 5 failures per account, per hostname, per hour. [...] Exceeding the Failed Validations limit is reported with the error message too many failed authorizations recently.

This doesn't make sense. Do you know whether there are multiple attempts and if how you can disable that?

@m1cr0man
Copy link
Contributor Author

m1cr0man commented Oct 3, 2020

@m1cr0man I forgot to update the dns records first so it failed to enable ssl for three subdomains (letsencrypt)
...
This doesn't make sense. Do you know whether there are multiple attempts and if how you can disable that?

Can you check how many times the acme-$domain.service was run, by looking at journalctl -u acme-$domain.service? It may be that the systemd service automatically retried. Since accounts are shared across certs you would easily rack up that 5 failures per hour. You will definitely see from the logs whether it was a systemd retry or some functionality within lego.

@worldofpeace worldofpeace added this to In progress in 20.09 Blockers via automation Oct 5, 2020
@worldofpeace worldofpeace removed this from Done in 20.09 Blockers Oct 5, 2020
@worldofpeace worldofpeace moved this from In progress to Done in 20.09 Blockers Oct 5, 2020
script = with builtins; concatStringsSep "\n" (mapAttrsToList (cert: data: ''
for fixpath in /var/lib/acme/${escapeShellArg cert} /var/lib/acme/.lego/${escapeShellArg cert}; do
if [ -d "$fixpath" ]; then
chmod -R 750 "$fixpath"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this makes regular files (e.g. /var/lib/acme/hydra.nixos.org/fullchain.pem) executable, which is a bit ugly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chmod -R u=rwX,g=rX,o= "$fixpath"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback + the hint. I'll open a PR if no one beats me to it :)

ckSJ/EkxuwT/ZYLqCAKSFGMlFhad9g1Zyvd67XgfZq5p0pJTtGxtn5j8QHy6PM6m
NbjvWnP8lDU8j2l3eSG58S14iGs=
-----END CERTIFICATE-----
'';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I care a lot anymore, but note that there is a reason why these were added here instead of being generated: If you substitute the test-certs derivation but build the ca-certificates.crt on a different host, you'll get mismatched certificates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't really follow you on this one? The CA certificates and the test certs are generated in the same derivation now, so they can't drift from each other. My understanding of the cache is that for any one execution of the tests all nodes will use the same nix store and cache and thus it doesn't matter which it chooses from (local build or cache) when setting the security.pki.certificateFiles. Unless I am missing something?

Copy link
Member

@aszlig aszlig Oct 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that the CA certs are copied, so while your machine might have test-certs version A, it might use ca-certificates.crt built (or in this case substituted) using test-certs version B.

edit: To illustrate this a bit better, consider this situation:

local store:

  • /nix/store/1234-test-certs - hash: abcd9876

remote store:

  • /nix/store/1234-test-certs - hash: dcba9182
  • /nix/store/5678-ca-certificates.crt

So if you're going to build the test, you substitute /nix/store/5678-ca-certificates.crt from the remote store and thus get the CA certificate from the test-certs derivation with content hash dcba9182 while the test is actually generating the server certificates from abcd9876.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I follow you now. I had to revise my understanding of derivations, and as the docs highlight the hash is based on the buildInputs, name and output name. And then the chaos you have explained can happen since the output isn't deterministic based on those properties, and caCertificates can drift.

I'm not keen to hard code the certs back in, but there really is no other solution right? I already read back on your commits to see how you tried to solve it before. I was trying to think of something involving builtins.mkHash + builtins.readFile but that still depends on the "broken" certs derivation. If only outputHashMode + outputHashAlgo could be set without outputHash itself, resulting in a generated outputHash based on the cert files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah hang on - cacerts has preferLocalBuild set to true, so surely setting preferLocalBuild for test certs too will ensure consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that's good to know. Putting them in separate files in the repo sounds like a good solution. I'll do that tonight, and I'll keep the minica based generator there as a solution for generating the SSL certs incase they must be regenerated in the future.

Copy link
Member

@aszlig aszlig Oct 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@m1cr0man: Hm, one way to address that would be if minica would be fully deterministic when generating keys/certs, which would help in a lot of other scenarios when testing with TLS enabled.

edit: Lemme give this a try and make a PoC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, here is the working PoC (checked several times with nix-build --check). However implementing this in a sane™ way is not so easy, since Go's crypto library tries to prevent determinism via randutil.MaybeReadByte.

So we either need to copy & paste a deterministic version (which just doesn't use randutil.MaybeReadByte) of rsa.GenerateMultiPrimeKey into minica or "simply" patch go (which I did in the PoC). The latter has the advantage that it would also work if we'd want to have eg. ECDSA certs, while we'd need to create another copy for the former.

Maybe a better way to implement this would be to find a crypto library which doesn't try to prevent determinism and use that to generate the certs/keys.

Of course, another question here is the way we expose this, since we certainly don't want people to use this for other purposes than testing but on the other side I think it's useful to have this generally available for all sorts of tests.

And the last issue - which certainly is a bigger one that's even more prevalent with the hardcoded certificates - is certificate expiry, which minica sets to 100 years for the CA root certificate and to 2 years and 30 days for every server certificate. One way out of this would be if run our VM tests with a fixed time in general, but that's something for another pull request... ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really just go back to having a .nix to regenerate the private keys and storing them in the repository. Lots of other tests do the same already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree @emilazy , but not to just immediately drop @aszlig suggestion I think the better course of action there would be to make an issue over on minica (and maybe open a PR? I might try) for a way to deterministically generate test certs.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/lets-encrypt-on-20-09/9950/2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects