Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/acme: Fixes for account creation and remove tmpfiles usage #106857

Merged
merged 9 commits into from Jan 27, 2021

Conversation

m1cr0man
Copy link
Contributor

@m1cr0man m1cr0man commented Dec 13, 2020

Motivation for this change

Closes #106565
Possibly closes some other ACME tickets related to JWS token errors too. A new systemd target named acme-account-$hash.target has been added, which is depended on (requiredBy + after) by all but 1 cert service associated with the same account. The one remainder is depended on (requires + before) to do the account creation.
Note that I changed the hash data for the account directory which will trigger new account creation on all systems. Given that rate limit issues should now be resolved, this will simply test that I did a good enough job of it ;)

Closes #106603
Umask is changed to 0023 to accommodate for lighttpd

Closes #103121
Remove systemd-tmpfiles dependency follows from this discussion and may solve has been reported to solve #103121. I was tempted to totally remove the use of tmpfiles but I couldn't find a good home for the webroot directory creation. Ideally, it should already be created by the user but since most users don't set a webroot at all and expect it to "just work" through httpd/nginx I left it in.

Docs update are primarily to cover follow-up in #81634 and fix the systemctl clean command.

In the test suite, I added a wait for "network-online.target" on the pebble server due to a one time test failure I had because the host wasn't reachable. I hope this resolves it, but it's hard to test.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@m1cr0man
Copy link
Contributor Author

Heads up @NixOS/acme

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/lets-encrypt-on-20-09/9950/13

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/migrating-letsencrypt-acme-certificates-from-another-server/9925/2

m1cr0man added a commit to m1cr0man/nixpkgs that referenced this pull request Dec 18, 2020
This means that all systems running from master will trigger
new certificate creation on next rebuild. Race conditions around
multiple account creation are fixed in NixOS#106857, not this commit.
@symphorien
Copy link
Member

I didn't read this in detail, but I wonder if using a systemd target is the right thing. Starting a target twice will not start the dependent services twice if I understand correctly.

Closes NixOS#106565
When generating multiple certificates which all
share the same server + email, lego will attempt
to create an account multiple times. By adding an
account creation target certificates which share
an account will wait for one service (chosen at
config build time) to complete first.
systemd-tmpfiles is no longer required for
most of the critical paths in the module. The
only one that remains is the webroot
acme-challenge directory since there's no
other good place for this to live and forcing
users to do the right thing alone will only
create more issues.
Closes NixOS#106603
Some webservers (lighttpd) require that the
files they are serving are world readable. We
do our own chmods in the scripts anyway, and
lego has sensible permissions on its output
files, so this change is safe enough.
The instructions on recreating the cert were missing --what=state.
Also added a note on ensuring the group of manual certs is correct.
@m1cr0man
Copy link
Contributor Author

Rebased

@flokli
Copy link
Contributor

flokli commented Dec 28, 2020

This is now missing all the cherry-picked-by lines (did it miss them before?).

Backport PRs should always be cherry-picked with the -x parameter.

Nvm, I confused this with #101482.

flokli pushed a commit that referenced this pull request Dec 28, 2020
This means that all systems running from master will trigger
new certificate creation on next rebuild. Race conditions around
multiple account creation are fixed in #106857, not this commit.

(cherry picked from commit e312039)
@m1cr0man
Copy link
Contributor Author

Uh, this isn't a backport? These PRs are fresh. I accidently cherry-picked and pushed your commit for #107769 in here, then reverted it.

I found a logical error in the bash script, but during
debugging I enabled command echoing and realised it
would be a good idea to have it enabled all the time for
ease of bug reporting.
@ryantm
Copy link
Member

ryantm commented Jan 20, 2021

This worked for me.

@Gaelan
Copy link
Contributor

Gaelan commented Jan 26, 2021

Anything that needs to be done before this is merged?

@m1cr0man
Copy link
Contributor Author

Nope, I don't have any more changes to make to this.

@ryantm
Copy link
Member

ryantm commented Jan 26, 2021

Can we get some more approvals from the ACME team?

@aanderse @andrew-d @arianvp @emilazy @flokli

@m1cr0man
Copy link
Contributor Author

I would also like to highlight the other smaller acme PRs I've had open for months. They too need to be merged for some essential fixes. At this point, I would prefer to see this PR merged first then I can work on rebasing and adjusting the other PRs as necessary.

@aanderse
Copy link
Member

Can we get some more approvals from the ACME team?

@aanderse @andrew-d @arianvp @emilazy @flokli

A number of people have reported that this PR fixes their issues. The proof is in the pudding, so I recommend merge 👍

@flokli flokli merged commit 1030745 into NixOS:master Jan 27, 2021
@flokli
Copy link
Contributor

flokli commented Jan 27, 2021

Let's get this in.

@rubenmoor
Copy link

rubenmoor commented Jan 28, 2021

EDIT: I am not sure whether or not the issue I describe below is related, as I am using nginx whereas #106603 is about lighttpd. I attached the relevant configuration at the end of this comment.

Referring to #106603, that this pull request addresses among others, I still get the 403, using nixops on an AWS EC2 instance.

nix-shell> nixops deploy -d staging

main............> Jan 28 22:11:30 serendipity acme-staging.rubenmoor.net-start[17863]: 2021/01/28 22:11:30 [INFO] [staging.rubenmoor.net] acme: Trying to solve HTTP-01
main............> Jan 28 22:11:36 serendipity acme-staging.rubenmoor.net-start[17863]: 2021/01/28 22:11:36 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/10432246082
main............> Jan 28 22:11:36 serendipity acme-staging.rubenmoor.net-start[17863]: 2021/01/28 22:11:36 [INFO] Unable to deactivate the authorization: https://acme-v02.api.letsencrypt.org/acme/authz-v3/10432246082
main............> Jan 28 22:11:36 serendipity acme-staging.rubenmoor.net-start[17863]: 2021/01/28 22:11:36 Could not obtain certificates:
main............> Jan 28 22:11:36 serendipity acme-staging.rubenmoor.net-start[17863]:         error: one or more domains had a problem:
main............> Jan 28 22:11:36 serendipity acme-staging.rubenmoor.net-start[17863]: [staging.rubenmoor.net] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: Invalid response from http://staging.rubenmoor.net/.well-known/acme-challenge/gLk9Zwwlz1RTXqPBWyOC2i0xuQTDaLtd7nUk89Fl6WA [54.93.104.227]: "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n", url:
main............> Jan 28 22:11:36 serendipity systemd[1]: acme-staging.rubenmoor.net.service: Main process exited, code=exited, status=1/FAILURE
main............> Jan 28 22:11:36 serendipity systemd[1]: acme-staging.rubenmoor.net.service: Failed with result 'exit-code'.
main............> Jan 28 22:11:36 serendipity systemd[1]: Failed to start Renew ACME certificate for staging.rubenmoor.net.
main............> Jan 28 22:11:36 serendipity systemd[1]: acme-staging.rubenmoor.net.service: Consumed 97ms CPU time, received 14.5K IP traffic, sent 7.3K IP traffic.

This is my shell.nix to point to a local clone of nixpkgs: https://github.com/m1cr0man/nixpkgs.git:master

let
  # acme bugfix on 20.09
  pkgsSrc = ~/nixpkgs;

  pkgs = (import pkgsSrc {});
in
pkgs.stdenv.mkDerivation rec {
  name = "nixops-env";
  buildInputs = with pkgs; [ awscli pkgs.nixops ];
  shellHook = ''
    [ -f ~/.aws/vars ] && . ~/.aws/vars
    export NIX_PATH=${pkgs.path}:nixpkgs=${pkgs.path}:.
  '';
}

And I can confirm that, upon visiting http://staging.rubenmoor.net/.well-known/acme-challenge/gLk9Zwwlz1RTXqPBWyOC2i0xuQTDaLtd7nUk89Fl6WA in a browser, that there is 403.

My nixops deployment configuration:

    services = {
      nginx = { # provide acme challenge for ssl certificates
        enable = true;
        virtualHosts = {
          "acme-staging.rubenmoor.net" = {
            serverAliases = [ "staging.rubenmoor.net"];
            locations."/.well-known/acme-challenge" = {
              extraConfig = "autoindex on;";
              root = "/var/lib/acme/.challenges";
            };
            locations."/" = {
              return = "301 https://$host$request_uri";
            };
          };
        };
      };
    };
    security.acme = {
      acceptTerms = true;
      certs."staging.rubenmoor.net" = {
        webroot = "/var/lib/acme/.challenges";
        email = "myemail@gmail.com";
        # subdomains
        # extraDomainNames = [ "foo.domain.name"];
      };
    };
    systemd.services = {
      backend = {
        description = "Serendipity Platform server";
        wantedBy = [ "multi-user.target" ];
        after = [ "mysql.service" "acme-staging.rubenmoor.net.service" ];
        preStart = ''
          rm -rf /root/public
          mkdir -p /root/public/media
          cp ${frontend}/bin/* /root/public/
          staticdir=$(${backend}/bin/backend --get-data-file-path)
          cp -r $staticdir/static/* /root/public/
        '';
        serviceConfig = {
          ExecStart = ''
            ${backend}/bin/backend \
              --public-directory /root/public \
              --media-directory /root/public/media \
              --port 443 \
              --url https://staging.rubenmoor.net \
              --certificate-file /var/lib/acme/staging.rubenmoor.net/fullchain.pem \
              --key-file /var/lib/acme/staging.rubenmoor.net/key.pem
          '';
        };
      };
    };

@m1cr0man
Copy link
Contributor Author

Assuming you're using the default webroot for certs, can you check the permissions and ownership of /var/lib/acme/acme-challenges/.well-known/acme-challenge from acme-challenges up to the end? I suspect, due to this #106857 (comment), that your webroot permissions may be wrong. Please collect that and report back, we can work on a fix from there.

@rubenmoor
Copy link

Assuming you're using the default webroot for certs, can you check the permissions and ownership of /var/lib/acme/acme-challenges/.well-known/acme-challenge from acme-challenges up to the end? I suspect, due to this #106857 (comment), that your webroot permissions may be wrong. Please collect that and report back, we can work on a fix from there.

Owner and group are acme for all of these folders

/var/lib/acme/.challenges/.well-known/acme-challenge
/var/lib/acme/.challenges/.well-known/
/var/lib/acme/.challenges/
/var/lib/acme/

@m1cr0man
Copy link
Contributor Author

Ah of course, I'm just after reading your config thoroughly. Because you've manually configured your certs through nginx instead of using the builtins (like services.nginx.virtualHosts..enableACME), you have the wrong group configured for access. The fix is simple: Set security.acme.certs..group to nginx, or switch to using the builtins. There's a good example too on the manual.

@m1cr0man
Copy link
Contributor Author

Oh, also, you may need to manually set /var/lib/acme/.challenges to "nginx" group afterwards (since now it already exists, it won't get created with the right permissions). Alternatively delete the folder before you fix the config and the module will recreate it with the right permissions :)

@rubenmoor
Copy link

rubenmoor commented Jan 28, 2021

Ah of course, I'm just after reading your config thoroughly. Because you've manually configured your certs through nginx instead of using the builtins (like services.nginx.virtualHosts..enableACME), you have the wrong group configured for access. The fix is simple: Set security.acme.certs..group to nginx, or switch to using the builtins. There's a good example too on the manual.

Following your suggestion, now the folders all have owner/group set to acme/nginx. I get a time-out error now instead of 403.

Also, the configuration option security.acme.certs.<name>.group is listed in the manual. There is no hint, though, that the default isn't sufficient. The example that explicitly uses nginx does not set the option, either.

Instead it says

  # /var/lib/acme/.challenges must be writable by the ACME user
  # and readable by the Nginx user.
  # By default, this is the case.

Also, permissions are set to 754 throughout for all the folders mentioned above.

@rubenmoor
Copy link

Update: the timeout error is based on a different problem in my configuration. During the deployments I have changed the IP at least once, having my domain pointing nowhere. There might have been yet another issue in my security group where I was missing the .name in

        main.deployment.ec2.securityGroups = [ resources.ec2SecurityGroups.serendipityGroup.name ];

By the looks of it, I am stuck with one last issue and about to sort that out myself: I use nginx for ssl certs and my custom server called "backend" to run my app. This custom server is run as a service, thus run by the root user. However, backend does not find the ssl key file: "no keys found", even though I point it to the correct directory and file: /var/lib/acme/staging.rubenmoor.net/key.pem.

@rubenmoor
Copy link

rubenmoor commented Jan 29, 2021

Finally got my setup to work. Thanks for your advice, @m1cr0man .

Indeed, I needed to set security.acme.certs.<name>.group to nginx, given my setup. With this, I too can confirm that this pull request works as intended.

Unrelated, but ultimately, I had configure warp-tls correctly using the chain settings (tlsSettingsChain in Haskell)

@m1cr0man
Copy link
Contributor Author

Also, the configuration option security.acme.certs.<name>.group is listed in the manual. There is no hint, though, that the default isn't sufficient. The example that explicitly uses nginx does not set the option, either.

Glad you got sorted! This error in the docs is fixed in #100356 which hopefully I'll get merged soon.

@primeos
Copy link
Member

primeos commented Feb 10, 2021

FYI: Backport of this PR for NixOS 20.09: #112145 (needs reviewers / not yet merged).

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/trying-to-get-a-lets-encrypt-cert-to-renew-connection-refused/11009/2

@@ -248,13 +272,18 @@ let

# Working directory will be /tmp
script = ''
set -euo pipefail
set -euxo pipefail
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man, is this annoying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How so? I added it because debugging people's issues without the shell commands in the logs is quite difficult. Also so much functionality of cert renewal happens in bash. Personally I've found it quite useful.

Copy link
Member

@hmenke hmenke Mar 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But shouldn't debugging output only be enabled when you're, well, debugging?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well with acme, you know you want to debug only after the fact. besides, the amount of logs this generates is not that big...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment