Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/acme: Backport account missing fixes #107769

Merged
merged 4 commits into from Dec 28, 2020

Conversation

m1cr0man
Copy link
Contributor

Motivation for this change

Backport of #101482

It was requested by at least one user, and the module is still causing issues for users so this will probably not be the last backport of acme-related changes.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

This means that all systems running from master will trigger
new certificate creation on next rebuild. Race conditions around
multiple account creation are fixed in NixOS#106857, not this commit.

(cherry picked from commit e312039)
Copy link
Contributor

@nh2 nh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert in ACME, but from #101482 (comment) I can confirm that applying these backports fixed the problem for me on 20.09.

@flokli
Copy link
Contributor

flokli commented Dec 28, 2020

@m1cr0man can you cherry-pick -x f71e439 into here?

@m1cr0man
Copy link
Contributor Author

Ah, will do. I was quick to just commit the suggested changes. Will revert that for commit consistency.

(cherry picked from commit f71e439)
@Ma27
Copy link
Member

Ma27 commented Jan 2, 2021

Okay, this broke my ACME again with the following error:

acme-<redacted>-start[1582]: 2021/01/02 11:55:15 Account <redacted> is not registered. Use 'run' to register a new account.

Fix was - again - to recreate /var/lib/acme and restart all affected services.

@nh2
Copy link
Contributor

nh2 commented Jan 2, 2021

Okay, this broke my ACME again with the following error:

Just for the record, that's the same error messgae which this PR fixed for me.

@andir
Copy link
Member

andir commented Jan 2, 2021

I can second @Ma27's result.. It just broke for me and when switching system generations it just hangs there :/

A dependency job for acme-finished-<domain1>.target failed. See 'journalctl -xe' for details.
A dependency job for acme-finished-<domain2>.target failed. See 'journalctl -xe' for details.

@nh2
Copy link
Contributor

nh2 commented Jan 2, 2021

@andir Does @Ma27's fix also fix it for you?

@m1cr0man Do you know if it's possible to make it such that we can somehow handle this case automatically? It's unfortunate that what fixes some system breaks it for others (at least once) -- and I imagine that even undoing the backport won't help, as people will still get the problem in the next NixOS release.

@m1cr0man
Copy link
Contributor Author

m1cr0man commented Jan 3, 2021

Okay, this broke my ACME again with the following error:

acme-<redacted>-start[1582]: 2021/01/02 11:55:15 Account <redacted> is not registered. Use 'run' to register a new account.

Fix was - again - to recreate /var/lib/acme and restart all affected services.

I don't have a good guess as to how this is occurring. It could be an instance of the account creation race conditions I suppose (do you have multiple certs with the same account email?). The code in this PR explicitly checks if your account registration files exist - so unless they got deleted between the conditional and lego running then it should just be working.

Hmm. I wonder if lego is creating files in the accounts folder even if it fails to create an account? This could happen if your very first renewal fails. Could you take a quick look at the accounts directory for the affected cert (if you still have it)?

@m1cr0man Do you know if it's possible to make it such that we can somehow handle this case automatically? It's unfortunate that what fixes some system breaks it for others (at least once) -- and I imagine that even undoing the backport won't help, as people will still get the problem in the next NixOS release.

I'm hopeful that #106857 will cover these cases, but failing that I don't have a root cause for this issue right now, so I can't fix it automatically Blowing away /var/lib/acme or any cert related files automatically would not be a sensible thing to do.

@pjones
Copy link
Contributor

pjones commented Jan 9, 2021

Came here to report yet another ACME bug. After I finally had it all working upgrading to the latest NIxOS 20.09 broke my certs again!

Jan 09 23:12:52 kilgrave systemd[1]: Starting Renew ACME certificate for [redacted]...
Jan 09 23:12:53 kilgrave acme-[redacted]-start[590]: 2021/01/09 23:12:53 Account [redacted] is not registered. Use 'run' to register a new account.

Is there anything I can provide from the logs or whatnot that will help?

@m1cr0man
Copy link
Contributor Author

m1cr0man commented Jan 9, 2021

Came here to report yet another ACME bug. After I finally had it all working upgrading to the latest NIxOS 20.09 broke my certs again!

Jan 09 23:12:52 kilgrave systemd[1]: Starting Renew ACME certificate for [redacted]...
Jan 09 23:12:53 kilgrave acme-[redacted]-start[590]: 2021/01/09 23:12:53 Account [redacted] is not registered. Use 'run' to register a new account.

Is there anything I can provide from the logs or whatnot that will help?

Yes, can you please list the contents of /var/lib/acme/.lego/accounts/${accountHash}/ and subdirectories for the failing cert? You can get the path by doing systemctl cat acme-$cert.service | grep accounts. Also can you please state if you have multiple certs configured on your system?

@pjones
Copy link
Contributor

pjones commented Jan 9, 2021

[root@kilgrave:~]# ls -lR /var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a
/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a:
total 4
drwx------ 3 acme wwwrun 4096 Jan  9 23:12 acme-v02.api.letsencrypt.org

/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a/acme-v02.api.letsencrypt.org:
total 4
drwx------ 3 acme wwwrun 4096 Jan  9 23:12 domains@redacted

/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a/acme-v02.api.letsencrypt.org/domains@redacted:
total 8
-rwx------ 1 acme wwwrun  215 Jan  9 23:12 account.json
drwx------ 2 acme wwwrun 4096 Jan  9 23:12 keys

/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a/acme-v02.api.letsencrypt.org/domains@redacted/keys:
total 4
-rwx------ 1 acme wwwrun 227 Jan  9 23:12 domains@redacted.key

This machine has two certificates for two different domains using the same email address. On this machine I also use DNS to verify with ACME and not HTTP.

@m1cr0man
Copy link
Contributor Author

I half expect this to happen with multiple accounts using the same name. Since the check for accounts is a basic "ls" of the accounts folder, if multiple renews run at once (with the same account config) then one of them is bound to create something in the folder and make the other think the account already exists, even if it doesn't really. You might think a more advanced check for the account.json and keys would fix it but still you run into race conditions which have been recorded plenty of times in other tickets. The real solution is the "accounts" targets created in #106857

If you run the failed renew service again, does it work?

@pjones
Copy link
Contributor

pjones commented Jan 10, 2021

@m1cr0man It looks like running the service again fixes the issue. I walked away from my computer for a bit and when I returned the services were fine. The normal timer must have kicked them off and resolved the issue.

I'd like to offer whatever help I can provide to get this all nailed down. For my own sanity I really need updates from NixOS 20.09 to be stable. I'm happy to write tests, help work on the ACME module, or whatever else needs to be done.

@m1cr0man
Copy link
Contributor Author

Honestly? I don't think there's much to worry about. The test suite is very comprehensive and in #106857 I have added a race condition test, which I'm now certain was the cause of your issue. At this point I fully plan to back port that PR too, so that the likes of yourself can benefit from it. I would appreciate a 👍 on that PR if you can :)

@pjones
Copy link
Contributor

pjones commented Jan 11, 2021

@m1cr0man

I figured that since this ACME issue was annoying but harmless I could deploy to my production servers. Of course that was a mistake.

On a server with a single certificate I'm getting this:

Jan 11 16:39:50 moriarty systemd[1]: Starting Renew ACME certificate for redacted.com...
Jan 11 16:39:51 moriarty acme-redacted.com-start[5092]: 2021/01/11 16:39:51 [INFO] [redacted.com] acme: Trying renewal with 727 hours remaining
Jan 11 16:39:51 moriarty acme-redacted.com-start[5092]: 2021/01/11 16:39:51 [INFO] [redacted.com] acme: Obtaining bundled SAN certificate
Jan 11 16:39:51 moriarty acme-redacted.com-start[5092]: 2021/01/11 16:39:51 acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: JWS verification error, url:
Jan 11 16:39:51 moriarty systemd[1]: acme-redacted.com.service: Main process exited, code=exited, status=1/FAILURE

On servers that have more than one cert they are failing with the "account is not registered" issue. Trying to run them one at a time does not resolve the error.

What info do you need to help me get this sorted out?

@m1cr0man
Copy link
Contributor Author

JWS verification error is being tracked in #101445, and I just discovered a graceful fix (see here). Please check that your account key is not corrupt first, and that its modify time is close to the account.json, as it will help us investigate this issue further.

As for the not registered error - if you are able to reproduce it (aka run one of the services and still get the same error) then please send me the contents of the same folders as last time.

@pjones
Copy link
Contributor

pjones commented Jan 11, 2021

@m1cr0man

On the server with multiple certificates, trying to run a single renewal fails with the "account not registered error". This is what the account directory looks like for that service:

$ sudo ls -lR /var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a
/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a:
total 4
drwx------ 3 acme wwwrun 4096 Jan 11 15:46 acme-v02.api.letsencrypt.org

/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a/acme-v02.api.letsencrypt.org:
total 4
drwx------ 3 acme wwwrun 4096 Jan 11 15:46 domains@redacted

/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a/acme-v02.api.letsencrypt.org/domains@redacted:
total 4
drwx------ 2 acme wwwrun 4096 Jan 11 15:46 keys

/var/lib/acme/.lego/accounts/976a0a17bb6f6b07707a/acme-v02.api.letsencrypt.org/domains@redacted/keys:
total 4
-rw-r----- 1 acme wwwrun 227 Jan 11 15:46 domains@redacted.key

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants