Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/acme: adjust renewal timer options #80856

Merged
merged 1 commit into from Mar 3, 2020
Merged

Conversation

emilazy
Copy link
Member

@emilazy emilazy commented Feb 23, 2020

Motivation for this change

The current weekly setting causes every NixOS server to try to renew its certificate at midnight on the dot on Monday. This contributes to the general problem of periodic load spikes for Let's Encrypt; NixOS is probably not a major contributor to that problem, but we can lead by example by picking good defaults here.

The values here were chosen after consulting with @yuriks, an SRE at Let's Encrypt:

  • Randomize the time certificates are renewed within a 24 hour period.

  • Check for renewal every 24 hours, to ensure the certificate is always renewed before an expiry notice is sent out.

  • Increase the AccuracySec (thus lowering the accuracy(!)), so that systemd can coalesce the renewal with other timers being run.

    (You might be worried that this would defeat the purpose of the time skewing, but systemd is documented as avoiding this by picking a random time.)

For more information justifying these changes, see the following documents:

I also added trivial commits to expose the --must-staple option and allow users to specify additional custom renewal options, as free goodies to encourage merging :p

I've tested these changes on my server, and everything seems fine, but I'm just one person; please do look over this carefully!

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@emilazy
Copy link
Member Author

emilazy commented Feb 23, 2020

Confirmed that this still passes the acme tests for me.

@aanderse
Copy link
Member

aanderse commented Feb 23, 2020

@emilazy I like how the first commit is nice and clean, simple, documented, and will be easy to test + validate. I like how in your backport you only include this commit. Thanks for doing the research and reaching out to the Let's Encrypt team!

I think it will be pretty easy to have the collective of relevant committers agree to merge the first commit. The other 2 commits might require some more discussion, and could potentially hold this PR up, which would hold up the backport PR. Given the stage we're at (with regards to the 20.03 release) I might recommend splitting this PR up. Just a recommendation.

@m1cr0man
Copy link
Contributor

The first two look good to me! I probably need this on my own setup with the 28 certs I'm trying to validate there, to avoid rate limits 😛

As for the additionalLegoRenewFlags, I actually had something similar (extraFlags) before I made the PR for lego (#77578, actually it was here) . I removed it because I felt it was too solution specific, and if we were to change acme clients again, this flag would have to be deprecated. With the other options like mustStaple and dnsPropagationCheck, at least if the next client has those features we can rewrite the arguments, but if people have random extra arguments then we can only break their configuration.

I am not against the idea either, for power users there is certainly desire to allow for custom flags to the service, but they run the above risk of it breaking some day. Would love to hear some other opinions on the matter. 🙂

Copy link
Member

@aanderse aanderse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think these changes are good 👍 My comment on code complexity here isn't really because of your code, but I think here and now might be the time to consider if we can simplify the command line argument logic at all... if anyone is up to the challenge, or has any great idea.

nixos/modules/security/acme.nix Outdated Show resolved Hide resolved
runOpts = escapeShellArgs (globalOpts ++ [ "run" ] ++ certOpts);
renewOpts = escapeShellArgs (globalOpts ++
[ "renew" "--days" (toString cfg.validMinDays) ] ++
certOpts ++ data.additionalLegoRenewFlags);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

globalOpts, certOpts, runOpts, and renewOpts - each of which is determined by a load of module options. I worry the code is becoming too complicated to properly test against all potential regressions on future changes.

Anyone have any ideas to simplify this, all while adding the flexibility @emilazy has added here? 🤷‍♂️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent an unreasonable amount of time thinking about this already. There's just a lot of different ways to configure your certs (multiple places for email, implicit selection between dns and http for example). Maybe using some of the lib.optional* functions would help clean it up a bit. The fact that you need 2 separate strings for "run" and "renew" is a bit annoying too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really have good ideas here, other than for lego to support proper configuration files itself or something. I currently have to do

{
  # TODO: DNS validation
  security.acme = {
    email = "...";
    acceptTerms = true;
    certs = let cfg = {
      user = "nginx";
      group = "nginx";
      mustStaple = true;
    }; in {
      "emily.moe" = cfg // { extraDomains = { "patchouli.emily.moe" = null; }; };
      "mew.build" = cfg;
    };
  };
}

to share options between my certificates, so the current configuration model certainly isn't ideal.

@emilazy
Copy link
Member Author

emilazy commented Feb 23, 2020

Updated to split out the Must-Staple/extra flags commits into #80900, and also divide the 24h AccuracySec by the number of certificates, so that if you do have, say, 28 certificates, the renew requests don't all get coalesced into the same time.

nervously eyes https://github.com/NixOS/nixpkgs/pulls/emilazy, waiting for it to grow pagination...

@emilazy
Copy link
Member Author

emilazy commented Feb 23, 2020

An explanation for the AccuracySec change, in case you find the timers as confusing as I do:

17:35 <m1cr0m4n> Hey emilazy are you about? Just wanted to ask about
the maths on AccuracySec. If I understand this correctly, does lowering
the value for more certs not mean that they will run closer to the
same time, rather than more spread out throughout a day?
17:35 <emily> m1cr0m4n: no: the larger AccuracySec is, the larger
the allowed skew
17:35 <emily> it's a really confusing name, but check the systemd.timer
manpage
17:35 <emily> AccuracySec = 1s means "the time must be accurate to
within 1s of what is specified"
17:36 <emily> AccuracySec = 24h means "the time can vary within a
24h period of what's determined"
17:36 <m1cr0m4n> Right but, in this situation do you actually want
to make it more accurate? I don't fully understand why you would
17:36 <emily> so if you have 3 certs, it'll pick 3 random times of
the day, and then coalesce them within 8h periods
17:36 <emily> m1cr0m4n: let's say you have 100 certificates that all
expire around the same time: hammering let's encrypt for all 100 of
them at once is bad
17:36 <emily> doing it 100 separate times throughout the day is better
17:37 <emily> they'll still potentially coalesce statistically due
to the random placement
17:37 <emily> but basically we want to coalesce with other periodic
timers for power management reasons, but not coalesce with other acme
renewals for load management reasons
17:37 <emily> (yuriks suggested this approach fwiw)
17:37 <m1cr0m4n> yes I agree! I just don't get how that is happening
here. The math is diving 24h by the number of certs, so you're gonna
have like 24h/100 = < an 15 mins AccuracySec between certs.
17:38 <emily> so, the cert renewal checks are initially set to run
"daily", which means at midnight. If we didn't do any AccuracySec
or skewing, it would renew all 100 at midnight. adding the 24h skew
means that all 100 will be renewed at random times throughout the day
17:38 <emily> adding AccuracySec = 24h means that it'll notice
that all of these renewal timers are within the same 24h period,
and coalesce them
17:38 <emily> thus defeating the skew (it's still skewed relative to
other NixOS systems, but the certs themselves don't get skewed)
17:39 <emily> dividing 24h by the number of certs means that we
"bucket" them approximately according to the number of certs
17:40 <m1cr0m4n> Ok yeah..I think I understand :P Sorry, very confusing
time maths
17:41 <emily> I totally agree
17:41 <emily> mind if I post this log to the PR to help others?
17:41 <m1cr0m4n> Fire away! :)

Copy link
Contributor

@m1cr0man m1cr0man left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. This is one heck of a module now, I'd love to try my hand at simplifying it at some point.

@arianvp
Copy link
Member

arianvp commented Feb 24, 2020

At least we have a good chunk of people who are interested in this module now. I'm happy to set up some brainstorming session (Perhaps we can create an IRC channel; #nixos-acme) to think how we can tame this code a bit.

Copy link
Member

@arianvp arianvp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@arianvp
Copy link
Member

arianvp commented Feb 24, 2020 via email

nixos/modules/security/acme.nix Outdated Show resolved Hide resolved
nixos/modules/security/acme.nix Outdated Show resolved Hide resolved
The current weekly setting causes every NixOS server to try to renew
its certificate at midnight on the dot on Monday. This contributes to
the general problem of periodic load spikes for Let's Encrypt; NixOS
is probably not a major contributor to that problem, but we can lead by
example by picking good defaults here.

The values here were chosen after consulting with @yuriks, an SRE at
Let's Encrypt:

* Randomize the time certificates are renewed within a 24 hour period.

* Check for renewal every 24 hours, to ensure the certificate is always
  renewed before an expiry notice is sent out.

* Increase the AccuracySec (thus lowering the accuracy(!)), so that
  systemd can coalesce the renewal with other timers being run.

  (You might be worried that this would defeat the purpose of the time
  skewing, but systemd is documented as avoiding this by picking a
  random time.)
@lukateras
Copy link
Member

@GrahamcOfBorg build nixosTests.acme

@lukateras
Copy link
Member

Passes nixosTests.acme and works as expected locally! :)

@aanderse Now that numCerts == 0 condition is out, is this good to merge?

Copy link
Member

@aanderse aanderse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yegortimoshenko yeah, I'm a fan of this change 👍 I also think we should merge the backport, so long as the release managers agree.

EDIT: has someone addressed the test failure? I thought there was a conversation about that...

@lukateras
Copy link
Member

lukateras commented Feb 29, 2020

EDIT: has someone addressed the test failure? I thought there was a conversation about that...

Couldn't get it to fail locally; https://github.com/NixOS/nixpkgs/pull/80856/checks?check_run_id=476781247 also passes. https://github.com/NixOS/nixpkgs/pull/80856/checks?check_run_id=476784925 (another attempt) does fail. Looks like a race condition? Don't think that it's introduced by this PR.

@aanderse
Copy link
Member

I thought @m1cr0man and @arianvp had some insight into the failed test... or maybe it was @flokli?

@worldofpeace
Copy link
Contributor

This is a very nice change with well substantiated evidence for improvement. Backport is fine with me.

@flokli
Copy link
Contributor

flokli commented Mar 2, 2020

I thought @m1cr0man and @arianvp had some insight into the failed test... or maybe it was @flokli?

No, I mostly grumbled about the acme tests (in nixpkgs) not being very understandable currently, and requiring a bit more love - no knowledge about why they might or might not be flaky.

@lukateras lukateras merged commit 31aefc7 into NixOS:master Mar 3, 2020
@emilazy emilazy deleted the adjust-acme branch March 3, 2020 01:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants