Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consul: rewrite nixos module #107798

Closed
wants to merge 3 commits into from
Closed

Conversation

cpcloud
Copy link
Contributor

@cpcloud cpcloud commented Dec 28, 2020

Motivation for this change

This is a major simplification of the consul service exposed in nixos.

The primary motivation was to match the systemd unit settings provided by Hashicorp here.

Additionally, this service was previously discovering host IP addresses in an unnecessarily complex way, when Consul supports using go-sockaddr templates in configuration.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 28, 2020

This PR has some backwards incompatible changes to the module's configuration. I am happy to adjust that if necessary.

@aanderse
Copy link
Member

The primary motivation was to match the systemd unit settings provided by Hashicorp here.

Great. Maybe you can submit a PR to actually include that systemd unit in the upstream code repository. If upstream maintains and ships a systemd unit we can utilize this unit via systemd.packages and automatically stay (mostly) in sync with the upstream unit.

@nh2
Copy link
Contributor

nh2 commented Dec 28, 2020

we can utilize this unit via systemd.packages and automatically stay (mostly) in sync with the upstream unit.

Is it explains somewhere how that works, e.g. will it still be overridable in NixOS?

@aanderse
Copy link
Member

I'm not sure. I don't see it in the manual... It is fairly straightforward, though. If pkgs.consul has $out/lib/systemd/system/*.service then these will be included in the resulting system if you specify systemd.packages = [ pkgs.consul ]; You can then specify customization/overrides like so:

systemd.services.consul = {
  restartTriggers = settingsFiles;
  serviceConfig = {
    ExecStart = [ "" "${cfg.package}/bin/consul agent"
      + concatMapStrings (file: " -config-file=${file}") settingsFiles ];
    StateDirectory = "consul";
    ExecStop = mkIf cfg.leaveOnStop "${cfg.package}/bin/consul leave";
  };
};

The idea is to get as much upstream as possible:

  • free review from upstream
  • minimize customization in NixOS module in favor of PRs upstream
  • easy to keep in sync with upstream unit
  • other distros benefit
  • etc...

As a simple example we want StateDirectory = "consul"; so we should propose the addition of this in the upstream unit, instead of the NixOS module.

@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 28, 2020

The primary motivation was to match the systemd unit settings provided by Hashicorp here.

Great. Maybe you can submit a PR to actually include that systemd unit in the upstream code repository. If upstream maintains and ships a systemd unit we can utilize this unit via systemd.packages and automatically stay (mostly) in sync with the upstream unit.

I agree with you here. Is an upstream unit a blocker to get this PR merged? This service hasn't seen any maintenance in NixOS for a while, so I do think there's benefit to getting this in before an upstream unit is merged.

@aanderse
Copy link
Member

I agree with you here. Is an upstream unit a blocker to get this PR merged? This service hasn't seen any maintenance in NixOS for a while, so I do think there's benefit to getting this in before an upstream unit is merged.

Most certainly not, but if you circle back around to that later you get full bonus points ✨
You also might want to consider listing yourself as a maintainer in the NixOS module, if you have interest.

@nh2
Copy link
Contributor

nh2 commented Dec 28, 2020

Thanks for the PR! I will make a bit longer comment on the after work, if I can fit it in today.

Copy link
Contributor

@nh2 nh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consul: rewrite nixos module

This commit message will need some significant expansion:

I know multiple production setups where Consul is a critical infrastructure part that will go down if something in the Consul NixOS module breaks (including my own), so it would be great if every significant change to the module could be clearly outlined in the commit message.

I will try to point at some of them in the in-code comments with Significant enough for changelog entry.

The user-visible changes also need to go into a NixOS release notes entry (the equivalent of nixos/doc/manual/release-notes/rl-2103.xml).

I am also wondering if it makes sense to keep the old module for 1 release, in addition to the new module proposed here (e.g. keep the old one as services.consul_deprecated or so), so that people who use this in production servers have an easier migration path. I don't know if there is something similar in nixpkgs for service modules yet, I'm just quite sure that this will be a risky upgrade for my org at least. (So I'm not requesting this as a change here yet but would like to know what the PR author and other nixpkgs/consul users think.)

+ concatMapStrings (file: " -config-file=${file}") settingsFiles;
ExecReload = "${pkgs.coreutils}/bin/kill --signal HUP $MAINPID";
ExecStop = optionalString cfg.leaveOnStop "${cfg.package}/bin/consul leave";
DynamicUser = cfg.dropPrivileges;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the description of dropPrivileges should now be slightly adjusted to reflect this.

};

config = mkIf cfg.enable (
mkMerge [{

users.users.consul = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Significant enough for changelog entry


systemd.services.consul = {
wantedBy = [ "multi-user.target" ];
after = [ "network.target" ] ++ systemdDevices;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this service was previously discovering host IP addresses in an unnecessarily complex way, when Consul supports using go-sockaddr templates in configuration

It's unclear to me whether go-sockaddr does the equivalent of the removed ++ systemdDevices, does anyone know more?

bindsTo = systemdDevices;
restartTriggers = [ config.environment.etc."consul.json".source ]
++ mapAttrsToList (_: d: d.source)
(filterAttrs (n: _: hasPrefix "consul.d/" n) config.environment.etc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Significant enough for changelog entry (the no-longer-using of consul.d)

In particular, this changes how users of the module need to write modular configs.

For example, a common way how we would define additional checks so far was to create more consul.d JSON files, e.g. like so:

{
  systemd.services.consul.restartTriggers = [
    config.environment.etc."consul.d/nginx-service.json".source
  ];

  environment.etc."consul.d/nginx-service.json" = {
    text = builtins.toJSON {
      service = {
        name = "nginx";
        checks = [
          {
            id = "nginx-status-page";
            name = "nginx status page";
            http = "http://localhost:80/nginx_status";
            interval = "1s";
            timeout = "1s";
          }
        ];
      };
    };
  };
}

I think we need to explain how this is to be translated to the new setup.

I guess there are 2 ways to do it:

  1. cfg.settingsFiles
  2. translating them to checks = [ ] as described in https://www.consul.io/docs/discovery/checks#multiple-check-definitions and putting them into cfg.settings.

Does one of these have more benefits / should we recommend one over the other?

For the second approach, does type = format.type; take care to concatenate lists if set from multiple different NixOS modules?

@@ -40,15 +29,6 @@ in
'';
};


webUi = mkOption {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of simply removing the config. options, I think it would be better to make use of them emit an explicit error message, that explains what to use instead (can be removed in the subsequent NixOS release).

@@ -40,15 +29,6 @@ in
'';
};


webUi = mkOption {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe nixos/tests/consul.nix needs to be adjusted to still evaluate given this change.

@nh2
Copy link
Contributor

nh2 commented Dec 28, 2020

The NixOS tests fail, which is misleading:

The Github action https://github.com/NixOS/nixpkgs/pull/107798/checks?check_run_id=1617881367 says

Success

Attempted: consul

The following builds were skipped because they don't evaluate on x86_64-linux: consul.passthru.tests

So it says success when in fact it wasn't successful.

What can we do against that in general?

From https://logs.nix.ci/?key=nixos/nixpkgs.107798&attempt_id=2ae55234-6480-4039-a2ae-1214c7ec5326 the actual failure output is:

Cannot nix-instantiate `consul.passthru.tests' because:
...
The option `services.consul.extraConfig' does not exist.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/can-we-make-ofborg-show-broken-tests-as-red/10717/1

@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 28, 2020

@nh2 Thanks for your review! I think this PR is probably easier to review and get merged if I split it into a number of smaller ones:

  1. Use the systemd unit provided by upstream (well, the consul website) and move to dynamic user creation.
  2. Replace the ip usage with go-sockaddr templates.
  3. Deprecate extraConfig(Files)? in favor of settings(Files)?
  4. Remove the interface config altogether.
  5. Optionally remove the directory usage (I personally don't mind keeping that)

@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 28, 2020

In light of that I'll close this one out and start on the split.

@cpcloud cpcloud closed this Dec 28, 2020
@cpcloud cpcloud deleted the consul-cleanup branch December 28, 2020 22:45
@aanderse
Copy link
Member

I am also wondering if it makes sense to keep the old module for 1 release, in addition to the new module proposed here (e.g. keep the old one as services.consul_deprecated or so), so that people who use this in production servers have an easier migration path.

Since NixOS is so flexible I don't think there is any value in that. You can just import the module from 20.09 and continue with that.

@nh2
Copy link
Contributor

nh2 commented Dec 28, 2020

Since NixOS is so flexible I don't think there is any value in that. You can just import the module from 20.09 and continue with that.

@aanderse Hmm, I guess that is true. You mean like in here, right?

@nh2
Copy link
Contributor

nh2 commented Dec 28, 2020

I think this PR is probably easier to review and get merged if I split it into a number of smaller ones

@cpcloud Not sure if it's easier/faster, as it'll require more roundtrips, and it won't allow people to see the entire plan at once as a single commit or multiple commits in a single PR would, but I'll try to review any approach you pick.

  1. Use the systemd unit provided by upstream (well, the consul website) and move to dynamic user creation.
  2. Replace the ip usage with go-sockaddr templates.
  3. Deprecate extraConfig(Files)? in favor of settings(Files)?
  4. Remove the interface config altogether.
  5. Optionally remove the directory usage (I personally don't mind keeping that)

Do you want to do the release notes at the end or also add them step by step (just so I know whether I should check for them in the individual PRs)?

@aanderse
Copy link
Member

@nh2 exactly 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants