glusterfs service: add support for TLS communication #27340

Merged

merged 2 commits into NixOS:master on Sep 21, 2017

Conversation

bachp
Member

@bachp bachp commented Jul 12, 2017

Motivation for this change

This makes it easy to set up GlusterFS with TLS support. The user has to provide three files:

  • Private Key
  • Certificate
  • Certificate Authority

There are still some open points on which I would appreciate feedback:

  1. Enabling TLS in GlusterFS works by placing the certificates in /etc/ssl and creating the marker file /var/lib/glusterd/secure-access. The setup is the same for servers and for clients wanting to mount from a secure server, but this PR only activates it for the server part. This means one can mount a TLS-secured volume on a machine where the server is also running. To improve this, the TLS option should probably be moved out of the service config, but I'm not sure how.
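As a rough sketch of that mechanism (a hypothetical preStart fragment mirroring what this PR's module does; `cfg.enableTLS` is the option name at this stage of the PR):

```nix
# Sketch: glusterd is told to use TLS for the management connection
# purely by the presence of a marker file, so the module toggles it
# from the unit's preStart.
systemd.services.glusterd.preStart =
  if cfg.enableTLS then ''
    touch /var/lib/glusterd/secure-access
  '' else ''
    rm -f /var/lib/glusterd/secure-access
  '';
```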
Things done
  • Tested using sandboxing
    (nix.useSandbox on NixOS,
    or option build-use-sandbox in nix.conf
    on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • Linux
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Fits CONTRIBUTING.md.

/cc @nh2


@mention-bot

@bachp, thanks for your PR! By analyzing the history of the files in this pull request, we identified @nh2 to be a potential reviewer.


restartTriggers = [
config.environment.etc."ssl/glusterfs.pem".source
config.environment.etc."ssl/glusterfs.key".source
Member

Users should probably not put the private key in to the nix store.

Member Author

What is a better approach?

Contributor

What is a better approach?

Giving it a path (that's not in the nix store).

For example, nixops has a keys functionality that places secret keys into a special directory that doesn't survive reboots.
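A minimal sketch of that nixops approach, assuming the usual `deployment.keys` attributes (the glusterfs option name in the comment is illustrative):

```nix
# Sketch: nixops places the key under /run/keys (a tmpfs that does not
# survive reboots), so it never enters the nix store.
{
  deployment.keys."glusterfs.key" = {
    keyFile = ./secrets/glusterfs.key;  # local file, read at deploy time
    user = "root";
    permissions = "0400";
  };
  # The module can then be pointed at the non-store path, e.g.:
  # services.glusterfs.tlsSettings.tlsKeyPath = "/run/keys/glusterfs.key";
}
```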

''
else
''
rm -f /var/lib/glusterd/secure-access
Contributor

Deleting stuff in /var isn't something that NixOS modules typically do; this behaviour must definitely be documented in the description of enableTLS below.

I guess it is OK to do it in this case because glusterfs leaves us no other choice: There's no other way to tell it to turn encryption on or off.

Member Author

If we don't want to add the keys to the nix store and don't manipulate the file in /var/lib/glusterd, then this change actually doesn't do much, since everything (putting the keys somewhere, putting the activation file in /var) has to be done outside of nix anyway?

Contributor

I think you misunderstood me: I think this PR is good, and that deleting secure-access here is OK. You just need to document that behaviour in the option description.

Also, having the pubkey in the store is OK, it's just the private key that should be a file path instead.

Member Author

Ok, thanks for clarifying :)

I would imagine that instead of putting the private key into the config, I would just put a path to a file.
What's the mechanism to make sure this gets symlinked to the correct location in /etc? Also via config.environment.etc?

Is there a similar mechanism that could be used to put the file into /var?



};

caCert = mkOption {
default = null;
Contributor

There should be some asserts here that ensure that when enableTLS is true, tlsKey, tlsPem and caCert are not null.

Better yet, make a tlsSettings option of type submodule that can be either null (for TLS being disabled), or a {} with the 3 options inside, each non-nullable.

An example of how to do that can be found here:

https://github.com/nh2/nixops-gluster-example/blob/8a859f0a702efd38bef9a202e1b031f1db6a44d3/modules/nh2-glusterfs-server.nix#L128-L160
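The suggested shape would look roughly like this (a sketch; option names follow the later revisions of this PR):

```nix
# Either tlsSettings is null (TLS disabled), or all three fields
# must be provided -- no per-field defaults.
tlsSettings = mkOption {
  default = null;
  type = types.nullOr (types.submodule {
    options = {
      tlsKeyPath = mkOption { type = types.str; };
      tlsPem     = mkOption { type = types.path; };
      caCert     = mkOption { type = types.path; };
    };
  });
};
```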

@bachp
Member Author

bachp commented Sep 10, 2017

I reworked the PR to make use of submodule as @nh2 suggested.

@joachifm joachifm added 0.kind: enhancement Add something new 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS labels Sep 17, 2017
@joachifm
Contributor

Is this ready?

@bachp
Member Author

bachp commented Sep 17, 2017

@joachifm from my point of view it is. I would appreciate some feedback from @nh2 though.

caCert = mkOption {
default = null;
type = types.path;
description = "Path certificate authority used to signe the cluster certificates.";
Contributor

2 small typos: Path to and signe -> sign.


tlsSettings = mkOption {
description = ''
Make the server communicat via TLS.
Contributor

typo communicate


type = types.nullOr (types.submodule {
options = {
tlsKey = mkOption {
default = null;
Contributor

I think you need to remove this default = null and the other two below.

Either we want that the entire tlsSettings dict is null, or we want that it is not null, and in that case we want that the user provides all 3 of tlsKey, tlsPem and caCert, so those should not be nullable.

environment.etc = mkIf (cfg.tlsSettings != null) {
"ssl/glusterfs.pem".source = cfg.tlsSettings.tlsPem;
"ssl/glusterfs.key".source = cfg.tlsSettings.tlsKey;
"ssl/glusterfs.ca".source = cfg.tlsSettings.caCert;
Contributor

OK, here I'm not quite sure what source exactly does.

Does it just point the symlink /etc/ssl/glusterfs.key -> theGivenPath? Or does it take a local path to a file, put it into the nix store, and create a symlink /etc/ssl/glusterfs.key -> /nix/store/thefile?

The latter would be bad, because that would make the private key world readable in /nix/store.

We definitely need the former, so that e.g. in the case of nixops we can obtain a symlink like /etc/ssl/glusterfs.key -> /var/run/keys/... where the target directory is the keys directory managed by nixops and not part of the store.

Can anybody shed light on this?

Member Author

On my machine the files just end up in the nix store :(

lrwxrwxrwx 1 root root 28 Sep 17 17:33 glusterfs.ca -> /etc/static/ssl/glusterfs.ca
lrwxrwxrwx 1 root root 29 Sep 17 17:33 glusterfs.key -> /etc/static/ssl/glusterfs.key
lrwxrwxrwx 1 root root 29 Sep 17 17:33 glusterfs.pem -> /etc/static/ssl/glusterfs.pem

where:

lrwxrwxrwx 1 root root 50 Jan  1  1970 glusterfs.ca -> /nix/store/n9h1w46qjcbsrqpr38i7n6vdqb8cfj16-ca.pem
lrwxrwxrwx 1 root root 61 Jan  1  1970 glusterfs.key -> /nix/store/8skzrsdyx46kjzpz8m5kvj6y9x4qhqyy-cleopatra-key.pem
lrwxrwxrwx 1 root root 57 Jan  1  1970 glusterfs.pem -> /nix/store/54y38wlzipf142mxzki2qh3sz4hp4zxz-cleopatra.pem

I'm not sure how to do this properly?

Contributor

Is there a particular reason for requiring the use of environment.etc? Compared to, say, using preStart or a dedicated unit for this purpose?

Member Author

The original entry I have in my configuration.nix:

  services.glusterfs = {
    enable = true;
    tlsSettings = {
      tlsKey = /home/pascal/ca/cleopatra-key.pem;
      tlsPem = /home/pascal/ca/cleopatra.pem;
      caCert = /home/pascal/ca/ca.pem; 
    };
  };

Contributor

What if you enclose the paths in quotes, so they're treated as string literals and not paths (which will be added to the store on eval)?
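That is, the same configuration with string literals instead of path literals (a sketch of the suggestion above):

```nix
services.glusterfs = {
  enable = true;
  tlsSettings = {
    # Strings are not copied into the store on evaluation,
    # unlike the bare path literals used before.
    tlsKey = "/home/pascal/ca/cleopatra-key.pem";
    tlsPem = "/home/pascal/ca/cleopatra.pem";
    caCert = "/home/pascal/ca/ca.pem";
  };
};
```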

Contributor

(My brief reading of the etc builder code suggests it'll take whatever you give it, so long as it looks like a filepath).
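If that reading is right, the two behaviours would be (a sketch, assuming `environment.etc` accepts a plain string path):

```nix
# String source: /etc/ssl/glusterfs.key becomes a symlink straight to
# the given path, which stays outside the store (what we want for keys).
environment.etc."ssl/glusterfs.key".source = "/run/keys/glusterfs.key";

# Path-literal source: the file is copied into /nix/store on evaluation
# and the symlink points there -- world-readable, bad for a private key.
# environment.etc."ssl/glusterfs.key".source = /home/user/glusterfs.key;
```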

@@ -70,11 +128,12 @@ in
PIDFile="/run/glusterd.pid";
LimitNOFILE=65536;
ExecStart="${glusterfs}/sbin/glusterd -p /run/glusterd.pid --log-level=${cfg.logLevel} ${toString cfg.extraFlags}";
KillMode="process";
KillMode="control-group";
Contributor

I would not make this change part of this commit / PR, but do that separately.

I have a branch nh2-glusterfs-service-improvements that includes this commit to handle this.

My plan is that after your PR here has landed, we can merge the remaining improvements from my branch, and then do the gluster 3.12 upgrade (#29062).

Member Author

Good point I will remove this.

Contributor

My plan is that after your PR here has landed, we can merge the remaining improvements from my branch

I've PR'd the changes here now: #29868

@nh2
Contributor

nh2 commented Sep 17, 2017

I've made a couple of comments / requests; the most important one is to figure out whether or not the current approach puts the private key into the nix store. If yes, we have to find a different way.

@nh2
Contributor

nh2 commented Sep 17, 2017

Also note for myself: Once we've figured it out, I need to update my https://github.com/nh2/nixops-gluster-example/blob/8a859f0a702efd38bef9a202e1b031f1db6a44d3/modules/glusterfs-SSL-setup.nix#L87 which definitely has it in /nix/store, which is bad.

default = null;
type = types.path;
type = types.str;
Contributor

This does not protect against accidentally copying things into the store. It still accepts path literals, as in /my/secret.key which will be copied on eval.

TLS settings are implemented as submodule.
@bachp
Member Author

bachp commented Sep 17, 2017

@nh2 Thanks for your feedback. All comments are addressed.

@joachifm
Contributor

joachifm commented Sep 17, 2017

Changing the option type won't protect against accidentally copying files into the Nix store.

@bachp
Member Author

bachp commented Sep 17, 2017

@joachifm What do you mean by that? That like this I'm not able to pass a path to the option?

@joachifm
Contributor

joachifm commented Sep 17, 2017

Two things: first, any value accepted by types.path is also accepted by types.str (the latter is a superset of the former). Second, Nix copies stuff into the store as a side effect of evaluating path literals. This eval is different from module eval and may occur before option type checking, so you could still end up copying into the Nix store even if the module type check eventually fails.
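The difference can be seen directly in `nix repl` (a sketch; the store hash is illustrative):

```nix
# A path literal is copied into the store when it is coerced to a
# string (e.g. by interpolation):
"${/etc/hostname}"   # => "/nix/store/<hash>-hostname" -- file was copied
# A plain string is inert; nothing is copied:
"/etc/hostname"      # => "/etc/hostname"
```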

@joachifm
Contributor

What you really want is a Nix-level mechanism for preventing copies; I don't think you can easily achieve this via module options.

@joachifm
Contributor

See https://github.com/NixOS/nixpkgs/blob/master/lib/types.nix#L167 for the definition of the types.path, you'll note it's just a string that begins with '/'.

@bachp
Member Author

bachp commented Sep 17, 2017

@joachifm Just to clarify if:

  1. tlsKeyPath = /path/key.pem;
    a. copies the key to nix store if types.path
    b. gives an error of type mismatch if types.str;
  2. tlsKeyPath = "/path/key.pem"; creates a symlink from /etc/static/ssl/glusterfs.key -> /path/key.pem for both cases

You are saying in that case 1b fails but still has copied the key to the store?

Is case 2 always "safe" if I give it a string? And 1 is always "unsafe"?

@joachifm
Contributor

By my analysis, 1 is always unsafe, and 2 is always safe. The copying that occurs in 1 is unrelated to the declared option type.

@joachifm
Contributor

Well, I should say, safe in terms of what the Nix evaluator will do, the module could of course copy stuff in a different way (including it in writeText or what have you).

@joachifm
Contributor

joachifm commented Sep 17, 2017

To clarify, the copying I'm worried about is what occurs when Nix evaluates /foo/bar/baz versus what happens when it evaluates "/foo/bar/baz".

@bachp
Member Author

bachp commented Sep 17, 2017

@joachifm @nh2 So from what I hear, I'd say we can drop 40e122e again, as it doesn't add any safety.

@joachifm
Contributor

@bachp yes, if safety is the only motivation for the change it can be dropped IMO.

@joachifm
Contributor

joachifm commented Sep 17, 2017

I think the best you can do right now is warn/fail if a keyfile path points into the Nix store, as that indicates accidental capture. The key will have leaked but at least the user knows. Other than that I imagine you need a Nix level mechanism to really fix this (unless I'm taking crazy pills or overlooking something, which is certainly possible).
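Such a check could be sketched as a module assertion (hypothetical; `tlsKeyPath` as in this PR, `lib` in scope):

```nix
# Fail evaluation if the configured key path points into the store,
# which would mean the key has already been captured world-readably.
assertions = lib.optional (cfg.tlsSettings != null) {
  assertion = !lib.hasPrefix builtins.storeDir cfg.tlsSettings.tlsKeyPath;
  message = ''
    services.glusterfs: tlsKeyPath points into the Nix store;
    the private key would be world-readable.
  '';
};
```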

@bachp
Member Author

bachp commented Sep 17, 2017

I dropped the second commit.

@nh2
Contributor

nh2 commented Sep 18, 2017

@bachp I think we should keep the commit.

There's an important distinction here:

If you use nixops, then as you describe in #27340 (comment), method (1) copies the value into your local nix store. But afterwards, the evaluation will fail, and nixops will not get a chance to copy the world-readable key into the nix store of the remote machine(s).

That is much better, because that way the key leakage is confined to your own machine (the one you run nixops on), where you can easily undo it by deleting the built store path, and does not get deployed into the world.

And, as @joachifm says, it will ensure that the user knows that they messed up (even though they still have to manually remove the key from their store).

@joachifm joachifm added this to the 17.09 milestone Sep 20, 2017
This prevents the user from accidentally committing the key to the nix store
when providing a path instead of a string.
@bachp
Member Author

bachp commented Sep 21, 2017

@joachifm @nh2 I re-added the commit. I think it is good to go now.

@joachifm joachifm merged commit c913f71 into NixOS:master Sep 21, 2017
@joachifm
Contributor

Thank you

@nh2
Contributor

nh2 commented Sep 21, 2017

Thanks!

@nh2
Contributor

nh2 commented Oct 4, 2017

Posting a comment here that I had before in my custom config file:

Note: If the /var/lib/glusterd/secure-access thing breaks in the future and we get

error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number

in the server log, then that means that nixpkgs has switched the prefix localstatedir=/var to something different; the secure-access file must be under that prefix.

Just in case somebody needs to google that error message in the future.

nh2 added a commit to nh2/nixops-gluster-example that referenced this pull request Oct 16, 2017
So far, after a full reboot of all machines, some would sometimes
have failed systemd units.

Key changes:

* A mount-only machine is added to test that this use case works.
  This made me find all the below troubles.

* Fix SSH hang by using .mount unit instead of fstab converter.
  This apparently works around NixOS/nixpkgs#30348 for me.
  No idea why the fstab converter would have this problem.
  The nasty
    pam_systemd(sshd:session): Failed to create session: Connection timed out
  error would slow down SSH logins by 25 seconds, also making reboots slower
  (because nixops keys upload uses SSH).
  It would also show things like `session-1.scope` as failed in systemctl.

* More robustly track (via Consul) whether the Gluster volume is already
  mountable from the client (that is, up and running on the servers).
  This has come a long way; to implement this, I've tried now
  * manual sessions, but those have 10 second min TTL which gets auto-extended
    even longer when rebooting, so I tried
  * script checks, which don't kill the subprocess even when you give a
    `timeout` and don't allow to set a TTL, so I tried
  * TTL checks + manual update script, and not even those set the check to
    failed when the TTL expires
  See my filed Consul bugs:
  * hashicorp/consul#3569
  * hashicorp/consul#3563
  * hashicorp/consul#3565
  So I am using a more specific workaround now:
  A TTL check + manual update script, AND a script
  (`consul-scripting-helper.py waitUntilService --wait-for-index-change`)
  run by a service (`glusterReadyForClientMount.service`)
  that waits until the TTL of a check for the service is observed
  to be bumped at least once during the life-time of the script.
  When the script observes a TTL bump, we can be sure that at least
  one of the gluster servers has its volume up.

* `gluster volume status VOLUME_NAME detail | grep "^Online.*Y"` is used
  to check whether the volume is actually up.

* Using consul's DNS feature to automatically pick an available server
  for the mount.
  dnsmasq is used to forward DNS queries to the *.consul domain
  to the consul agent.
  `allow_stale = false` is used to ensure that the DNS queries
  are not outdated.

* Create `/etc/ssl/dhparam.pem` to avoid spurious Gluster warnings
  (see https://bugzilla.redhat.com/show_bug.cgi?id=1398237).

* `consul-scripting-helper.py` received some fixes and extra loops
  to retry when Consul is down.

This commit also switches to using `services.glusterfs.tlsSettings`
as implemented in NixOS/nixpkgs#27340
which revealed a lot of the above issues.
nh2 added a commit to nh2/nixops-gluster-example that referenced this pull request Oct 22, 2017
nh2 added a commit to nh2/nixops-gluster-example that referenced this pull request Oct 22, 2017