
nixos/ceph: run unprivileged, use state directories, handle non-initialized clusters without config switch #72603

Merged
merged 4 commits into NixOS:master on Nov 11, 2019

Conversation

flokli (Contributor) commented Nov 2, 2019

Motivation for this change
nixos/ceph: create /etc/ceph and /var/lib/ceph via tmpfiles

We seem to be relying on those being present during runtime anyways.
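
Concretely, this amounts to tmpfiles rules along these lines (a sketch of the idea as a module fragment; the exact mode and age fields may differ in the merged code):

```nix
{
  systemd.tmpfiles.rules = [
    # Create the top-level config and state directories at boot if they
    # don't exist yet, owned by the ceph user and group.
    "d /etc/ceph - ceph ceph - -"
    "d /var/lib/ceph - ceph ceph - -"
  ];
}
```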

nixos/ceph: run unprivileged, use StateDirectory and tmpfiles, don't pass extraServiceConfig

Don't pass user and group to ceph, relying on it to drop privileges itself;
instead, let systemd handle running it as the appropriate user.

Use StateDirectory to create directories in
/var/lib/ceph/${daemonType}/${clusterName}-${daemonId}.

There previously was a condition on daemonType being one of mds, mon, rgw
or mgr. We only instantiate makeServices with these types, and "osd" was
special.

In the osd case, test examples suggest it'd be in something like
/var/lib/ceph/osd/ceph-${cfg.osd0.name} - so it's not special at all,
but exactly the same. Worst case, we'd end up with a directory we're
simply not using.

Unfortunately, that's not enough, as some of these directories already
need to exist during initialization, so we use some tmpfiles rules to
ensure they already exist on boot, too.

After that, however, we can remove all the manual steps needed to create
these directories.

This also inlines the extraServiceConfig into the makeService function,
as we have conditionals depending on daemonType there anyways.
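
Roughly, the resulting service generation looks like this. This is a sketch under assumptions, not the merged code: the helper name comes from the text above, `lib` is the module's nixpkgs lib, the `-f`/`--cluster`/`--id` flags are the usual ceph daemon flags, and the osd ExecStartPre path is hypothetical:

```nix
makeService = daemonType: daemonId: clusterName: ceph: {
  serviceConfig = {
    # Run unprivileged: systemd starts the daemon as the ceph user
    # instead of the daemon dropping privileges itself.
    User = "ceph";
    Group = "ceph";
    # Relative to /var/lib, so systemd creates and owns
    # /var/lib/ceph/${daemonType}/${clusterName}-${daemonId} on start.
    StateDirectory = "ceph/${daemonType}/${clusterName}-${daemonId}";
    ExecStart = "${ceph}/bin/ceph-${daemonType} -f --cluster ${clusterName} --id ${daemonId}";
  } // lib.optionalAttrs (daemonType == "osd") {
    # osd-specific bits that extraServiceConfig used to inject,
    # now an inline conditional on daemonType (path hypothetical).
    ExecStartPre = "${ceph}/libexec/ceph/ceph-osd-prestart.sh --cluster ${clusterName} --id ${daemonId}";
  };
};
```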
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.
Notify maintainers

cc @

lejonet (Contributor) left a comment

I don't see anything wrong with these changes. I've been meaning to refactor the module along these lines, as the module in its initial shape was "just" a mapping from the official systemd .service files.
This takes the module a step closer to being well integrated into both systemd and NixOS.

Comment on lines 409 to 410
++ (map (daemonName: "d /var/lib/ceph/mgr/${cfg.global.clusterName}-${daemonName} - ceph ceph - -") cfg.mgr.daemons)
++ (map (daemonName: "d /var/lib/ceph/osd/${cfg.global.clusterName}-${daemonName} - ceph ceph - -") cfg.osd.daemons);
lejonet (Contributor)
You're missing a map for the mon daemons; they follow the same pattern as the other daemons.
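
i.e., following the exact pattern of the two lines above, something like:

```nix
++ (map (daemonName: "d /var/lib/ceph/mon/${cfg.global.clusterName}-${daemonName} - ceph ceph - -") cfg.mon.daemons)
```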

flokli (Contributor, Author)

These are somewhat redundant with the StateDirectory= directives in the individual units; systemd creates those directories when starting the units.

However, as some work needs to be done to initialize the cluster before these units can be started, just relying on StateDirectory isn't enough, as the initialization tools would fail otherwise.

I didn't want to duplicate everything from StateDirectory - happy for better suggestions w.r.t. cluster initialization.
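
For example, the mgr rule above mirrors what the unit itself already declares via StateDirectory= (a sketch; the exact unit name is an assumption):

```nix
{
  systemd.services."ceph-mgr-${daemonName}".serviceConfig = {
    # Relative to /var/lib: systemd creates and chowns
    # /var/lib/ceph/mgr/${cfg.global.clusterName}-${daemonName},
    # but only when the unit actually starts.
    StateDirectory = "ceph/mgr/${cfg.global.clusterName}-${daemonName}";
  };
}
```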

lejonet (Contributor)

Well, the mon daemons are the most important daemons in the cluster. They are the first that need to be initialized in a cluster, so I think that we sadly need to duplicate everything from StateDirectory in tmpfiles too.

flokli (Contributor, Author) commented Nov 8, 2019

@lejonet you're fine with merging this in?

srhb (Contributor) commented Nov 8, 2019

Does this mean any user-facing changes if, say, the directories are nuked and the daemon restarted? In particular, I'd like to know whether recreation is now tied to activation rather than unit restart.

flokli (Contributor, Author) commented Nov 8, 2019

Both the StateDirectory= statement and the tmpfiles rules take care of creating the directory if it doesn't exist.

StateDirectory= takes care of the directories being present on a (re)start - I only created the tmpfiles rules to support the initialization phase, so the directories already exist during initialization, without the services being started.

The initialization process currently is a very manual thing, and apparently only "documented" by the nixos vm tests. You first switch to a generation with ceph services available but disabled, do the initialization, start the services, then switch to a generation autostarting them. We might want to change this in the future, to provide a better out-of-the-box experience.

One option would be to have generators producing individual ceph-osd services that do the initialization and startup depending on the actual disks being attached, ConditionPathExists= for all the management services requiring a keyring, etc. But that might be something for upstream too.

@srhb I'm fine with also removing the tmpfiles rules and adding back the mkdir -p commands in the ceph tests doing the initialization, if you prefer that.

lejonet (Contributor) commented Nov 9, 2019

> Both the StateDirectory= statement and the tmpfiles rules take care of creating the directory if it doesn't exist.
>
> StateDirectory= takes care of the directories being present on a (re)start - I only created the tmpfiles rules to support the initialization phase, so the directories already exist during initialization, without the services being started.
>
> The initialization process currently is a very manual thing, and apparently only "documented" by the nixos vm tests. You first switch to a generation with ceph services available but disabled, do the initialization, start the services, then switch to a generation autostarting them. We might want to change this in the future, to provide a better out-of-the-box experience.

First and foremost, the initialisation process is not documented by us, because the tests use the upstream manual deployment process.

Secondly, it's mainly out of convenience that we disable autostart from the beginning, so that the test log isn't spammed with failed startups of daemons before we've even started the whole initialization process (honestly though, the comment could be better at explaining that, to be completely fair). It's not a requirement to do so; it just makes the log easier to read for the one running the test.

Thirdly, the way you initialize things in a ceph cluster highly depends on how the sysadmins want to do it; there are at least three different ways, with the manual one being the easiest to put in a test, because it doesn't depend on a specific ceph tool or other automation tool. The ceph module was written to accommodate the running of a cluster, leaving all the initialization up to the sysadmins to do before enabling the daemons.
Simply put, outside the very uncommon and developer-centric case of a single-node cluster, there isn't really any way for us to do the full initialization in a unit's preStart or similar activation scripts. That is, unless the user wants to run a cluster without auth, but then, from the view of upstream, they are kind of on their own, because that isn't a recommended way to run a cluster, especially not in a production environment.
With that said, there is nothing stopping us from adding an option which enables an activation script that sets up a single-node cluster, so that at least those who want that get a nice, fully automated way to set one up.

> One option would be to have generators producing individual ceph-osd services that do the initialization and startup depending on the actual disks being attached, ConditionPathExists= for all the management services requiring a keyring, etc. But that might be something for upstream too.

Sadly, the initialization is quite stateful, so it's most likely not possible for us to automate the initialization process ourselves, especially as cephx (auth) is the default nowadays and it's not recommended to run a cluster without it (which is what would be required for us to be able to automate the initialization, so to speak).
Even "just" initializing an OSD requires either an admin keyring or a bootstrap-osd keyring to be present on the system.
However, if we completely ignore the whole auth part, then using the ceph-volume tool we should be able to automate the process for at least OSDs very easily, because ceph-volume takes care of all of that if you just give it the proper data paths to use for the OSD.

> @srhb I'm fine with also removing the tmpfiles rules and adding back the mkdir -p commands in the ceph tests doing the initialization, if you prefer that.

During initialization, we also need these folders before the unit is
started up. Move the mkdir -p commands in the vm tests to the line
immediately before they're required.
This prevents services from being started before they're initialized, and
renders the `systemd.targets.ceph.wantedBy = lib.mkForce [];` hack in
the vm tests obsolete - the config now starts up ceph after a reboot,
too.

Let's take advantage of that, crash all VMs, and boot them up again.
flokli (Contributor, Author) commented Nov 9, 2019

I agree automating parts of the initialization is tricky - reading up on the manual installation guide, they document that the state directories should be created during initialization, so I'm fine with having them created in our initialization as well, instead of in a tmpfiles rule.

I tinkered a bit with ConditionPathExists, and it's indeed a much nicer way to prevent services from starting up while not yet initialized - even better, the config is now reboot-able, so we can forcibly crash all VMs and ensure the cluster comes back up again :-)
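
The gist of it, as a sketch under assumptions (a mon daemon, with its keyring as the marker file; the merged code may differ in detail):

```nix
{
  systemd.services."ceph-mon-${daemonId}".unitConfig = {
    # Skip startup entirely until initialization has produced the
    # monitor keyring; systemctl status then reports the unmet
    # condition instead of a crash loop.
    ConditionPathExists = "/var/lib/ceph/mon/${clusterName}-${daemonId}/keyring";
  };
}
```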

flokli changed the title from "nixos/ceph: run unprivileged, use StateDirectory and tmpfiles, don't pass extraServiceConfig" to "nixos/ceph: run unprivileged, use state directories, handle non-initialized clusters without config switch" on Nov 9, 2019
lejonet (Contributor) left a comment

It's much cleaner to ensure that a daemon doesn't start when not initialized by actually coupling the unit to its state directory than to simply let the daemon crash.

Also, as @flokli stated in a discussion on IRC, systemctl status will now explicitly tell the user why it didn't start, instead of hoping that the user can deduce it from the errors.

flokli (Contributor, Author) commented Nov 9, 2019

@GrahamcOfBorg test ceph-single-node ceph-multi-node

flokli (Contributor, Author) commented Nov 9, 2019

@GrahamcOfBorg test ceph-single-node ceph-multi-node

flokli (Contributor, Author) commented Nov 10, 2019

I added some more cleanup to the ceph derivation itself in #73187.

flokli mentioned this pull request on Nov 11, 2019
srhb (Contributor) commented Nov 11, 2019

This looks good to me! I am a little concerned about the tmpfiles setup, because I've previously been bitten by the impedance mismatch between initialization in systemd units and initialization during activation. That said, this isn't the first module to start doing it this way, and I'm not sure I actually have a coherent criticism of it yet, so let's just see how it works out.

I might be able to find time to deploy this to our test cluster soonish, but probably not in time to hold up the merge.

flokli merged commit 60390c8 into NixOS:master on Nov 11, 2019
flokli deleted the ceph-tmpfiles branch on November 11, 2019 13:25
3 participants