
nixos/ceph: run unprivileged, use state directories, handle non-initialized clusters without config switch #72603

Merged
merged 4 commits into NixOS:master on Nov 11, 2019

Conversation

flokli (Contributor) commented Nov 2, 2019

Motivation for this change
nixos/ceph: create /etc/ceph and /var/lib/ceph via tmpfiles

We seem to be relying on those being present during runtime anyways.
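
Concretely, this amounts to tmpfiles rules along these lines (a sketch of the idea as a module fragment; the exact mode and age fields may differ in the merged code):

```nix
{
  systemd.tmpfiles.rules = [
    # Create the top-level config and state directories at boot if they
    # don't exist yet, owned by the ceph user and group.
    "d /etc/ceph - ceph ceph - -"
    "d /var/lib/ceph - ceph ceph - -"
  ];
}
```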

nixos/ceph: run unprivileged, use StateDirectory and tmpfiles, don't pass extraServiceConfig

Don't pass user and group to ceph, relying on it to drop privileges itself;
instead, let systemd handle running it as the appropriate user.

Use StateDirectory to create directories in
/var/lib/ceph/${daemonType}/${clusterName}-${daemonId}.

There previously was a condition on daemonType being one of mds, mon, rgw
or mgr. We only instantiate makeServices with these types, and "osd" was
special.

In the osd case, test examples suggest it'd be in something like
/var/lib/ceph/osd/ceph-${cfg.osd0.name} - so it's not special at all,
but exactly the same. Worst case, we'd end up with a directory we're
simply not using.

Unfortunately, that's not enough, as some of these directories already
need to exist during initialization, so we use some tmpfiles rules to
ensure they already exist on boot, too.

After that, however, we can remove all the manual steps needed to create
these directories.

This also inlines the extraServiceConfig into the makeService function,
as we have conditionals depending on daemonType there anyways.
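
Roughly, the resulting service generation looks like this. This is a sketch under assumptions, not the merged code: the helper name comes from the text above, `lib` is the module's nixpkgs lib, the `-f`/`--cluster`/`--id` flags are the usual ceph daemon flags, and the osd ExecStartPre path is hypothetical:

```nix
makeService = daemonType: daemonId: clusterName: ceph: {
  serviceConfig = {
    # Run unprivileged: systemd starts the daemon as the ceph user
    # instead of the daemon dropping privileges itself.
    User = "ceph";
    Group = "ceph";
    # Relative to /var/lib, so systemd creates and owns
    # /var/lib/ceph/${daemonType}/${clusterName}-${daemonId} on start.
    StateDirectory = "ceph/${daemonType}/${clusterName}-${daemonId}";
    ExecStart = "${ceph}/bin/ceph-${daemonType} -f --cluster ${clusterName} --id ${daemonId}";
  } // lib.optionalAttrs (daemonType == "osd") {
    # osd-specific bits that extraServiceConfig used to inject,
    # now an inline conditional on daemonType (path hypothetical).
    ExecStartPre = "${ceph}/libexec/ceph/ceph-osd-prestart.sh --cluster ${clusterName} --id ${daemonId}";
  };
};
```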
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.
Notify maintainers

cc @

lejonet (Contributor) left a comment

I don't see anything wrong with these changes. I've been meaning to refactor the module along these lines, as the module in its initial shape was "just" a mapping from the official systemd .service files.
This takes the module a step closer to being well integrated into both systemd and NixOS.

Comment on lines 409 to 410
++ (map (daemonName: "d /var/lib/ceph/mgr/${cfg.global.clusterName}-${daemonName} - ceph ceph - -") cfg.mgr.daemons)
++ (map (daemonName: "d /var/lib/ceph/osd/${cfg.global.clusterName}-${daemonName} - ceph ceph - -") cfg.osd.daemons);
lejonet (Contributor)
You're missing a map for the mon daemons; they follow the same pattern as the other daemons.
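
i.e., following the exact pattern of the two lines above, something like:

```nix
++ (map (daemonName: "d /var/lib/ceph/mon/${cfg.global.clusterName}-${daemonName} - ceph ceph - -") cfg.mon.daemons)
```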

flokli (Contributor, Author)

These are somewhat redundant with the StateDirectory= directives in the individual units; systemd creates those directories when starting the units.

However, as some work needs to be done to initialize the cluster before these units can be started, just relying on StateDirectory isn't enough, as the initialization tools would fail otherwise.

I didn't want to duplicate everything from StateDirectory - happy for better suggestions w.r.t. cluster initialization.
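
For example, the mgr rule above mirrors what the unit itself already declares via StateDirectory= (a sketch; the exact unit name is an assumption):

```nix
{
  systemd.services."ceph-mgr-${daemonName}".serviceConfig = {
    # Relative to /var/lib: systemd creates and chowns
    # /var/lib/ceph/mgr/${cfg.global.clusterName}-${daemonName},
    # but only when the unit actually starts.
    StateDirectory = "ceph/mgr/${cfg.global.clusterName}-${daemonName}";
  };
}
```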

lejonet (Contributor)

Well, the mon daemons are the most important daemons in the cluster. They are the first that need to be initialized in a cluster, so I think that we sadly need to duplicate everything from StateDirectory in tmpfiles too.

flokli (Contributor, Author) commented Nov 8, 2019

@lejonet you're fine with merging this in?

srhb (Contributor) commented Nov 8, 2019

Does this mean any user-facing changes if, say, the directories are nuked and the daemon restarted? In particular, I'd like to know whether recreation is now tied to activation rather than unit restart.

flokli (Contributor, Author) commented Nov 8, 2019

Both the StateDirectory= statement and the tmpfiles rules take care of creating the directory if it doesn't exist.

StateDirectory= takes care of the directories being present on a (re)start - I only created the tmpfiles rules to support the initialization phase, so the directories already exist during initialization, without the services being started.

The initialization process currently is a very manual thing, and apparently only "documented" by the nixos vm tests. You first switch to a generation with ceph services available but disabled, do the initialization, start the services, then switch to a generation autostarting them. We might want to change this in the future, to provide a better out-of-the-box experience.

One option would be to have generators producing individual ceph-osd services that do the initialization and startup depending on the actual disks being attached, ConditionPathExists= for all the management services requiring a keyring, etc. But that might be something for upstream too.

@srhb I'm fine with also removing the tmpfiles rules and adding back the mkdir -p commands in the ceph tests doing the initialization, if you prefer that.

lejonet (Contributor) commented Nov 9, 2019

> Both the StateDirectory= statement and the tmpfiles rules take care of creating the directory if it doesn't exist.
>
> StateDirectory= takes care of the directories being present on a (re)start - I only created the tmpfiles rules to support the initialization phase, so the directories already exist during initialization, without the services being started.
>
> The initialization process currently is a very manual thing, and apparently only "documented" by the nixos vm tests. You first switch to a generation with ceph services available but disabled, do the initialization, start the services, then switch to a generation autostarting them. We might want to change this in the future, to provide a better out-of-the-box experience.

First and foremost, the initialisation process is not documented by us, because the tests use the upstream manual deployment process.

Secondly, it's mainly out of convenience that we disable autostart from the beginning, so that the test log isn't spammed with failed startups of daemons before we've even started the whole initialization process (honestly though, the comment could be better at explaining that, to be completely fair). It's not a requirement to do so; it just makes the log easier to read for the one running the test.

Thirdly, the way you initialize things in a ceph cluster highly depends on how the sysadmins want to do it; there are at least three different ways, with the manual one being the easiest to put in a test, because it doesn't depend on a specific ceph tool or other automation tool. The ceph module was written to accommodate the running of a cluster, leaving all the initialization up to the sysadmins to do before enabling the daemons.
Simply put, outside the very uncommon and developer-centric case of a single-node cluster, there isn't really any way for us to do the full initialization in a unit's preStart or similar activation scripts. That is, unless the user wants to run a cluster without auth, but then, from the view of upstream, they are kind of on their own, because that isn't a recommended way to run a cluster, especially not in a production environment.
With that said, there is nothing stopping us from adding an option which enables an activation script that sets up a single-node cluster, so that at least those who want that get a nice, fully automated way to set one up.

> One option would be to have generators producing individual ceph-osd services that do the initialization and startup depending on the actual disks being attached, ConditionPathExists= for all the management services requiring a keyring, etc. But that might be something for upstream too.

Sadly, the initialization is quite stateful, so it's most likely not possible for us to automate the initialization process ourselves, especially as cephx (auth) is the default nowadays and it's not recommended to run a cluster without it (which is what would be required for us to be able to automate the initialization, so to speak).
Even "just" initializing an OSD requires either an admin keyring or a bootstrap-osd keyring to be present on the system.
However, if we completely ignore the whole auth part, then using the ceph-volume tool we should be able to automate the process for at least OSDs very easily, because ceph-volume takes care of all of that if you just give it the proper data paths to use for the OSD.

> @srhb I'm fine with also removing the tmpfiles rules and adding back the mkdir -p commands in the ceph tests doing the initialization, if you prefer that.

During initialization, we also need these folders before the unit is
started up. Move the mkdir -p commands in the vm tests to the line
immediately before they're required.
This prevents services from being started before they're initialized, and
renders the `systemd.targets.ceph.wantedBy = lib.mkForce [];` hack in
the vm tests obsolete - the config now starts up ceph after a reboot,
too.

Let's take advantage of that, crash all VMs, and boot them up again.
flokli (Contributor, Author) commented Nov 9, 2019

I agree automating parts of the initialization is tricky - reading up on the manual installation guide, they document that the state directories should be created during initialization, so I'm fine with having them created in our initialization as well, instead of in a tmpfiles rule.

I tinkered a bit with ConditionPathExists, and it's indeed a much nicer way to prevent services from starting up while not yet initialized - even better, the config is now reboot-able, so we can forcibly crash all VMs and ensure the cluster comes back up again :-)
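
The gist of it, as a sketch under assumptions (a mon daemon, with its keyring as the marker file; the merged code may differ in detail):

```nix
{
  systemd.services."ceph-mon-${daemonId}".unitConfig = {
    # Skip startup entirely until initialization has produced the
    # monitor keyring; systemctl status then reports the unmet
    # condition instead of a crash loop.
    ConditionPathExists = "/var/lib/ceph/mon/${clusterName}-${daemonId}/keyring";
  };
}
```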

flokli changed the title from "nixos/ceph: run unprivileged, use StateDirectory and tmpfiles, don't pass extraServiceConfig" to "nixos/ceph: run unprivileged, use state directories, handle non-initialized clusters without config switch" on Nov 9, 2019
lejonet (Contributor) left a comment

It's much cleaner to ensure that a daemon doesn't start when not initialized by actually coupling the unit to its state directory than to simply let the daemon crash.

Also, as @flokli stated in a discussion on IRC, systemctl status will now explicitly tell the user why it didn't start, instead of hoping that the user can deduce it from the errors.

flokli (Contributor, Author) commented Nov 9, 2019

@GrahamcOfBorg test ceph-single-node ceph-multi-node

flokli (Contributor, Author) commented Nov 9, 2019

@GrahamcOfBorg test ceph-single-node ceph-multi-node

flokli (Contributor, Author) commented Nov 10, 2019

I added some more cleanup to the ceph derivation itself in #73187.

flokli mentioned this pull request on Nov 11, 2019
srhb (Contributor) commented Nov 11, 2019

This looks good to me! I am a little concerned about the tmpfiles setup, because I've previously been bitten by the impedance mismatch between initialization in systemd units and initialization during activation. That said, this isn't the first module to start doing it this way, and I'm not sure I actually have a coherent criticism of it yet, so let's just see how it works out.

I might be able to find time to deploy this to our test cluster soonish, but probably not in time to hold up the merge.

flokli merged commit 60390c8 into NixOS:master on Nov 11, 2019
flokli deleted the ceph-tmpfiles branch on November 11, 2019 13:25
3 participants