nixos: Add 'confinement' options to systemd.services #57519
Just updated the description a bit to make it less confusing to read :-)
Cc: @edolstra, @domenkozar regarding the
Thanks, I've wanted this for a long time. A few comments:
As pointed out in the original post, the reason for this is to only support the
Ah, I'm already using
This is for getting additional store paths into the chroot, for example like this:

    {
      systemd.services.someservice = {
        environment.CONFIG = someconfig;
        chroot.enable = true;
        chroot.packages = [ someconfig ];
        serviceConfig = { /* ... */ };
      };
    }

I specifically didn't want to use the whole unit, because it also contains stuff like However, I could implement that by adding another confinement mode that includes the whole unit.
Unfortunately, PostgreSQL is a bit tricky to convert, because it uses
Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot.

This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times[1][2] in the past.

However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born.

Now if you want to chroot a systemd service, all you need to do is:

    {
      systemd.services.yourservice = {
        description = "My Shiny Service";
        wantedBy = [ "multi-user.target" ];

        chroot.enable = true;
        serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
      };
    }

If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes "script" and {pre,post}Start) need to be in the chroot, it can be specified using the chroot.packages option. By default (which uses the "full-apivfs"[3] confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately.

In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for "/bin/sh -c something".

Prior to submitting this here, I did a first implementation of this outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality from systemd-lib.nix, just because it's only a single line.

However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-chroot implementation also benefits from fixes to that functionality (which is now a proper function).

Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled.

The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup.

Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure.

[1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62
[2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124
[3]: The reason this is called "full-apivfs" instead of just "full" is to make room for a *real* "full" confinement mode, which is more restrictive even.
[4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix

Signed-off-by: aszlig <aszlig@nix.build>
Quoting @edolstra from [1]:

    I don't really like the name "chroot", something like "confine[ment]" or "restrict" seems better. Conceptually we're not providing a completely different filesystem tree but a restricted view of the same tree.

I already used "confinement" as a sub-option and I do agree that "chroot" sounds a bit too specific (especially because not *only* chroot is involved).

So this changes the module name and its option to use "confinement" instead of "chroot" and also renames the "chroot.confinement" to "confinement.mode".

[1]: NixOS#57519 (comment)

Signed-off-by: aszlig <aszlig@nix.build>
Another thing requested by @edolstra in [1]:

    We should not provide a different /bin/sh in the chroot, that's just asking for confusion and random shell script breakage. It should be the same shell (i.e. bash) as in a regular environment.

While I personally would even go as far to even have a very restricted shell that is not even a shell and basically *only* allows "/bin/sh -c" with only *very* minimal parsing of shell syntax, I do agree that people expect /bin/sh to be bash (or the one configured by environment.binsh) on NixOS.

So this should make both others and me happy in that I could just use confinement.binSh = "${pkgs.dash}/bin/dash" for the services I confine.

[1]: NixOS#57519 (comment)

Signed-off-by: aszlig <aszlig@nix.build>
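As a rough illustration of the opt-in described in this commit message, a confined service could point /bin/sh at dash as sketched below. The service name myservice and the package pkgs.myservice are placeholders borrowed from the other examples in this thread; only confinement.enable and confinement.binSh are taken from the commit message itself.

```nix
{ pkgs, ... }:

{
  systemd.services.myservice = {
    wantedBy = [ "multi-user.target" ];

    confinement.enable = true;
    # Opt in to a lighter shell for this unit only. Without this line,
    # /bin/sh inside the chroot stays the regular system shell.
    confinement.binSh = "${pkgs.dash}/bin/dash";

    # pkgs.myservice is a placeholder, not a real nixpkgs attribute.
    serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
  };
}
```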
@edolstra: Rebased with a few fixups and added the things you mentioned, except the one about adding the full unit closure to the chroot. When looking at how I used this in the past, I frequently use I could however add an option which allows to include the full unit closure, but I'm very reluctant to make it the default.
From @edolstra at [1]:

    BTW we probably should take the closure of the whole unit rather than just the exec commands, to handle things like Environment variables.

With this commit, there is now a "fullUnit" option, which can be enabled to include the full closure of the service unit into the chroot.

However, I did not enable this by default, because I do disagree here and *especially* things like environment variables or environment files shouldn't be in the closure of the chroot.

For example if you have something like:

    { pkgs, ... }:

    {
      systemd.services.foobar = {
        serviceConfig.EnvironmentFile = "${pkgs.writeText "secrets" ''
          user=admin
          password=abcdefg
        ''}";
      };
    }

We really do not want the *file* to end up in the chroot, but rather just the environment variables to be exported.

Another thing is that this makes it less predictable what actually will end up in the chroot, because we have a "globalEnvironment" option that will get merged in as well, so users adding stuff to that option will also make it available in confined units.

I also added a big fat warning about that in the description of the fullUnit option.

[1]: NixOS#57519 (comment)

Signed-off-by: aszlig <aszlig@nix.build>
Added.
Well, if you think that
Also agree with @edolstra on fullUnit, packages
should be good enough in those cases where ExecStart
doesn't reference what you need.
We can always add more options later if they turn out to be needed.
Maybe I'm a bit late, but what about reuse
@danbst: I thought about Nevertheless, it's not too late, we can still change it :-)
My implementation was relying on PrivateDevices, PrivateTmp, PrivateUsers and others to be false by default if chroot-only mode is used.

However there is an ongoing effort[1] to change these defaults, which then will actually increase the attack surface in chroot-only mode, because it is expected that there is no /dev, /sys or /proc.

If for example PrivateDevices is enabled by default, there suddenly will be a mounted /dev in the chroot and we wouldn't detect it.

Fortunately, our tests cover that, but I'm preparing for this anyway so that we have a smoother transition without the need to fix our implementation again.

Thanks to @infinisil for the heads-up.

[1]: NixOS#14645

Signed-off-by: aszlig <aszlig@nix.build>
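To make the concern above concrete, here is a minimal sketch of a chroot-only service that pins the relevant sandboxing options to false itself rather than relying on systemd's defaults. It only mirrors the idea of the commit at the level of a single unit; the service name is hypothetical, pkgs.myservice is a placeholder, and the exact set of options the module sets internally is an assumption.

```nix
{ pkgs, ... }:

{
  # "example" is a hypothetical service name used for illustration.
  systemd.services.example = {
    confinement.enable = true;
    confinement.mode = "chroot-only";

    serviceConfig = {
      # pkgs.myservice is a placeholder, not a real nixpkgs attribute.
      ExecStart = "${pkgs.myservice}/bin/myservice";

      # Stated explicitly so that a future change of defaults cannot
      # silently add /dev, /tmp or a user namespace to the chroot.
      PrivateDevices = false;
      PrivateTmp = false;
      PrivateUsers = false;
    };
  };
}
```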
The default, which is /tmp, has a few issues associated with it:

One being that it makes it easy for users on the system to spoof a PostgreSQL server if it's not running, causing applications to connect to their provided sockets instead of just failing to connect.

Another one is that it makes sandboxing of PostgreSQL and other services unnecessarily difficult. This is already the case if only PrivateTmp is used in a systemd service, so in order for such a service to be able to connect to PostgreSQL, a bind mount needs to be done from /tmp to some other path, so the service can access it. This pretty much defeats the whole purpose of PrivateTmp.

We have regularly run into issues with this in the past already (one example would be NixOS#24317) and with the new systemd-confinement mode upcoming in NixOS#57519, it becomes even more tedious to sandbox services.

I've tested this change against all the postgresql NixOS VM tests and they still succeed and I also grepped through the source tree to replace other occurrences where we might have /tmp hardcoded. Luckily there were very few of them.

Signed-off-by: aszlig <aszlig@nix.build>
Cc: @ocharles, @thoughtpolice, @danbst
Noted by @infinisil on IRC:

    infinisil: Question regarding the confinement PR
    infinisil: On line 136 you do different things depending on RootDirectoryStartOnly
    infinisil: But on line 157 you have an assertion that disallows that option being true
    infinisil: Is there a reason behind this or am I missing something

I originally left this in so that once systemd supports that, we can just flip a switch and remove the assertion and thus support RootDirectoryStartOnly for our confinement module.

However, this doesn't seem to be on the roadmap for systemd in the foreseeable future, so I'll just remove this, especially because it's very easy to add it again, once it is supported.

Signed-off-by: aszlig <aszlig@nix.build>
So far we had MountFlags = "private", but as @infinisil has correctly noticed, there is a dedicated PrivateMounts option, which does exactly that and is better integrated than providing raw mount flags. When checking for the reason why I used MountFlags instead of PrivateMounts, I found that at the time I wrote the initial version of this module (Mar 12 06:15:58 2018 +0100) the PrivateMounts option didn't exist yet and has been added to systemd in Jun 13 08:20:18 2018 +0200. Signed-off-by: aszlig <aszlig@nix.build>
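For readers unfamiliar with the two settings, the change described above amounts to something like the following at the level of a single unit. This is only a sketch with a hypothetical service name, not the module's actual code.

```nix
{
  # "example" is a hypothetical service name used for illustration.
  systemd.services.example.serviceConfig = {
    # Previously: raw mount propagation flags.
    # MountFlags = "private";

    # Now: the dedicated option, which gives the unit its own mount
    # namespace.
    PrivateMounts = true;
  };
}
```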
I gave some review over IRC, which @aszlig addressed with the last two commits. This PR looks good to me now and I'd like to merge this soon.
I also added the Just also checking back with @samueldr and @lheckemann whether we possibly want this in NixOS 19.03.
👍 for backporting as this only enhances the feature set of nixos, and does not (at a glance) change existing behaviours.
First of all, the reason I added this to the "highlights" section is that we want users to be aware of these options, because in the end we really want to decrease the attack surface of NixOS services and this is a step towards improving that situation.

The reason why I'm adding this to the changelog of the NixOS 19.03 release instead of 19.09 is that it makes backporting services that use these options easier. Doing the backport of the confinement module after the official release would mean that it's not part of the release announcement and potentially could fall under the radar of most users.

These options and the whole module also do not change anything in existing services or affect other modules, so they're purely optional.

Adding this "last minute" to the 19.03 release doesn't hurt and is probably a good preparation for the next months where we hopefully confine as many services as we can :-)

I also have asked @samueldr and @lheckemann, whether they're okay with the inclusion in 19.03. While so far only @samueldr has accepted the change, we can still move the changelog entry to the NixOS 19.09 release notes in case @lheckemann rejects it.

Signed-off-by: aszlig <aszlig@nix.build>
I'm neutral on backporting. On the one hand, as @samueldr says, this doesn't change any existing functionality. On the other hand, it's a new feature and we're well past the feature freeze, which AFAIU is intended so that new, potentially buggy features aren't introduced in the "stabilisation period". It is a cool feature though… :)
I don't fully understand why this doesn't interact well with
Edit: I'll try to make it work right now, but I can't promise anything. If it's too complicated to do we can still lift this restriction later.
@arianvp: Okay, this is getting a bit more complicated, because systemd doesn't seem to take the
Currently if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store or use a separate derivation which gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot.

This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages.

However, this process is a bit tedious, so the changes here implement this in a more generic way.

Now if you want to chroot a systemd service, all you need to do is:

    {
      systemd.services.myservice = {
        description = "My Shiny Service";
        wantedBy = [ "multi-user.target" ];

        confinement.enable = true;
        serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
      };
    }

If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes script and {pre,post}Start) need to be in the chroot, it can be specified using the confinement.packages option. By default (which uses the full-apivfs confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately.

In addition - and by default - a /bin/sh executable is provided, which is useful for most programs that use the system() C library call to execute commands via shell.

Unfortunately, there are a few limitations at the moment. The first being that DynamicUser doesn't work in conjunction with tmpfs, because systemd seems to ignore the TemporaryFileSystem option if DynamicUser is enabled. I started implementing a workaround to do this, but I decided to not include it as part of this pull request, because it needs a lot more testing to ensure it's consistent with the behaviour without DynamicUser.

The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and doesn't include/exclude the individual bind mounts or the tmpfs.

A quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure.

The changes also come with a changelog entry for NixOS 19.03, which is why I asked for a vote of the NixOS 19.03 stable maintainers whether to include it (I admit it's a bit late a few days before official release, sorry for that):

@samueldr:

    Via pull request comment[1]:

      +1 for backporting as this only enhances the feature set of nixos, and does not (at a glance) change existing behaviours.

    Via IRC:

      new feature: -1, tests +1, we're at zero, self-contained, with no global effects without actively using it, +1, I think it's good

@lheckemann:

    Via pull request comment[2]:

      I'm neutral on backporting. On the one hand, as @samueldr says, this doesn't change any existing functionality. On the other hand, it's a new feature and we're well past the feature freeze, which AFAIU is intended so that new, potentially buggy features aren't introduced in the "stabilisation period". It is a cool feature though… :)

A few other people on IRC didn't have opposition either against late inclusion into NixOS 19.03:

    @edolstra: "I'm not against it"
    @infinisil: "+1 from me as well"
    @grahamc: "IMO its up to the RMs"

So that makes +1 from @samueldr, 0 from @lheckemann, 0 from @edolstra and +1 from @infinisil (even though he's not a release manager) and no opposition from anyone, which is the reason why I'm merging this right now.

I also would like to thank @infinisil, @edolstra and @danbst for their reviews.

[1]: NixOS#57519 (comment)
[2]: NixOS#57519 (comment)
(cherry picked from commit dcf40f7)
I would've squashed a bit, but oh well
The systemd-chroot module here in this repository has already been upstreamed[1] quite a while ago as "systemd-confinement" with "confinement" as the option namespace. Since we no longer want or need to support older releases than NixOS 19.03, we can safely drop the module on our side and rename the options accordingly. [1]: NixOS/nixpkgs#57519 Signed-off-by: aszlig <aszlig@nix.build>
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/nixos-policy-regarding-systemd-confinement/18976/1
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/hardening-systemd-services/17147/18
Problem statement
Currently, if you want to properly chroot a systemd service, you could do it using `BindReadOnlyPaths=/nix/store` (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called `ProtectSystem`, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot.

This can be achieved using `pkgs.closureInfo` and a small derivation that packs everything into a systemd unit, which later can be added to `systemd.packages`. That's also what I did several times in the past.

However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born.
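For context, the manual approach described above could look roughly like the sketch below. It is not the code from the linked repositories: the service name, the use of pkgs.hello as a stand-in package and the drop-in file layout are all assumptions, and the setup of the chroot root directory itself is deliberately left out.

```nix
{ pkgs, ... }:

let
  # Hypothetical package standing in for the service to be chrooted.
  myservice = pkgs.hello;

  # closureInfo writes the runtime closure to $out/store-paths,
  # one store path per line.
  closure = pkgs.closureInfo { rootPaths = [ myservice ]; };

  # Small derivation producing a drop-in for a hypothetical
  # myservice.service that bind-mounts every closure path read-only.
  chrootDropIn = pkgs.runCommand "myservice-chroot" { } ''
    dir="$out/lib/systemd/system/myservice.service.d"
    mkdir -p "$dir"
    {
      echo "[Service]"
      while read -r storePath; do
        echo "BindReadOnlyPaths=$storePath"
      done < ${closure}/store-paths
    } > "$dir/chroot.conf"
  '';
in
{
  systemd.packages = [ chrootDropIn ];
}
```

Having to repeat this per service is the tedium the confinement options are meant to remove.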
Description
Now if you want to chroot a systemd service, all you need to do is:
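The example that belongs here did not survive in this copy of the description. Based on the merge commit message earlier in this thread, it presumably looked like the following, with pkgs.myservice being a placeholder package rather than a real nixpkgs attribute:

```nix
{ pkgs, ... }:

{
  systemd.services.myservice = {
    description = "My Shiny Service";
    wantedBy = [ "multi-user.target" ];

    # Enables the chroot containing only the runtime closure of the
    # unit's Exec* commands.
    confinement.enable = true;
    serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
  };
}
```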
If more than the dependencies for the `ExecStart*` and `ExecStop*` (which btw. also includes `script` and `{pre,post}Start`) need to be in the chroot, it can be specified using the `confinement.packages` option. By default (which uses the `full-apivfs`[1] confinement mode), a user namespace is set up as well and `/proc`, `/sys` and `/dev` are mounted appropriately.

In addition - and by default - a `/bin/sh` executable is provided as well, which is useful for most programs that use the `system()` C library call to execute commands via shell. The shell providing `/bin/sh` is `dash` instead of the default in NixOS (which is `bash`), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for `/bin/sh -c something`.

Prior to submitting this here, I did a first implementation of this outside of nixpkgs, which duplicated the pathSafeName functionality from `systemd-lib.nix`, just because it's only a single line.

However, I decided to just re-use the one from systemd here and subsequently made it available when importing `systemd-lib.nix`, so that the `systemd-confinement` implementation also benefits from fixes to that functionality (which is now a proper function).

Unfortunately, we do have a few limitations as well. The first being that `DynamicUser` doesn't work in conjunction with `tmpfs`, because it already sets up a `tmpfs` in a different path and simply ignores the one we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever `DynamicUser` is enabled.

The second limitation/issue is that `RootDirectoryStartOnly` doesn't work right now, because it only affects the `RootDirectory` option and not the individual bind mounts or our `tmpfs`. It would be helpful if systemd would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup.

Another quirk we do have right now is that systemd tries to create a `/usr` directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure.

[1]: The reason this is called `full-apivfs` instead of just `full` is to make room for a real `full` confinement mode in the future (once the systemd interface has improved), which is more restrictive even.
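As a hedged illustration of the options mentioned in this description, the sketch below adds an extra store path to the chroot and selects the confinement mode explicitly. The service name, the config file contents and pkgs.myservice are made up for the example; the option names are the ones discussed in this pull request.

```nix
{ pkgs, ... }:

let
  # Hypothetical config file that is only referenced through an
  # environment variable, so it is not part of the Exec* closure.
  someconfig = pkgs.writeText "someservice.conf" ''
    listen = 127.0.0.1
  '';
in
{
  systemd.services.someservice = {
    wantedBy = [ "multi-user.target" ];
    environment.CONFIG = "${someconfig}";

    confinement.enable = true;
    # Extra store paths (and their closures) to expose in the chroot.
    confinement.packages = [ someconfig ];
    # Same as the default described above, spelled out for clarity.
    confinement.mode = "full-apivfs";

    # pkgs.myservice is a placeholder, not a real nixpkgs attribute.
    serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
  };
}
```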