Skip to content

nixos: Add 'confinement' options to systemd.services #57519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 29, 2019

Conversation

aszlig
Copy link
Member

@aszlig aszlig commented Mar 12, 2019

Problem statement

Currently, if you want to properly chroot a systemd service, you could do it using BindReadOnlyPaths=/nix/store (which is not what I'd call "properly", because the whole store is still accessible) or use a separate derivation that gathers the runtime closure of the service you want to chroot. The former is the easier method and there is also a method directly offered by systemd, called ProtectSystem, which still leaves the whole store accessible. The latter however is a bit more involved, because you need to bind-mount each store path of the runtime closure of the service you want to chroot.

This can be achieved using pkgs.closureInfo and a small derivation that packs everything into a systemd unit, which later can be added to systemd.packages. That's also what I did several times in the past.

However, this process got a bit tedious, so I decided that it would be generally useful for NixOS, so this very implementation was born.

Description

Now if you want to chroot a systemd service, all you need to do is:

  {
    systemd.services.yourservice = {
      description = "My Shiny Service";
      wantedBy = [ "multi-user.target" ];

      confinement.enable = true;
      serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
    };
  }

If more than the dependencies for the ExecStart* and ExecStop* (which btw. also includes script and {pre,post}Start) need to be in the chroot, it can be specified using the confinement.packages option. By default (which uses the full-apivfs1 confinement mode), a user namespace is set up as well and /proc, /sys and /dev are mounted appropriately.

In addition - and by default - a /bin/sh executable is provided as well, which is useful for most programs that use the system() C library call to execute commands via shell. The shell providing /bin/sh is dash
instead of the default in NixOS (which is bash), because it's way more lightweight and after all we're chrooting because we want to lower the attack surface and it should be only used for /bin/sh -c something.

Prior to submitting this here, I did a first implementation of this outside of nixpkgs, which duplicated the pathSafeName functionality from systemd-lib.nix, just because it's only a single line.

However, I decided to just re-use the one from systemd here and subsequently made it available when importing systemd-lib.nix, so that the systemd-confinement implementation also benefits from fixes to that functionality (which is now a proper function).

Unfortunately, we do have a few limitations as well. The first being that DynamicUser doesn't work in conjunction with tmpfs, because it already sets up a tmpfs in a different path and simply ignores the one
we define. We could probably solve this by detecting it and try to bind-mount our paths to that different path whenever DynamicUser is enabled.

The second limitation/issue is that RootDirectoryStartOnly doesn't work right now, because it only affects the RootDirectory option and not the individual bind mounts or our tmpfs. It would be helpful if systemd
would have a way to disable specific bind mounts as well or at least have some way to ignore failures for the bind mounts/tmpfs setup.

Another quirk we do have right now is that systemd tries to create a /usr directory within the chroot, which subsequently fails. Fortunately, this is just an ugly error and not a hard failure.

1 The reason this is called full-apivfs instead of just full is to make room for a real full confinement mode in the future (once the systemd interface has improved), which is more restrictive even.

@aszlig aszlig added 0.kind: enhancement Add something new 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (new) This PR adds a module in `nixos/` labels Mar 12, 2019
@aszlig aszlig requested a review from infinisil as a code owner March 12, 2019 13:29
@GrahamcOfBorg GrahamcOfBorg added 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 labels Mar 12, 2019
@aszlig aszlig force-pushed the systemd-chroot branch 2 times, most recently from f759370 to 9999b0b Compare March 14, 2019 03:19
@aszlig
Copy link
Member Author

aszlig commented Mar 14, 2019

Just updated the description a bit to make it less confusing to read :-)

@aszlig
Copy link
Member Author

aszlig commented Mar 14, 2019

Cc: @edolstra, @domenkozar regarding the pathSafeName change.

@edolstra
Copy link
Member

Thanks, I've wanted this for a long time. A few comments:

  • We should not provide a different /bin/sh in the chroot, that's just asking for confusion and random shell script breakage. It should be the same shell (i.e. bash) as in a regular environment.

  • I don't really like the name "chroot", something like "confine[ment]" or "restrict" seems better. Conceptually we're not providing a completely different filesystem tree but a restricted view of the same tree.

  • What's the use case for chroot.packages? If a store path doesn't appear in the closure of the exec commands, then how would they even be able to find that path? (BTW we probably should take the closure of the whole unit rather than just the exec commands, to handle things like Environment variables).

  • We should convert a few services (e.g. PostgreSQL) to use this. (That will also provide some validation.)

@aszlig
Copy link
Member Author

aszlig commented Mar 14, 2019

We should not provide a different /bin/sh in the chroot, that's just asking for confusion and random shell script breakage. It should be the same shell (i.e. bash) as in a regular environment.

As pointed out in the original post, the reason for this is to only support the /bin/sh -c something case and ideally use an even smaller dummy-shell, that only supports that case but nothing else.

I don't really like the name "chroot", something like "confine[ment]" or "restrict" seems better. Conceptually we're not providing a completely different filesystem tree but a restricted view of the same tree.

Ah, I'm already using confinement as a suboption, but I'll change that to mode and rename the whole option.

What's the use case for chroot.packages? If a store path doesn't appear in the closure of the exec commands, then how would they even be able to find that path?

This is for getting additional store paths into the chroot, for example like this:

{
  systemd.services.someservice = {
    environment.CONFIG = someconfig;
    chroot.enable = true;
    chroot.packages = [ someconfig ];
    serviceConfig = { /* ... */ };
  };
}

(BTW we probably should take the closure of the whole unit rather than just the exec commands, to handle things like Environment variables).

I specifically didn't want to use the whole unit, because it also contains stuff like coreutils, which I was trying to avoid during my sandboxing efforts to prevent applications from executing weird stuff I don't want to have.

However, I could implement that by adding another confinement mode that includes the whole unit.

We should convert a few services (e.g. PostgreSQL) to use this. (That will also provide some validation.)

Unfortunately, PostgreSQL is a bit tricky to convert, because it uses /tmp for its socket. We could however bind-mount it to a different directory within the chroot.

aszlig added 3 commits March 14, 2019 19:14
Currently, if you want to properly chroot a systemd service, you could
do it using BindReadOnlyPaths=/nix/store (which is not what I'd call
"properly", because the whole store is still accessible) or use a
separate derivation that gathers the runtime closure of the service you
want to chroot. The former is the easier method and there is also a
method directly offered by systemd, called ProtectSystem, which still
leaves the whole store accessible. The latter however is a bit more
involved, because you need to bind-mount each store path of the runtime
closure of the service you want to chroot.

This can be achieved using pkgs.closureInfo and a small derivation that
packs everything into a systemd unit, which later can be added to
systemd.packages. That's also what I did several times[1][2] in the
past.

However, this process got a bit tedious, so I decided that it would be
generally useful for NixOS, so this very implementation was born.

Now if you want to chroot a systemd service, all you need to do is:

  {
    systemd.services.yourservice = {
      description = "My Shiny Service";
      wantedBy = [ "multi-user.target" ];

      chroot.enable = true;
      serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
    };
  }

If more than the dependencies for the ExecStart* and ExecStop* (which
btw. also includes "script" and {pre,post}Start) need to be in the
chroot, it can be specified using the chroot.packages option. By
default (which uses the "full-apivfs"[3] confinement mode), a user
namespace is set up as well and /proc, /sys and /dev are mounted
appropriately.

In addition - and by default - a /bin/sh executable is provided as well,
which is useful for most programs that use the system() C library call
to execute commands via shell. The shell providing /bin/sh is dash
instead of the default in NixOS (which is bash), because it's way more
lightweight and after all we're chrooting because we want to lower the
attack surface and it should be only used for "/bin/sh -c something".

Prior to submitting this here, I did a first implementation of this
outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality
from systemd-lib.nix, just because it's only a single line.

However, I decided to just re-use the one from systemd here and
subsequently made it available when importing systemd-lib.nix, so that
the systemd-chroot implementation also benefits from fixes to that
functionality (which is now a proper function).

Unfortunately, we do have a few limitations as well. The first being
that DynamicUser doesn't work in conjunction with tmpfs, because it
already sets up a tmpfs in a different path and simply ignores the one
we define. We could probably solve this by detecting it and try to
bind-mount our paths to that different path whenever DynamicUser is
enabled.

The second limitation/issue is that RootDirectoryStartOnly doesn't work
right now, because it only affects the RootDirectory option and not the
individual bind mounts or our tmpfs. It would be helpful if systemd
would have a way to disable specific bind mounts as well or at least
have some way to ignore failures for the bind mounts/tmpfs setup.

Another quirk we do have right now is that systemd tries to create a
/usr directory within the chroot, which subsequently fails. Fortunately,
this is just an ugly error and not a hard failure.

[1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62
[2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124
[3]: The reason this is called "full-apivfs" instead of just "full" is
     to make room for a *real* "full" confinement mode, which is more
     restrictive even.
[4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix

Signed-off-by: aszlig <aszlig@nix.build>
Quoting @edolstra from [1]:

  I don't really like the name "chroot", something like "confine[ment]"
  or "restrict" seems better. Conceptually we're not providing a
  completely different filesystem tree but a restricted view of the same
  tree.

I already used "confinement" as a sub-option and I do agree that
"chroot" sounds a bit too specific (especially because not *only* chroot
is involved).

So this changes the module name and its option to use "confinement"
instead of "chroot" and also renames the "chroot.confinement" to
"confinement.mode".

[1]: NixOS#57519 (comment)

Signed-off-by: aszlig <aszlig@nix.build>

Verified

This commit was signed with the committer’s verified signature.
michaelpj Michael Peyton Jones
Another thing requested by @edolstra in [1]:

  We should not provide a different /bin/sh in the chroot, that's just
  asking for confusion and random shell script breakage. It should be
  the same shell (i.e. bash) as in a regular environment.

While I personally would even go as far to even have a very restricted
shell that is not even a shell and basically *only* allows "/bin/sh -c"
with only *very* minimal parsing of shell syntax, I do agree that people
expect /bin/sh to be bash (or the one configured by environment.binsh)
on NixOS.

So this should make both others and me happy in that I could just use
confinement.binSh = "${pkgs.dash}/bin/dash" for the services I confine.

[1]: NixOS#57519 (comment)

Signed-off-by: aszlig <aszlig@nix.build>
@aszlig
Copy link
Member Author

aszlig commented Mar 14, 2019

@edolstra: Rebased with a few fixups and added the things you mentioned, except the one about adding the full unit closure to the chroot. When looking at how I used this in the past, I frequently use EnvironmentFile/environment to pass secrets to the service. I wouldn't want those to ever be available to the service in question and as mentioned, there is still the packages option to provide extra paths to add to the closure.

I could however add an option which allows to include the full unit closure, but I'm very reluctant to make it the default.

From @edolstra at [1]:

  BTW we probably should take the closure of the whole unit rather than
  just the exec commands, to handle things like Environment variables.

With this commit, there is now a "fullUnit" option, which can be enabled
to include the full closure of the service unit into the chroot.

However, I did not enable this by default, because I do disagree here
and *especially* things like environment variables or environment files
shouldn't be in the closure of the chroot.

For example if you have something like:

  { pkgs, ... }:

  {
    systemd.services.foobar = {
      serviceConfig.EnvironmentFile = ${pkgs.writeText "secrets" ''
        user=admin
        password=abcdefg
      '';
    };
  }

We really do not want the *file* to end up in the chroot, but rather
just the environment variables to be exported.

Another thing is that this makes it less predictable what actually will
end up in the chroot, because we have a "globalEnvironment" option that
will get merged in as well, so users adding stuff to that option will
also make it available in confined units.

I also added a big fat warning about that in the description of the
fullUnit option.

[1]: NixOS#57519 (comment)

Signed-off-by: aszlig <aszlig@nix.build>
@aszlig
Copy link
Member Author

aszlig commented Mar 14, 2019

I could however add an option which allows to include the full unit closure, but I'm very reluctant to make it the default.

Added.

@aszlig aszlig changed the title nixos: Add 'chroot' options to systemd.services nixos: Add 'confinement' options to systemd.services Mar 14, 2019
@edolstra
Copy link
Member

Well, if you think that fullUnit is a bad idea, then maybe it's better not to have it.

Copy link
Member

@infinisil infinisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also agree with @edolstra on fullUnit, packages should be good enough in those cases where ExecStart doesn't reference what you need.

We can always add more options later if they turn out to be needed.

@aszlig aszlig requested a review from infinisil March 15, 2019 01:21
@danbst
Copy link
Contributor

danbst commented Mar 15, 2019

Maybe I'm a bit late, but what about reuse sandbox word instead of confinement?

@aszlig
Copy link
Member Author

aszlig commented Mar 15, 2019

@danbst: I thought about sandbox in the first place, but for a full sandbox I think this is not exhaustive enough because it only confines on the file system level. That's why I also went for chroot in my original implementation.

Nevertheless however, it's not too late, we can still change it :-)

My implementation was relying on PrivateDevices, PrivateTmp,
PrivateUsers and others to be false by default if chroot-only mode is
used.

However there is an ongoing effort[1] to change these defaults, which
then will actually increase the attack surface in chroot-only mode,
because it is expected that there is no /dev, /sys or /proc.

If for example PrivateDevices is enabled by default, there suddenly will
be a mounted /dev in the chroot and we wouldn't detect it.

Fortunately, our tests cover that, but I'm preparing for this anyway so
that we have a smoother transition without the need to fix our
implementation again.

Thanks to @infinisil for the heads-up.

[1]: NixOS#14645

Signed-off-by: aszlig <aszlig@nix.build>
aszlig added a commit to aszlig/nixpkgs that referenced this pull request Mar 15, 2019

Verified

This commit was signed with the committer’s verified signature.
michaelpj Michael Peyton Jones
The default, which is /tmp, has a few issues associated with it:

One being that it makes it easy for users on the system to spoof a
PostgreSQL server if it's not running, causing applications to connect
to their provided sockets instead of just failing to connect.

Another one is that it makes sandboxing of PostgreSQL and other services
unnecessarily difficult. This is already the case if only PrivateTmp is
used in a systemd service, so in order for such a service to be able to
connect to PostgreSQL, a bind mount needs to be done from /tmp to some
other path, so the service can access it. This pretty much defeats the
whole purpose of PrivateTmp.

We regularily run into issues with this in the past already (one example
would be NixOS#24317) and with the new
systemd-confinement mode upcoming in
NixOS#57519, it makes it even more
tedious to sandbox services.

I've tested this change against all the postgresql NixOS VM tests and
they still succeed and I also grepped through the source tree to replace
other occasions where we might have /tmp hardcoded. Luckily there were
very few occasions.

Signed-off-by: aszlig <aszlig@nix.build>
Cc: @ocharles, @thoughtpolice, @danbst
aszlig added 2 commits March 27, 2019 20:22
Noted by @infinisil on IRC:

   infinisil: Question regarding the confinement PR
   infinisil: On line 136 you do different things depending on
              RootDirectoryStartOnly
   infinisil: But on line 157 you have an assertion that disallows that
              option being true
   infinisil: Is there a reason behind this or am I missing something

I originally left this in so that once systemd supports that, we can
just flip a switch and remove the assertion and thus support
RootDirectoryStartOnly for our confinement module.

However, this doesn't seem to be on the roadmap for systemd in the
foreseeable future, so I'll just remove this, especially because it's
very easy to add it again, once it is supported.

Signed-off-by: aszlig <aszlig@nix.build>
So far we had MountFlags = "private", but as @infinisil has correctly
noticed, there is a dedicated PrivateMounts option, which does exactly
that and is better integrated than providing raw mount flags.

When checking for the reason why I used MountFlags instead of
PrivateMounts, I found that at the time I wrote the initial version of
this module (Mar 12 06:15:58 2018 +0100) the PrivateMounts option didn't
exist yet and has been added to systemd in Jun 13 08:20:18 2018 +0200.

Signed-off-by: aszlig <aszlig@nix.build>
Copy link
Member

@infinisil infinisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave some review over IRC, which @aszlig addressed with the last two commits. This PR looks good to me now and I'd like to merge this soon.

@aszlig aszlig added the 9.needs: changelog This PR needs a changelog entry label Mar 27, 2019
@aszlig
Copy link
Member Author

aszlig commented Mar 27, 2019

I also added the needs: changelog label, because it should make people more aware of this when writing NixOS modules and we ideally have more services using this option.

Just also checking back with @samueldr and @lheckemann whether we possibly want this in NixOS 19.03.

@samueldr
Copy link
Member

👍 for backporting as this only enhances the feature set of nixos, and does not (at a glance) change existing behaviours.

First of all, the reason I added this to the "highlights" section is
that we want users to be aware of these options, because in the end we
really want to decrease the attack surface of NixOS services and this is
a step towards improving that situation.

The reason why I'm adding this to the changelog of the NixOS 19.03
release instead of 19.09 is that it makes backporting services that use
these options easier. Doing the backport of the confinement module after
the official release would mean that it's not part of the release
announcement and potentially could fall under the radar of most users.

These options and the whole module also do not change anything in
existing services or affect other modules, so they're purely optional.

Adding this "last minute" to the 19.03 release doesn't hurt and is
probably a good preparation for the next months where we hopefully
confine as much services as we can :-)

I also have asked @samueldr and @lheckemann, whether they're okay with
the inclusion in 19.03. While so far only @samueldr has accepted the
change, we can still move the changelog entry to the NixOS 19.09 release
notes in case @lheckemann rejects it.

Signed-off-by: aszlig <aszlig@nix.build>
@aszlig aszlig removed the 9.needs: changelog This PR needs a changelog entry label Mar 27, 2019
@GrahamcOfBorg GrahamcOfBorg added 8.has: changelog 8.has: documentation This PR adds or changes documentation labels Mar 27, 2019
@lheckemann
Copy link
Member

I'm neutral on backporting. On the one hand, as @samueldr says, this doesn't change any existing functionality. On the other hand, it's a new feature and we're well past the feature freeze, which AFAIU is intended so that new, potentially buggy features aren't introduced in the "stabilisation period". It is a cool feature though… :)

@aszlig aszlig self-assigned this Mar 28, 2019
@arianvp
Copy link
Member

arianvp commented Mar 28, 2019

I don't fully understand why this doesn't interact well with DynamicUser. I make heavy use of it and it would be awesome to use it together with this feature. Could you elaborate a bit?

@aszlig
Copy link
Member Author

aszlig commented Mar 28, 2019

@arianvp: See ac64ce9:

Unfortunately, we do have a few limitations as well. The first being
that DynamicUser doesn't work in conjunction with tmpfs, because it
already sets up a tmpfs in a different path and simply ignores the one
we define. We could probably solve this by detecting it and try to
bind-mount our paths to that different path whenever DynamicUser is
enabled.

Edit: I'll try to make it work right now, but I can't promise anything. If it's too complicated to do we can still lift this restriction later.

@aszlig
Copy link
Member Author

aszlig commented Mar 29, 2019

@arianvp: Okay, this is getting a bit more complicated, because systemd doesn't seem to take the TemporaryFilesystem option into account, so we need to create a separate mount unit for working around this behaviour. I'd like to do this in a followup PR though (and maybe even fix this in systemd directly), because this needs a lot more tests to make sure it stays consistent with the current implementation without DynamicUser = true.

pull bot pushed a commit to evanjs/nixpkgs that referenced this pull request Mar 29, 2019
Currently if you want to properly chroot a systemd service, you could do
it using BindReadOnlyPaths=/nix/store or use a separate derivation which
gathers the runtime closure of the service you want to chroot. The
former is the easier method and there is also a method directly offered
by systemd, called ProtectSystem, which still leaves the whole store
accessible. The latter however is a bit more involved, because you need
to bind-mount each store path of the runtime closure of the service you
want to chroot.

This can be achieved using pkgs.closureInfo and a small derivation that
packs everything into a systemd unit, which later can be added to
systemd.packages.

However, this process is a bit tedious, so the changes here implement
this in a more generic way.

Now if you want to chroot a systemd service, all you need to do is:

  {
    systemd.services.myservice = {
      description = "My Shiny Service";
      wantedBy = [ "multi-user.target" ];

      confinement.enable = true;
      serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
    };
  }

If more than the dependencies for the ExecStart* and ExecStop* (which
btw. also includes script and {pre,post}Start) need to be in the chroot,
it can be specified using the confinement.packages option. By default
(which uses the full-apivfs confinement mode), a user namespace is set
up as well and /proc, /sys and /dev are mounted appropriately.

In addition - and by default - a /bin/sh executable is provided, which
is useful for most programs that use the system() C library call to
execute commands via shell.

Unfortunately, there are a few limitations at the moment. The first
being that DynamicUser doesn't work in conjunction with tmpfs, because
systemd seems to ignore the TemporaryFileSystem option if DynamicUser is
enabled. I started implementing a workaround to do this, but I decided
to not include it as part of this pull request, because it needs a lot
more testing to ensure it's consistent with the behaviour without
DynamicUser.

The second limitation/issue is that RootDirectoryStartOnly doesn't work
right now, because it only affects the RootDirectory option and doesn't
include/exclude the individual bind mounts or the tmpfs.

A quirk we do have right now is that systemd tries to create a /usr
directory within the chroot, which subsequently fails. Fortunately, this
is just an ugly error and not a hard failure.

The changes also come with a changelog entry for NixOS 19.03, which is
why I asked for a vote of the NixOS 19.03 stable maintainers whether to
include it (I admit it's a bit late a few days before official release,
sorry for that):

  @samueldr:

    Via pull request comment[1]:

      +1 for backporting as this only enhances the feature set of nixos,
      and does not (at a glance) change existing behaviours.

    Via IRC:

      new feature: -1, tests +1, we're at zero, self-contained, with no
      global effects without actively using it, +1, I think it's good

  @lheckemann:

    Via pull request comment[2]:

      I'm neutral on backporting. On the one hand, as @samueldr says,
      this doesn't change any existing functionality. On the other hand,
      it's a new feature and we're well past the feature freeze, which
      AFAIU is intended so that new, potentially buggy features aren't
      introduced in the "stabilisation period". It is a cool feature
      though? :)

A few other people on IRC didn't have opposition either against late
inclusion into NixOS 19.03:

  @edolstra:  "I'm not against it"
  @infinisil: "+1 from me as well"
  @grahamc:   "IMO its up to the RMs"

So that makes +1 from @samueldr, 0 from @lheckemann, 0 from @edolstra
and +1 from @infinisil (even though he's not a release manager) and no
opposition from anyone, which is the reason why I'm merging this right
now.

I also would like to thank @infinisil, @edolstra and @danbst for their
reviews.

[1]: NixOS#57519 (comment)
[2]: NixOS#57519 (comment)
aszlig added a commit that referenced this pull request Mar 29, 2019
Currently if you want to properly chroot a systemd service, you could do
it using BindReadOnlyPaths=/nix/store or use a separate derivation which
gathers the runtime closure of the service you want to chroot. The
former is the easier method and there is also a method directly offered
by systemd, called ProtectSystem, which still leaves the whole store
accessible. The latter however is a bit more involved, because you need
to bind-mount each store path of the runtime closure of the service you
want to chroot.

This can be achieved using pkgs.closureInfo and a small derivation that
packs everything into a systemd unit, which later can be added to
systemd.packages.

However, this process is a bit tedious, so the changes here implement
this in a more generic way.

Now if you want to chroot a systemd service, all you need to do is:

  {
    systemd.services.myservice = {
      description = "My Shiny Service";
      wantedBy = [ "multi-user.target" ];

      confinement.enable = true;
      serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
    };
  }

If more than the dependencies for the ExecStart* and ExecStop* (which
btw. also includes script and {pre,post}Start) need to be in the chroot,
it can be specified using the confinement.packages option. By default
(which uses the full-apivfs confinement mode), a user namespace is set
up as well and /proc, /sys and /dev are mounted appropriately.

In addition - and by default - a /bin/sh executable is provided, which
is useful for most programs that use the system() C library call to
execute commands via shell.

Unfortunately, there are a few limitations at the moment. The first
being that DynamicUser doesn't work in conjunction with tmpfs, because
systemd seems to ignore the TemporaryFileSystem option if DynamicUser is
enabled. I started implementing a workaround to do this, but I decided
to not include it as part of this pull request, because it needs a lot
more testing to ensure it's consistent with the behaviour without
DynamicUser.

The second limitation/issue is that RootDirectoryStartOnly doesn't work
right now, because it only affects the RootDirectory option and doesn't
include/exclude the individual bind mounts or the tmpfs.

A quirk we do have right now is that systemd tries to create a /usr
directory within the chroot, which subsequently fails. Fortunately, this
is just an ugly error and not a hard failure.

The changes also come with a changelog entry for NixOS 19.03, which is
why I asked for a vote of the NixOS 19.03 stable maintainers whether to
include it (I admit it's a bit late a few days before official release,
sorry for that):

  @samueldr:

    Via pull request comment[1]:

      +1 for backporting as this only enhances the feature set of nixos,
      and does not (at a glance) change existing behaviours.

    Via IRC:

      new feature: -1, tests +1, we're at zero, self-contained, with no
      global effects without actively using it, +1, I think it's good

  @lheckemann:

    Via pull request comment[2]:

      I'm neutral on backporting. On the one hand, as @samueldr says,
      this doesn't change any existing functionality. On the other hand,
      it's a new feature and we're well past the feature freeze, which
      AFAIU is intended so that new, potentially buggy features aren't
      introduced in the "stabilisation period". It is a cool feature
      though? :)

A few other people on IRC didn't have opposition either against late
inclusion into NixOS 19.03:

  @edolstra:  "I'm not against it"
  @infinisil: "+1 from me as well"
  @grahamc:   "IMO its up to the RMs"

So that makes +1 from @samueldr, 0 from @lheckemann, 0 from @edolstra
and +1 from @infinisil (even though he's not a release manager) and no
opposition from anyone, which is the reason why I'm merging this right
now.

I also would like to thank @infinisil, @edolstra and @danbst for their
reviews.

[1]: #57519 (comment)
[2]: #57519 (comment)

(cherry picked from commit dcf40f7)
@aszlig aszlig merged commit ada3239 into NixOS:master Mar 29, 2019
@aszlig aszlig deleted the systemd-chroot branch March 29, 2019 04:16
@infinisil
Copy link
Member

I would've squashed a bit, but oh well

aszlig added a commit to aszlig/avonc that referenced this pull request May 24, 2020
The systemd-chroot module here in this repository has already been
upstreamed[1] quite a while ago as "systemd-confinement" with
"confinement" as the option namespace.

Since we no longer want or need to support older releases than NixOS
19.03, we can safely drop the module on our side and rename the options
accordingly.

[1]: NixOS/nixpkgs#57519

Signed-off-by: aszlig <aszlig@nix.build>
@ghost
Copy link

ghost commented Jun 12, 2021

DynamicUser forces ProtectSystem and ProtectHome, which override TemporaryFilesystem.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-policy-regarding-systemd-confinement/18976/1

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/hardening-systemd-services/17147/18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.kind: enhancement Add something new 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: changelog 8.has: documentation This PR adds or changes documentation 8.has: module (new) This PR adds a module in `nixos/` 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants