Skip to content

nixos/systemd|filesystems: mount and evacuate /sys/fs/pstore using systemd-pstore #85073

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from May 16, 2021
Merged

nixos/systemd|filesystems: mount and evacuate /sys/fs/pstore using systemd-pstore #85073

merged 2 commits into from May 16, 2021

Conversation

ghost
Copy link

@ghost ghost commented Apr 12, 2020

Motivation for this change

Currently limited size pstores eventually fill up and logs of any panics stop being saved, making it harder to diagnose them. This enables the systemd-pstore service, which frees them automatically and adds them to the journal.

Design

We can't mount /sys/fs/pstore with the other special file systems, because at that time the kernel hasn't loaded the pstore module yet.
systemd cannot mount virtual file systems that expose APIs, so we can't use a mount unit.
Instead we just create a very simple systemd service that does the job.

On systems without pstore I expect the modprobe service to fail, which doesn't look nice in the journal, but I'm not aware of a good alternative.

Left for another time is making usage of the EFI variable pstore optional, which it currently is not even without this PR (this one just adds the vacuuming instead of only dumping into it).

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

Sorry, something went wrong.

@ofborg ofborg bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 8.has: clean-up 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux and removed 10.rebuild-linux: 1-10 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Apr 12, 2020
@ghost

This comment has been minimized.

@ghost ghost changed the title nixos/systemd: enable systemd-pstore service nixos/filesystems: mount and evacuate /sys/fs/pstore using systemd-pstore Apr 12, 2020
@ghost ghost marked this pull request as ready for review April 12, 2020 20:55
@ghost
Copy link
Author

ghost commented May 27, 2020

For some reason the dmsg's cleaned up by this are not showing up in the journal but only in /var/lib/systemd/pstore; which is the one configuration that's supposedly impossible according to the man page.
My current assumption is that upstream is broken somehow, and be it only in not reporting an error when it fails to write to the journal.

@stale
Copy link

stale bot commented Nov 24, 2020

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 24, 2020
@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Dec 1, 2020
@ofborg ofborg bot added 8.has: changelog 8.has: documentation This PR adds or changes documentation 2.status: merge conflict This PR has merge conflicts with the target branch and removed 2.status: merge conflict This PR has merge conflicts with the target branch labels Dec 1, 2020
@ghost
Copy link
Author

ghost commented Dec 1, 2020

I haven't really figured out what's going on with things not showing up in the journal (and I don't really know enough about the systemd journal to easily do so), but I'm certain enough that it's not an issue caused by this PR that I feel it should be merged.
The situation isn't worsened, it just isn't improved as much as it should be, which could then be a followup issue.

@ghost ghost requested review from Ericson2314, matthewbauer and nbp as code owners December 1, 2020 23:58
@ghost
Copy link
Author

ghost commented Dec 2, 2020

Sorry about the review requests, I had accidentally changed identity of a commit that's part of master when splitting up the PR into commits. EDIT: Did it again. At least now I know why.

@ghost
Copy link
Author

ghost commented Dec 2, 2020

Okay, I'm pretty sure this is the one. I will be staying far away from git and systemd for at least the next month -_-

@ajs124 ajs124 requested a review from flokli December 2, 2020 02:53
@flokli flokli requested a review from puckipedia December 7, 2020 19:14
@flokli
Copy link
Contributor

flokli commented Dec 9, 2020

According to the release notes of systemd 198:

    * The pstore file system is now mounted by default, if it is
     available.

this should already be mounted automatically.

I wonder what's doing this on other distros? Or does it only happen with systemd-in-initrd?

Copy link
Contributor

@flokli flokli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!

in listToAttrs (map formatDevice (filter (fs: fs.autoFormat) fileSystems));
in listToAttrs (map formatDevice (filter (fs: fs.autoFormat) fileSystems)) // {
# Mount /sys/fs/pstore for evacuating panic logs and crashdumps from persistent storage onto the disk using systemd-pstore.
# This cannot be done with the other special filesystems because the pstore module is not loaded at that point.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add another comment line explaining this is usually set up by systemd in initrd, so we know we can remove it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly I don't understand enough about how and why special filesystems are done the way they are in NixOS so I'm not sure what kind of explanatory comment to add that wouldn't be possibly misleading. I think the current comment indicates well enough that it is like the other special filesystems, and save for the mentioned exception it should be handled the same way.

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

After bisecting my system configuration it does look like this PR broke something. The error messages I get from switching are:

warning: the following units failed: mount-pstore.service

● mount-pstore.service
     Loaded: loaded (/nix/store/...-unit-mount-pstore.service/mount-pstore.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2021-05-18 08:48:56 EDT; 1s ago
    Process: 40869 ExecStart=/nix/store/...-util-linux-2.36.2-bin/bin/mount -t pstore -o nosuid,noexec,nodev pstore /sys/fs/pstore (code=exited, status=32)
   Main PID: 40869 (code=exited, status=32)
         IP: 0B in, 0B out
        CPU: 1ms

May 18 08:48:56 albatross systemd[1]: Starting mount-pstore.service...
May 18 08:48:56 albatross mount[40869]: mount: /sys/fs/pstore: pstore already mounted on /sys/fs/pstore.
May 18 08:48:56 albatross systemd[1]: mount-pstore.service: Main process exited, code=exited, status=32/n/a
May 18 08:48:56 albatross systemd[1]: mount-pstore.service: Failed with result 'exit-code'.
May 18 08:48:56 albatross systemd[1]: Failed to start mount-pstore.service.

@ghost
Copy link
Author

ghost commented May 18, 2021

You have a service mounting the persistent store already? Is it to the same place or somewhere else? You can simply set systemd.services.mount-pstore.enable = false to prevent that failure (which doesn't affect anything else - the only thing 'broken' or 'failing' is that very service).

Alternatively you can disable whatever is mounting the persistent store (if feasible, curious what that is) and leave it to this service.

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

Let me see if I can figure out what is mounting pstore. I wasn't even aware of pstore until this morning!

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

Interesting, it doesn't look like anything depends on it, at least from the systemd perspective:

❯ sudo systemctl list-dependencies --reverse mount-pstore.service
mount-pstore.service

@ghost
Copy link
Author

ghost commented May 18, 2021

The only other thing that could be going wrong is that the ExecStartPost stanza is failing (it really shouldn't be able to!) and what you're seeing is the effect of systemd trying to restart the service. Can you make sure you get us logs from all attempts to start the service?

@ghost
Copy link
Author

ghost commented May 18, 2021

it doesn't look like anything depends on it, at least from the systemd perspective

That really shouldn't be. If nothing depends on it, how did it get started??

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

@hyperfekt Is there a way to use nix why-depends here to discover dependents of mount-pstore.service?

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

Ok, after setting

systemd.services.mount-pstore.enable = false

and deploying, and then setting it back to true (and deploying) I cannot reproduce this anymore.

A couple of things I noticed:

  1. The permissions of the existing /sys/fs/pstore mount were different from the permissions of the same mount after deploying from this PR's commit. The original permissions were: drwxr-xr-x, while the new permissions are drwxr-x---.
  2. The result of the command I ran above to show dependents of mount-pstore.service is different upon deploying from this commit:
❯ sudo systemctl list-dependencies --reverse mount-pstore.service
mount-pstore.service
● └─systemd-pstore.service

@wizardwatch
Copy link
Contributor

I also experience the same bug where mount-pstore fails to start. I have not attempted to fix it yet.

@zhaofengli
Copy link
Member

I encountered the same problem as well on my machines. If I'm understanding correctly, it appears that systemd mounts pstore automatically.

@ymatsiuk
Copy link
Contributor

Seems like systemd added support for pstore few months after this PR has been submitted. @hyperfekt do you still think this PR is required?

iagocq added a commit to iagocq/nixos-config that referenced this pull request May 21, 2021
@superherointj
Copy link
Contributor

superherointj commented May 21, 2021

It affected me on unstable and master. On nixos-rebuild switch:

● mount-pstore.service
     Loaded: loaded (/nix/store/rszfkvs6jjz6xi2016n48aay23a5n75m-unit-mount-pstore.service/mount-pstore.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2021-05-21 07:28:55 -03; 155ms ago
    Process: 77852 ExecStart=/nix/store/5kw5x24rbwv8gp2mgr75vd0i4nkp2zvz-util-linux-2.36.2-bin/bin/mount -t pstore -o nosuid,noexec,nodev pstore /sys/fs/pstore (code=exited, status=32)
   Main PID: 77852 (code=exited, status=32)
         IP: 0B in, 0B out
        CPU: 10ms

mai 21 07:28:55 k1 systemd[1]: Starting mount-pstore.service...
mai 21 07:28:55 k1 mount[77852]: mount: /sys/fs/pstore: pstore already mounted on /sys/fs/pstore.
mai 21 07:28:55 k1 systemd[1]: mount-pstore.service: Main process exited, code=exited, status=32/n/a
mai 21 07:28:55 k1 systemd[1]: mount-pstore.service: Failed with result 'exit-code'.
mai 21 07:28:55 k1 systemd[1]: Failed to start mount-pstore.service.
warning: error(s) occurred while switching to the new configuration

Tried workaround: systemd.services.mount-pstore.enable = false;
It silenced error messages but on reactivation it reappeared.

Shouldn't a new issue be opened to track this? This PR is merged and we have an issue.

ymatsiuk added a commit to ymatsiuk/nixpkgs that referenced this pull request May 21, 2021
@ymatsiuk ymatsiuk mentioned this pull request May 21, 2021
9 tasks
@ymatsiuk
Copy link
Contributor

In case decision is made to revert this, I've created this #123889

@ghost
Copy link
Author

ghost commented May 21, 2021

I'm currently investigating this.
It definitely isn't because systemd is already mounting pstore, because that line has not just been part of systemd for 8 years, it also doesn't have an effect on NixOS - which is why we have a separate stanza in one of those files for reimplementing that part of systemd in NixOS, from which /sys/fs/pstore is missing.
On my system, /sys/fs/pstore is not mounted without this PR, and I am now upgrading it to unstable to see if something goes wrong specifically during activation (attempting to start it twice maybe?).
If that's not the case, something else must be mounting /sys/fs/pstore and we need to find out what it is, because it's only present for a subset of users. One way to do so would be to start running auditd early in the boot.

@superherointj
Copy link
Contributor

On my system, /sys/fs/pstore is not mounted without this PR, and I am now upgrading it to unstable to see if something goes wrong specifically during activation (attempting to start it twice maybe?).

As your PR targets master, you should be testing it/your system on master.

@ghost
Copy link
Author

ghost commented May 21, 2021

I agree, I probably should have tested the activation itself in a VM instead of relying on the ISO (where no switch occurs) to show any problem.

@ymatsiuk
Copy link
Contributor

I can't reproduce this in the VM (pstore is not present there). And I fixed it just by umount and running rebuild again. Now everything works. So I would expect we could've avoided this in the future with simple check in ExecStartPre that does umount in case it is mounted (for the specific use case when someone/something did mount pstore)

@ghost
Copy link
Author

ghost commented May 21, 2021

It's been common in the past for some units to fail upon activation after a nixos-rebuild switch, which is why I wasn't concerned about avoiding every instance of it. Of course it shouldn't be happening to every user of NixOS, which it appears this does.
I've tried this out and while I can reproduce it happening when switching, it also isn't mounted without the service, and I can't see any hint as to why it might already be. On a subsequent switch that does nothing but enable the unit, no such failure is observed. Unfortunately I can't seem to get auditd to pick up the mount syscalls, so I'm currently a bit stumped as to why it happens. I will however open a PR to introduce a condition that checks if the persistent store is already mounted at that location, so noone else who upgrades in the future sees the associated noise (and has to wonder whether their system is now broken).

@zhaofengli
Copy link
Member

For me, the failure also occurs on boot, so it's not just happening on a switch for "people with existing /sys/fs/pstore".

@superherointj
Copy link
Contributor

Wouldn't be better to reverse it? So you can calmly investigate it?

@ghost
Copy link
Author

ghost commented May 21, 2021

I don't see any reason to revert it if the issue is solved by the new PR. In general no breakage is involved, just log noise. And if there is a solution necessary as a result of the investigation, it would be additive, not alternative.

@superherointj
Copy link
Contributor

On the workaround that umount in case it is mounted, it seems to be dealing with symptoms rather than root cause.

@ghost
Copy link
Author

ghost commented May 21, 2021

If you check #123902 you will see that's not what I did.

@ghost
Copy link
Author

ghost commented May 21, 2021

For me, the failure also occurs on boot

@zhaofengli If you find out what is responsible for having /sys/fs/pstore mounted the first time, I'd be glad. I am still curious about any instance where this is the case and it's from a package instead of an individual modification.

@zhaofengli
Copy link
Member

Finally have some time to poke at this. It appears that /sys/fs/pstore will automatically be mounted if pstore is built into the kernel (CONFIG_PSTORE=y) or it's loaded prior to /sys being mounted the first time (in stage 1, boot.kernelModules are loaded after the special filesystems are mounted).

The machine I initially encountered the on-boot failure is an aarch64 host running a custom kernel with CONFIG_PSTORE=y. With mount-pstore disabled:

[root@serotina:~]# uname -a
Linux serotina 5.12.5 #1-NixOS SMP Tue Jan 1 00:00:00 UTC 1980 aarch64 GNU/Linux

[root@serotina:~]# systemctl status mount-pstore
● mount-pstore.service
     Loaded: masked (Reason: Unit mount-pstore.service is masked.)
     Active: inactive (dead)

[root@serotina:~]# mount | grep pstore
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)

[root@serotina:~]# zcat /proc/config.gz | grep CONFIG_PSTORE=
CONFIG_PSTORE=y

On the regular kernel, pstore is built as a module:

[root@malus:~]# uname -a
Linux malus 5.10.37 #1-NixOS SMP Fri May 14 07:50:46 UTC 2021 x86_64 GNU/Linux

[root@malus:~]# systemctl status mount-pstore
● mount-pstore.service
     Loaded: masked (Reason: Unit mount-pstore.service is masked.)
     Active: inactive (dead)

[root@malus:~]# mount | grep pstore

[root@malus:~]# zcat /proc/config.gz | grep CONFIG_PSTORE=
CONFIG_PSTORE=m

[root@malus:~]# lsmod | grep pstore
efi_pstore             16384  0
pstore                 28672  1 efi_pstore

On a machine with CONFIG_PSTORE=y, I used an audit rule to trace all mount syscalls:

Audit configurations
{
  boot.kernelParams = [ "audit=1" "audit_backlog_limit=500" ];
  security.audit = {
    enable = true;
    rules = [
      "-a exit,always -S mount"
    ];
  };
  security.auditd.enable = true;
  # Add rule before systemd starts
  boot.initrd.postDeviceCommands = ''
    ${pkgs.audit}/bin/auditctl -a exit,always -S mount
  '';
}

Looking at the audit logs I couldn't find a call that specifically mounted /sys/fs/pstore so I don't think it's coming from userspace. If I modify stage 1 and nixos/modules/tasks/filesystems.nix to mount /sys after boot.kernelModules are loaded, /sys/fs/pstore is automatically mounted.

However, this doesn't explain why the initial activation will cause the failure. Remounting /sys after pstore is loaded doesn't appear to automatically mount /sys/fs/pstore, so we're missing something here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: changelog 8.has: clean-up 8.has: documentation This PR adds or changes documentation 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants