nixos/rasdaemon: init with 0.6.7 #85039

Merged
merged 6 commits into from Oct 31, 2021

Conversation

evils
Member

@evils evils commented Apr 11, 2020

Motivation for this change

closes #42592
based on #73149

Things done

packaged rasdaemon
wrote rasdaemon module
wrote rasdaemon module test

  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
    • /nix/store/05rr2kjhn6dim1cfb1lbhxyjy5wj46pm-rasdaemon-0.6.6-21-gb4764d4 99608016
    • /nix/store/kshci3w68i6w8wcs3s0zxm366id7c2i9-rasdaemon-0.6.6-21-gb4764d4-dev 99675864
    • /nix/store/ni6m29zk553d32icvb090k512dg15yn1-rasdaemon-0.6.6-21-gb4764d4-man 4280
    • /nix/store/ngya58z9dmkzwba4vw0jxgm039565yl9-rasdaemon-0.6.6-21-gb4764d4-inject 34278856
    • /nix/store/p9xlqzy0sk1szk3rakxc8vlj8q16inqi-aer-inject-9bd5e2c 32976656
    • /nix/store/kpkq3h16m9mblcs62ncwyq7ansmrl4xz-mce-inject-4cbe463 32983512
    • /nix/store/hg0pjs961ha70s2pjhj9ssxxx21r4giz-vm-tools-5.10.30 33091824
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.
Status

help wanted

  • the rasdaemon package and its module work
    • accept DIMM labels lines as rasdaemon.labels = " ";
      • accept multiple inputs like rasdaemon.labels.vendor = " "; as well
    • accept mainboard definition
  • testing (can't do so in a VM...)
    • error injection
      • edac, aer, mce error injection tools packaged
        • test their functionality (can't test mce-inject on my AMD system)
      • mce-test packaging (translating this to something that works with nix looks tricky)
        • get the source and dependencies
          • vm-tools and mce-inject packaged
      • set up kernel with required configuration
        • determine what config options are needed
          • aer-inject: "Requires a new Linux kernel with PCIE AER error injection patches."
            • TBD where to get those
          • mce-inject: "Requires a Linux 2.6.31+ kernel with CONFIG_X86_MCE_INJECT enabled and the mce-inject module loaded (if not built in)"
          • edac-inject: "CONFIG_EDAC_DEBUG and a running EDAC driver"
          • mce-test: at least: "a kernel with CONFIG_X86_MCE_INJECT and CONFIG_HWPOISON_INJECT and soft-offlining support"
    • write nixos test
      • only for the module, no error injection (yet)
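As a rough aid for the "determine what config options are needed" item, the sketch below checks the text of a kernel config for the options quoted in the list above. The `REQUIRED` table and both function names are made up for illustration, and aer-inject is left out because its required patches are still TBD. It assumes the usual `CONFIG_FOO=y` / `# CONFIG_FOO is not set` config format.

```python
# Options quoted in the status list; aer-inject is omitted (patches TBD).
REQUIRED = {
    "mce-inject": ["CONFIG_X86_MCE_INJECT"],
    "edac-inject": ["CONFIG_EDAC_DEBUG"],
    "mce-test": ["CONFIG_X86_MCE_INJECT", "CONFIG_HWPOISON_INJECT"],
}


def enabled_options(config_text):
    """Return the set of options that are built in (=y) or modular (=m)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled.add(name)
    return enabled


def missing_options(config_text):
    """Map each tool to the required options the config does not enable."""
    enabled = enabled_options(config_text)
    missing = {}
    for tool, options in REQUIRED.items():
        absent = [opt for opt in options if opt not in enabled]
        if absent:
            missing[tool] = absent
    return missing
```

Feeding it the running kernel's config text would give a per-tool list of what still has to be enabled before the injection tests can run.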

@amaxine
Member

amaxine commented Apr 12, 2020

determine what config options are needed

CONFIG_EDAC_DEBUG is needed for testing.

DIMM labels

The format is:

Vendor: <vendor-name>
  Product: <product-name>
  Model: <model-name>
    <label>:  <mc>.<top>.<mid>.<low>

Have never actually bothered decoding these, but a rough example would be:

Vendor: Dell Inc.
  Product: PowerEdge R740
  Model: 08D89F
    DIMM_A1:  0.0.0; DIMM_A2:  0.1.0;

Those labels are nonsensical, but they do provide output:

ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0 channel 0 slot 0                DIMM_A1              CPU_SrcID#0_MC#0_Chan#0_DIMM#0
mc0 channel 1 slot 0                DIMM_A2              CPU_SrcID#0_MC#0_Chan#1_DIMM#0

Both Product and Model can be a comma delimited list.
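To make the layout concrete, here is a small parser sketch for the format described above. It is not how ras-mc-ctl reads the file (I haven't checked that code); it assumes each `Vendor:` line starts a new entry, that label lines hold `;`-separated `label: location` pairs, and that locations are dot-separated integers. `Board` and `parse_dimm_labels` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Board:
    vendor: str
    products: list = field(default_factory=list)
    models: list = field(default_factory=list)
    labels: dict = field(default_factory=dict)  # label -> tuple of ints


def _split_list(value):
    # Product and Model may be comma-delimited lists.
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_dimm_labels(text):
    boards = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        if key == "Vendor":
            boards.append(Board(vendor=value.strip()))
        elif not boards:
            raise ValueError(f"entry before any Vendor line: {line!r}")
        elif key == "Product":
            boards[-1].products.extend(_split_list(value))
        elif key == "Model":
            boards[-1].models.extend(_split_list(value))
        else:
            # Label line, e.g. "DIMM_A1:  0.0.0; DIMM_A2:  0.1.0;"
            for entry in line.split(";"):
                if not entry.strip():
                    continue
                label, _, location = entry.partition(":")
                numbers = tuple(int(p) for p in location.strip().split("."))
                boards[-1].labels[label.strip()] = numbers
    return boards


EXAMPLE = """\
Vendor: Dell Inc.
  Product: PowerEdge R740
  Model: 08D89F
    DIMM_A1:  0.0.0; DIMM_A2:  0.1.0;
"""
```

Running `parse_dimm_labels(EXAMPLE)` yields one entry whose labels line up with the `--print-labels` output: `DIMM_A1` at mc 0, channel 0, slot 0 and `DIMM_A2` at mc 0, channel 1, slot 0.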

I hope this helps a bit.

@evils
Member Author

evils commented Apr 12, 2020

determine what config options are needed

CONFIG_EDAC_DEBUG is needed for testing.

that's mostly a note to myself, the dependencies need some config too and i should summarize them for when i get around to writing the nixos test

DIMM labels

The format is:

I wasn't sure if the content at /etc/ras/dimm_labels.d/ was indeed that of their example labels or some generated output because the example ones don't get installed (at least on debian)

I'll probably link to the dell file and any user supplied entries in /etc/ras/dimm_labels.d/
and to a mainboard definition in /etc/ras/mainboard, which unlike what the man page says, seems to be located there
(everything related to labels and mainboard seems to be reused from edac-utils, hence the remnant /etc/edac/ from edac-ctl(8))

@maxeaubrey thanks for the --print-labels output, it clarifies what the "label" numbers mean, could you show me the output of sudo ras-mc-ctl --guess-labels?

@amaxine
Member

amaxine commented Apr 13, 2020

The provided examples are not useful without models (or a mainboard file to overwrite the name, I guess? Haven't actually double checked the code, or verified how that works in practice), which is why I assume they do not get installed - they do not work as is in my experience at least.

I've not seen --guess-labels work frequently, if at all, but here's a Dell example:

memory stick 'A1' is located at 'Not Specified'            
memory stick 'A2' is located at 'Not Specified'            
memory stick 'A3' is located at 'Not Specified'                                
memory stick 'A4' is located at 'Not Specified'            
memory stick 'A5' is located at 'Not Specified'            
memory stick 'A6' is located at 'Not Specified'               
memory stick 'A7' is located at 'Not Specified'           
memory stick 'A8' is located at 'Not Specified'           
memory stick 'A9' is located at 'Not Specified'           
memory stick 'A10' is located at 'Not Specified'              
memory stick 'A11' is located at 'Not Specified'           
memory stick 'A12' is located at 'Not Specified'                         
memory stick 'B1' is located at 'Not Specified'                        
memory stick 'B2' is located at 'Not Specified'                                
memory stick 'B3' is located at 'Not Specified'            
memory stick 'B4' is located at 'Not Specified'            
memory stick 'B5' is located at 'Not Specified'               
memory stick 'B6' is located at 'Not Specified'            
memory stick 'B7' is located at 'Not Specified'            
memory stick 'B8' is located at 'Not Specified'            
memory stick 'B9' is located at 'Not Specified'            
memory stick 'B10' is located at 'Not Specified'
memory stick 'B11' is located at 'Not Specified'
memory stick 'B12' is located at 'Not Specified'

And here's a somewhat more useful Supermicro example.

memory stick 'P1_DIMMA1' is located at 'P0_Node0_Channel0_Dimm0'
memory stick 'NO DIMM' is located at 'NO DIMM'
memory stick 'P1_DIMMB1' is located at 'P0_Node0_Channel1_Dimm0'
memory stick 'NO DIMM' is located at 'NO DIMM'
memory stick 'P1_DIMMC1' is located at 'P0_Node0_Channel2_Dimm0'
memory stick 'NO DIMM' is located at 'NO DIMM'
memory stick 'NO DIMM' is located at 'NO DIMM'
memory stick 'NO DIMM' is located at 'NO DIMM'
memory stick 'P2_DIMME1' is located at 'P1_Node1_Channel0_Dimm0'
memory stick 'NO DIMM' is located at 'NO DIMM'
memory stick 'P2_DIMMF1' is located at 'P1_Node1_Channel1_Dimm0'
memory stick 'NO DIMM' is located at 'NO DIMM'
memory stick 'P2_DIMMG1' is located at 'P1_Node1_Channel2_Dimm0'
memory stick 'NO DIMM' is located at 'NO DIMM'
memory stick 'NO DIMM' is located at 'NO DIMM'
memory stick 'NO DIMM' is located at 'NO DIMM'

I've not seen Dell/HP servers return more useful info than that, though that's with the caveat of only trying a limited set of Dell and HP servers.
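For anyone wanting to turn output like the above into label candidates, a minimal sketch follows. It only relies on the line shape visible in these two samples (`memory stick '<label>' is located at '<location>'`); treating `NO DIMM` as an empty slot and `Not Specified` as "no usable location" is my reading of the samples, and `parse_guess_labels` is a hypothetical helper, not part of rasdaemon.

```python
import re

# Matches lines such as:
#   memory stick 'P1_DIMMA1' is located at 'P0_Node0_Channel0_Dimm0'
LINE_RE = re.compile(
    r"memory stick '(?P<label>[^']*)' is located at '(?P<location>[^']*)'"
)


def parse_guess_labels(output):
    """Split guessed labels into those with and without a usable location."""
    located = {}    # label -> location string
    unlocated = []  # labels reported with 'Not Specified'
    for line in output.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        label, location = match["label"], match["location"]
        if label == "NO DIMM":
            continue  # assumed to be an empty slot
        if location == "Not Specified":
            unlocated.append(label)
        else:
            located[label] = location
    return located, unlocated
```

On the Dell sample every stick ends up in `unlocated`, while the Supermicro sample gives six `located` entries, which matches the "somewhat more useful" impression.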

@evils
Member Author

evils commented Apr 23, 2020

i consider the state of rasdaemon and nixos/rasdaemon "minimally viable"
if anyone wants to merge it i'm willing to separate the testing commits to a separate PR

@stale

stale bot commented Nov 3, 2020

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 3, 2020
@Luflosi
Contributor

Luflosi commented Jan 30, 2021

What's missing for this to get merged?

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 30, 2021
@evils
Member Author

evils commented Jan 30, 2021

for just nixos/rasdaemon, i think just some minor cleanup to get the tests out of the branch
and i'd have to switch it to lib instead of using stdenv.lib
(and i've got a local change to github for the source since there was an outage at infradead at some point)

for the tests, i'm not sure, i'll see about that

@evils evils force-pushed the rasdaemon branch 2 times, most recently from 61c34b5 to b73c012 Compare February 2, 2021 01:44
@evils
Member Author

evils commented Feb 2, 2021

changed almost everything a bit

switched rasdaemon to v0.6.6-18-gc329012
v0.6.6 introduced Corrected Error Preemptive Failure Analysis, a feature to disable memory pages with recurring errors
the configuration for when a page is disabled is set via /etc/sysconfig/rasdaemon which is hard coded in the latest release
therefore, moved to latest master commit (though it's an intermediate commit that makes this configurable)

edac and aer tests look like they may work, except the machine features required for them are not present in the nixosTest VM

the mce test seems to assume mcelog is being used,
mce-test may be applicable for this application, unfortunately the package doesn't yield an output
(compilation failure without nix build failure...)
i don't recall this happening the last time i looked at it...

i'll try to run the tests on bare metal, though that may take me a while to get around to
if anyone knows how to get them working in a VM, that'd be nice.

@evils
Member Author

evils commented Oct 7, 2021

added release notes entry to the module init commit
missed because this PR predates the current CONTRIBUTING.md

@evils
Member Author

evils commented Oct 7, 2021

rebased on master where literalExample is deprecated (and switched to using literalExpression instead)
(also removed a trailing semicolon in the extraModules example)

@tomberek
Contributor

tomberek commented Oct 8, 2021

@SuperSandro2000 are the style suggestions here a blocking issue?

Often Sandro's reviews are for style and consistency, but not strictly blocking. Merging of a new module usually requires a more functional review and someone familiar with the service or software in question.

@tomberek
Contributor

tomberek commented Oct 8, 2021

FWIW: software builds and runs. Test has the following output:

vm-test-run-rasdaemon> machine # [    6.275189] rasdaemon[892]: rasdaemon: ras:mc_event event disabled
vm-test-run-rasdaemon> machine # [    6.278076] rasdaemon[892]: rasdaemon: ras:aer_event event disabled
vm-test-run-rasdaemon> machine # [    6.280894] rasdaemon[892]: rasdaemon: mce:mce_record event disabled
vm-test-run-rasdaemon> machine # [    6.281696] rasdaemon[892]: rasdaemon: ras:extlog_mem_event event disabled
vm-test-run-rasdaemon> machine # [    6.285062] rasdaemon[892]: rasdaemon: ras:non_standard_event event disabled
vm-test-run-rasdaemon> machine # [    6.287613] rasdaemon[892]: rasdaemon: devlink:devlink_health_report event disabled
vm-test-run-rasdaemon> machine # [    6.290071] rasdaemon[892]: rasdaemon: block:block_rq_complete event disabled
vm-test-run-rasdaemon> machine # [    6.291779] rasdaemon[892]: rasdaemon: Can't write to set_event
vm-test-run-rasdaemon> machine # [    6.293090] dhcpcd[878]: eth0: deleting route to 10.0.2.0/24
vm-test-run-rasdaemon> machine # [    6.295242] systemd[1]: Stopping the RAS logging daemon...
vm-test-run-rasdaemon> machine # [    6.296357] rasdaemon[671]: rasdaemon: Recevied signal=15
vm-test-run-rasdaemon> machine # [    6.297358] rasdaemon[671]: overriding event (1180) ras:mc_event with new print handler
vm-test-run-rasdaemon> machine # [    6.298382] rasdaemon[671]: overriding event (1177) ras:aer_event with new print handler
vm-test-run-rasdaemon> machine # [    6.299379] rasdaemon[671]: overriding event (1178) ras:non_standard_event with new print handler
vm-test-run-rasdaemon> machine # [    6.299974] rasdaemon[671]: overriding event (110) mce:mce_record with new print handler
vm-test-run-rasdaemon> machine # [    6.300584] rasdaemon[671]: overriding event (1181) ras:extlog_mem_event with new print handler
vm-test-run-rasdaemon> machine # [    6.301494] rasdaemon[671]: overriding event (1265) net:net_dev_xmit_timeout with new print handler
vm-test-run-rasdaemon> machine # [    6.302305] rasdaemon[671]: overriding event (1274) devlink:devlink_health_report with new print handler
vm-test-run-rasdaemon> machine # [    6.302938] rasdaemon[671]: overriding event (1039) block:block_rq_complete with new print handler
vm-test-run-rasdaemon> machine # [    6.303781] rasdaemon[671]: Calling ras_mc_event_opendb()
vm-test-run-rasdaemon> machine # [    6.304270] rasdaemon[671]: Calling ras_mc_event_closedb()
vm-test-run-rasdaemon> machine # [    6.305084] ahx2l32ys40zhm4v6aca3k2mzkzi2p1i-audit-stop[894]: No rules
vm-test-run-rasdaemon> machine # [    6.305827] dhcpcd[878]: eth0: deleting default route via 10.0.2.2
vm-test-run-rasdaemon> machine # [    6.306822] systemd[1]: reload-systemd-vconsole-setup.service: Deactivated successfully.
vm-test-run-rasdaemon> machine # [    6.307859] ahx2l32ys40zhm4v6aca3k2mzkzi2p1i-audit-stop[900]: enabled 0
vm-test-run-rasdaemon> machine # [    6.308417] ahx2l32ys40zhm4v6aca3k2mzkzi2p1i-audit-stop[900]: failure 1
vm-test-run-rasdaemon> machine # [    6.308972] ahx2l32ys40zhm4v6aca3k2mzkzi2p1i-audit-stop[900]: pid 0
vm-test-run-rasdaemon> machine # [    6.309513] ahx2l32ys40zhm4v6aca3k2mzkzi2p1i-audit-stop[900]: rate_limit 0
vm-test-run-rasdaemon> machine # [    6.310129] ahx2l32ys40zhm4v6aca3k2mzkzi2p1i-audit-stop[900]: backlog_limit 64
vm-test-run-rasdaemon> machine # [    6.310790] ahx2l32ys40zhm4v6aca3k2mzkzi2p1i-audit-stop[900]: lost 0
vm-test-run-rasdaemon> machine # [    6.311399] ahx2l32ys40zhm4v6aca3k2mzkzi2p1i-audit-stop[900]: backlog 0
vm-test-run-rasdaemon> machine # [    6.311963] ahx2l32ys40zhm4v6aca3k2mzkzi2p1i-audit-stop[900]: backlog_wait_time 60000
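To see at a glance which tracepoints the daemon could not enable in a log like the one above, the lines can be filtered with a short script. This is only a sketch keyed to the `rasdaemon: <subsystem>:<event> event disabled` wording visible in this output; `disabled_events` is a hypothetical helper and other rasdaemon versions may word the message differently.

```python
import re

# Matches e.g. "rasdaemon: ras:mc_event event disabled"
DISABLED_RE = re.compile(r"rasdaemon: (?P<event>\S+:\S+) event disabled")


def disabled_events(log):
    """Return the events reported as disabled, in log order."""
    return [match["event"] for match in DISABLED_RE.finditer(log)]
```

Applied to the test output, this lists events such as `ras:mc_event` and `devlink:devlink_health_report`, and ignores unrelated lines like the dhcpcd and audit-stop messages.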

@evils
Member Author

evils commented Oct 8, 2021

rebased onto master due to release notes merge conflict
for the sake of easing future rebases i've separated the release notes update into its own commit

@evils
Member Author

evils commented Oct 17, 2021

fixed typo in the ras-mc-ctl service (ExectStart -> ExecStart)
rebased onto latest nixos-unstable
resolved another release notes conflict

@roberth roberth mentioned this pull request Oct 19, 2021
Contributor

@danderson danderson left a comment

Testing tonight (+8-10h from now) on a couple of Xeon systems and a 5th gen Ryzen. Thanks for putting this together!

nixos/modules/services/hardware/rasdaemon.nix
# edac_core and amd64_edac should get loaded automatically
# i7core_edac may not be, and may not be required, but should load successfully
"edac_core"
"amd64_edac"
Contributor

amd64_edac here makes initrd building fail for me when I set hardware.rasdaemon.testing = true. AFAICT, CONFIG_EDAC_AMD64 isn't enabled on the standard NixOS kernels, so initrd building fails when it tries to incorporate it. Tried with both a 4.19 and a 5.10 kernel, both failed in the same way.

Doing another build run now with a kernel patch to enable EDAC_AMD64, to see if that fixes it.

Contributor

Doesn't seem to have fixed it :( AFAICT amd64_edac just doesn't get built for me, and as a result building the initrd fails down the road.

Member Author

odd, besides my generated hardware-configuration.nix having kvm-amd, i don't think there's anything AMD specific in my config (nixos-unstable on ryzen)

Contributor

If hydra doesn't complain while building, I think it's fine to assume my system is wonky somehow. This is only relevant when building for fault injection testing, so, meh.

Member Author

hmm, at some point i had to change the module name from amd64_edac_mod to amd64_edac
i'm not sure when/where that changed, but maybe it's still the old name for you?

@danderson
Contributor

danderson commented Oct 20, 2021

Checked linux 4.19, 5.10 and 5.14:

  • devlink works on 5.10 and 5.14, the subsystem doesn't exist in 4.x.
  • memory_failure_event doesn't work on any of the three, afaict because CONFIG_MEMORY_FAILURE isn't enabled in the NixOS kernel config. Despite the name, this isn't logging events related to ECC errors, but events related to the kernel trying to survive uncorrectable ECC errors (e.g. by realizing the error is in a harmless buffer cache page it can dump and quarantine).

In both cases, the errors are benign: rasdaemon is correctly reporting that some kernels are missing some features that it could monitor, if you enabled them. But it keeps logging everything else it can.

So, LGTM on my machines. It can see a whole bunch of hardware stuff, maps my DIMM layout correctly, and generally works as well as the running kernel allows it to. I look forward to seeing this merged so I can enable it on my servers :)

@evils
Member Author

evils commented Oct 29, 2021

resolved merge conflict due to release notes

Contributor

@Luflosi Luflosi left a comment

I have been using this module for over half a year and it LGTM.

@cyplo
Contributor

cyplo commented Oct 30, 2021

I think it needs to be @SuperSandro2000 to approve, as they requested changes before?

@SuperSandro2000
Member

I think it needs to be @SuperSandro2000 to approve as they requested changes before ?

We don't handle it like that. I don't always have the time to double check whether my suggestions were applied. If someone else thinks the PR is in a good state and can be merged, he/she can go ahead and merge it.

@tomberek tomberek merged commit 27ba20d into NixOS:master Oct 31, 2021
Development

Successfully merging this pull request may close these issues.

Package request: Rasdaemon
8 participants