Do not kill udev during boot #40230

Merged: 2 commits into NixOS:master on May 11, 2018

Conversation

@ngortheone (Contributor) commented May 9, 2018

Fixes #39867

The main culprit was udevadm control --exit || true during the initrd stage. I don't know what that was originally meant to fix, but it is definitely outdated, and in general killing udev is not a good idea.
What was happening: when the NVMe device did not appear immediately and udev had already been killed, no symlinks were created under /dev/disk/by-*, which made it impossible to mount the root volume.

NVMe devices may appear with some delay, so I also had to add a waitDevice call to the growPartition script, to ensure that the root volume is resized even when the device shows up late.
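For illustration, here is a minimal sketch of the wait-and-retry idea, assuming a hypothetical waitForDevice helper and an example NVMe device path; it is not the exact code added by this PR:

# Hypothetical sketch, not the real nixpkgs stage-1 helper.
waitForDevice() {
    device="$1"
    tries=20
    while [ ! -e "$device" ] && [ "$tries" -gt 0 ]; do
        echo "waiting for $device to appear..."
        sleep 1
        tries=$((tries - 1))
    done
    [ -e "$device" ]  # non-zero exit status if the device never appeared
}

# Example: only resize the (example) root partition once its device node exists.
waitForDevice /dev/nvme0n1p1 && growpart /dev/nvme0n1 1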

Tested on m5 and m4 instances.

[ $try -ne 0 ]
fi
}

Contributor

This file looks like you just moved waitDevice() around without making any real change. What's the point?

Contributor Author

Yes, I needed it to be declared before @postDeviceCommands@ so I can use the function in growPartition.
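A minimal illustration of that ordering constraint, with placeholder commands rather than the real template contents: the shell only knows about a function once its definition has been read, so the definition has to appear above the point where @postDeviceCommands@ is spliced in.

#!/bin/sh
# Toy example only; not the actual stage-1 template.
waitDevice() {
    while [ ! -e "$1" ]; do sleep 1; done
}

# @postDeviceCommands@ is substituted roughly here, so it may now call waitDevice:
waitDevice /dev/null   # /dev/null always exists, so this returns immediately
echo "device present, continuing boot steps"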

@ngortheone (Contributor Author)

@edolstra @dezgeg

@dezgeg (Contributor) commented May 9, 2018

Looks good.

FWIW, the original problem is probably related to udev holding the block device open while probing filesystems on it (as described at https://groups.google.com/forum/#!topic/scylladb-dev/u87yHgo3ylU).

@copumpkin (Member)

@dezgeg so is not killing udev likely to bite us in other subtle ways later? It definitely seemed to be hurting our NVMe discovery on EC2, but it's unclear to me how to make sure the other issue doesn't pop up again. We already ask it to settle at various points along the way.

@dezgeg (Contributor) commented May 9, 2018

Yeah, the settle calls that are already in place are probably enough now.

@xeji (Contributor) commented May 9, 2018

Does this need a backport to 18.03 as well?

@edolstra (Member)

Looks good to me.

I would hold off on backporting to 18.03 until we feel confident that the race with udev that prompted the addition of the udevadm control --exit is really gone.

@dezgeg (Contributor) commented May 10, 2018

I would think the udev/blkid race could be avoided by wrapping certain parts in udevadm control --stop-exec-queue and udevadm control --start-exec-queue, as was proposed in the scylladb thread (and in fact that would be the only way to fix the similar races that sometimes happen in NixOS installer tests).
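An illustrative sketch of that pattern, reusing the parted command from the failing test quoted below; this is not code from this PR:

# Pause udev rule execution while the partition table is rewritten.
udevadm control --stop-exec-queue
parted --script /dev/vda -- mkpart primary linux-swap 50M 1024M
# Resume processing and wait for the queued events (blkid probing etc.) to finish.
udevadm control --start-exec-queue
udevadm settle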

@dezgeg merged commit 08ebd83 into NixOS:master on May 11, 2018
@dezgeg (Contributor) commented May 12, 2018

For reference (to confirm that I am not talking crazy stuff :), these are the occasional failures I see in the installer tests which I believe to be related to the udev race: https://nix-cache.s3.amazonaws.com/log/rwfcwv4p18sbbnd7pna2r2m9aqz7s2m5-vm-test-run-installer-luksroot.drv

machine: must succeed: parted --script /dev/vda -- mkpart primary linux-swap 50M 1024M
machine# [   35.973121] systemd[1]: Started Networking Setup.
machine# [   35.978131] systemd[1]: Starting Extra networking commands....
machine# [   36.549789]  vda: vda1 vda2
machine# Error: Partition(s) 2 on /dev/vda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.


Successfully merging this pull request may close these issues: AWS KVM instances fail to boot because NVMe device doesn't appear (#39867)