nixos/tests/installer: prevent race between parted and udev #47155

Merged: 3 commits merged on Sep 24, 2018

Conversation

xeji
Contributor

@xeji xeji commented Sep 21, 2018

Motivation for this change

Our installer tests still fail non-deterministically on Hydra (18.09 and master), which often delays channel updates.

One typical symptom of non-deterministic failure seems to be this:

Successful example:

machine: must succeed: parted --script /dev/vda -- mkpart primary linux-swap 1M 1024M
machine# [    9.146921]  vda:
machine: exit status 0
machine: must succeed: parted --script /dev/vda -- mkpart primary ext2 1024M -1s
machine: exit status 0

Failed example:

machine: must succeed: parted --script /dev/vda -- mkpart primary linux-swap 1M 1024M
machine# [   10.560431]  vda: vda1
machine# Error: Partition(s) 1 on /dev/vda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
machine# [   10.572443]  vda: vda1
machine: exit status 1
machine: output: 
error: command `parted --script /dev/vda -- mkpart primary linux-swap 1M 1024M' did not succeed (exit code 1)
command `parted --script /dev/vda -- mkpart primary linux-swap 1M 1024M' did not succeed (exit code 1)

In the failed example, it looks like udev picks up the new partition before parted gets to tell the kernel about it. I found some reports that confirm a race between udev and parted, like this one. The race can occur between separate parted calls, so the recommendation is to combine multiple parted commands into a single parted call, which creates the complete partition layout in one pass without udev interfering.

And that's what this PR does. The change itself is a series of trivial rewrites.
It should eliminate this particular cause of non-deterministic failure (but probably there are others, so don't expect zero failures from now on 😄).
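
For illustration, here is a minimal sketch of the single-call form, using the two mkpart commands from the logs above. The `run` and `partition_disk` helpers and the `DRY_RUN` variable are hypothetical names made up for this example and are not part of the test suite. By default the sketch only prints the command, so it can be tried without touching a disk.

```shell
#!/bin/sh
# Hypothetical helper: print the command in dry-run mode (the default),
# execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: create both partitions in ONE parted invocation,
# instead of one parted call per partition, so udev cannot react to the
# first partition while a second parted call is starting up.
partition_disk() {
  dev=$1
  run parted --script "$dev" -- \
    mkpart primary linux-swap 1M 1024M \
    mkpart primary ext2 1024M -1s
}

partition_disk /dev/vda
```

With `DRY_RUN=0` the same function would invoke parted for real, which needs root and a disposable disk.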

Relevant for ZHF #45960, please backport to 18.09

WIP because I'd still like to run all the tests locally once, which will take some time. Update: done.

Things done
  • run all installer tests locally

by combining all parted commands into a single parted call.
This eliminates one cause of non-deterministic failure.
@dezgeg
Contributor

dezgeg commented Sep 22, 2018

I've noticed the races myself at some point as well:
#40230 (comment)

I fear that a similar race condition remains even after this change. Namely, udev can still be holding the partition devices open while mkfs programs or mdadm run and try to open the partition for exclusive access, leading to "Device or resource busy" errors. See e.g. this thread: https://groups.google.com/forum/#!topic/scylladb-dev/u87yHgo3ylU

@xeji
Contributor Author

xeji commented Sep 22, 2018

It will be difficult to eliminate all causes of such races. The way parted itself tried to fix this, with "sleep-and-retry", doesn't look promising.

Two things we can do to further reduce the probability of failure:

  • Wrap each parted (and maybe also mkfs etc.) call with flock /dev/vda parted ... to acquire an exclusive lock on the device. That's what util-linux recommends for sfdisk. Doesn't work with device-mapper devices like md though.

  • For the swraid test, which is the only one using mdadm, disable udev queue execution with udevadm control --stop-exec-queue, as mentioned in the Google Groups thread above.
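
The two measures above can be sketched roughly as follows. The helper names (`run`, `with_device_lock`, `with_udev_paused`) and the `DRY_RUN` variable are hypothetical and exist only for this illustration. Re-enabling the queue with `udevadm control --start-exec-queue` is an assumption about how one would undo the pause and is not spelled out in the discussion. In dry-run mode (the default) the commands are only printed.

```shell
#!/bin/sh
# Hypothetical helper: print the command in dry-run mode (the default),
# execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: hold an exclusive flock on the block device for
# the duration of the wrapped command.
with_device_lock() {
  dev=$1
  shift
  run flock "$dev" "$@"
}

# Hypothetical helper: stop udev event execution, run the command, then
# resume the queue (assumption: resume even if the command failed) and
# pass the command's exit status through.
with_udev_paused() {
  run udevadm control --stop-exec-queue
  run "$@"
  rc=$?
  run udevadm control --start-exec-queue
  return "$rc"
}

with_device_lock /dev/vda parted --script /dev/vda -- mkpart primary ext2 1024M -1s
```

As noted above, the flock variant does not help with device-mapper devices such as md, which is where pausing the udev queue comes in.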

In the swraid test, temporarily stop udev queue execution while
creating mdraid devices to prevent a race with udev, see
https://groups.google.com/forum/#!topic/scylladb-dev/u87yHgo3ylU
@xeji
Contributor Author

xeji commented Sep 22, 2018

We could also protect all calls to mkfs/mkswap in a similar way, but I don't recall seeing these fail in our tests yet, so let's fix those failures if and when they happen.

@xeji
Contributor Author

xeji commented Sep 22, 2018

Example data point: In the latest 18.09 Hydra eval there are 5 installer test failures blocking the channel, and all of them show the kind of error addressed by this PR. So these changes can really improve things.

@xeji xeji merged commit 9163c05 into NixOS:master Sep 24, 2018
@xeji
Contributor Author

xeji commented Sep 24, 2018

backported: 35271fd..570ec19

@xeji xeji deleted the p/installer-tests branch September 24, 2018 17:02