test-driver.py: randomize VM tmp state dir name #90663

picnoir · 2020-06-17T09:32:13Z

Motivation for this change

The NixOS test VMs' state directory is currently hardcoded to
/tmp/vm-state-$machineName.

While this is perfectly fine when building the test in a sandboxed
environment, stale state can leak from one run to the next when we use
the test interactively. The resulting errors can be pretty subtle to
spot and debug.

We fix this problem by using a randomized and unique state dir name
for each run. This should prevent any stray state across several runs
while keeping the state of the previous ones for a potential post-run
analysis.

After a run:

~/Code/nixpkgs(nin-randomize-VM-state*) » ll -t /tmp | head -n 5                                                                                                
total 1324
drwx------ 2 ninjatrappeur users    4096 17 juin  11:29 nixos-test-vde-8qvseqbo-vde1.ctl
drwx------ 3 ninjatrappeur users    4096 17 juin  11:28 vm-state-server-o7qiv1q7
drwx------ 3 ninjatrappeur users    4096 17 juin  11:28 vm-state-client-gwl7rtaa
drwx------ 2 ninjatrappeur users    4096 17 juin  11:27 nixos-test-vde-sq094o70-vde1.ctl
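
For illustration, a minimal sketch of how a randomized, per-run state directory could be created in Python. This is not the code from this PR: the helper name `create_state_dir` and its `tmp_root` parameter are assumptions made for the example. It relies only on the standard library's `tempfile.mkdtemp`, which appends a random suffix to the given prefix and creates the directory readable and writable only by the current user.

```python
import tempfile
from typing import Optional


def create_state_dir(machine_name: str, tmp_root: Optional[str] = None) -> str:
    """Create and return a fresh, uniquely named state directory.

    `create_state_dir` and `tmp_root` are hypothetical names used for this
    sketch; they are not taken from test-driver.py.
    """
    # mkdtemp picks a random suffix, so two runs never share a directory.
    # With tmp_root=None the platform default temporary directory is used.
    return tempfile.mkdtemp(prefix=f"vm-state-{machine_name}-", dir=tmp_root)
```

Calling `create_state_dir("server")` twice yields two distinct directories, so a second run cannot pick up the disk image of the first, while the first directory stays around for inspection.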

nlewo · 2020-06-17T09:37:58Z

I think a goal of this state file is to be able to restart a VM (when interactively used). In theory, this should work well, but this is not what I'm observing in practice! So, I always have to remove it.
Another approach could be to remove this file (by default) if it already exists. If a user wants to use it, a flag could be set to not remove it.
This would also avoid the /tmp pollution (which is not a real issue I guess).

mmilata · 2020-06-17T10:21:25Z

No opinion on this matter; however, if a change is going to be made, the documentation in nixos/doc/manual/development/running-nixos-tests-interactively.xml should be updated accordingly, as should the Perl test driver in nixos/lib/test-driver/Machine.pm.

edolstra · 2020-06-17T10:33:15Z

The use of a deterministic name is intentional so we can keep state across reboots. This is useful both for interactive use and in some tests to see whether state is kept correctly across reboots (e.g. uid assignments).

JJJollyjim · 2020-06-17T11:07:16Z

@edolstra I think a lot of people find this pretty confusing – a lot of tests fail when the state is persisted, and it's easy to chalk that up to a different change and spend a long time debugging.

gilligan · 2020-06-17T11:23:45Z

@edolstra well it is obvious that it is intentional right now.

But instead of just stating this and closing the issue you could have at least tried to relate to the problems that multiple people are stating here - which i have also run into when debugging something.

You don't seem to even acknowledge the possibility that there is an issue that people are running into.

The fact that it hasn't been a problem for you doesn't mean it doesn't exist. I would appreciate if in scenarios like this there could at least be some sort of discourse:

What are the issues that people are facing? What is the suggested solution and how does it collide with existing goals? Is there something that can be done about this that you or others are less opposed to but still addresses the issue described?

I think this would give a less hostile and more encouraging impression.

PS: Just to clarify: i am not saying you are wrong, my concern is purely about the communication and not ignoring concerns:

> This is intended behavior. Ticket closed.

versus

> I see how this can be problematic but the reason for this is that we want these to be deterministic across reboots. This is all gone with your change. How are we supposed to deal with that? Are you saying we should just drop determinism? This would be a problem because <....>

andir · 2020-06-17T11:35:46Z

I for one would appreciate an --impure (or similarly named) flag to enable that feature. In most cases I don't want shared state but sometimes I need it. It should be opt-in. Most impure things are (moving towards) opt-in in the Nix world.

gilligan · 2020-06-17T11:51:06Z

@edolstra is a flag something that you could find acceptable? ~~Where the default behavior is the current behavior and --impure would be something similar to what this PR implements?~~ D'OH that was the wrong way round 😅 but you get the idea..

edolstra · 2020-06-17T12:32:15Z

Personally I would just add a warning like "using existing state directory /tmp/vm-state-bla". That should alert interactive users that they're reusing an existing VM. IMHO that's better than silently filling up the disk with virtual machine images.

BTW, since NixOS deployments should generally be congruent (i.e. not depend on the previous state of the system), if a test fails because of the previous contents of the disk, then there might be an issue with the test or the service that it's testing.
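
The warning suggested above could look roughly like the following sketch. The function name `open_state_dir` and the `tmp_root` parameter are assumptions for the example, not names from the test driver; only the /tmp/vm-state-$machineName layout and the wording of the message come from this thread.

```python
import os
import sys


def open_state_dir(machine_name: str, tmp_root: str = "/tmp") -> bool:
    """Ensure the deterministic state directory exists.

    Returns True when a directory from a previous run is being reused.
    `open_state_dir` is a hypothetical helper written for this sketch.
    """
    path = os.path.join(tmp_root, f"vm-state-{machine_name}")
    reused = os.path.isdir(path)
    if reused:
        # Warn on stderr so the message is not mixed into regular output.
        print(f"warning: using existing state directory {path}", file=sys.stderr)
    os.makedirs(path, exist_ok=True)
    return reused
```

The directory name stays deterministic here, so state is still kept across reboots; the only change is that reuse is announced.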

flokli · 2020-06-17T12:38:32Z

> BTW, since NixOS deployments should generally be congruent (i.e. not depend on the previous state of the system), if a test fails because of the previous contents of the disk, then there might be an issue with the test or the service that it's testing.

@edolstra This won't work. If you terminate an instance during the test, you might end up with a screwed up file system, and NixOS shouldn't try to automatically fix this and break it further.

I have been bitten by some state on the machines on other occasions as well.

IMHO, the sane default here is to put the qcow image in a tmpdir that's cleaned up after invocation, with some flags to alter the behaviour.
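
A minimal sketch of that default, assuming a context manager wrapped around the VM's lifetime. The name `vm_state_dir` and the `keep` parameter are hypothetical; `tempfile.TemporaryDirectory` and `tempfile.mkdtemp` are standard-library calls.

```python
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def vm_state_dir(machine_name: str, keep: bool = False) -> Iterator[str]:
    """Yield a state directory that is removed on exit unless keep=True.

    `vm_state_dir` and `keep` are illustrative names, not part of the
    test driver.
    """
    prefix = f"vm-state-{machine_name}-"
    if keep:
        # Opt-out: create the directory and leave it behind for inspection.
        yield tempfile.mkdtemp(prefix=prefix)
    else:
        # Default: the directory and its contents are deleted when the
        # `with` block ends, even if the body raises.
        with tempfile.TemporaryDirectory(prefix=prefix) as path:
            yield path
```

Used as `with vm_state_dir("server") as state_dir: ...`, the qcow image would live under `state_dir` and disappear once the block ends.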

picnoir · 2020-06-17T12:49:50Z

> Personally I would just add a warning like using existing state directory /tmp/vm-state-bla. That should alert interactive users that they're reusing an existing VM. IMHO that's better than silently filling up the disk with virtual machine images.

Stdout gets flooded by the VMs' boot sequence when starting up the test driver. I'm not sure I'd notice such a warning. On top of that, I'm generally more inclined to prevent people from using a footgun altogether rather than warn them they're about to use a footgun.

> IMHO that's better than silently filling up the disk with virtual machine images.

That's a fair point!

I could add an --impure flag to this PR allowing you to keep the state after the VM run (i.e. the current behavior). On a default run, we could erase the VM state altogether to prevent /tmp from getting unnecessarily filled.

^ Does that sound like a fair middle ground to you? Would you be willing to merge such a PR?

domenkozar · 2020-06-17T19:21:03Z

I would name that something more specific, like --keep-vm-state but 👍 for that over debugging time

picnoir · 2020-06-18T15:04:11Z

Implemented this solution there: picnoir@052e4f5

@domenkozar good suggestion, I used this together with -K to reflect nix-build's behaviour.

@mmilata good point regarding the doc. Updated. I also added a line to the release notes about this change in behaviour.

Could we re-open this PR to talk about these changes?
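
For readers following along, here is a rough sketch of the opt-in behaviour discussed above. It is not the code from the linked commit. The -K and --keep-vm-state option names come from this thread; the helper names `parse_args` and `reset_state_dir`, the `tmp_root` parameter, and the choice to clear the directory before the VM starts are assumptions made for the example.

```python
import argparse
import os
import shutil
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse a hypothetical driver command line."""
    parser = argparse.ArgumentParser(description="sketch of a test driver front end")
    parser.add_argument(
        "-K",
        "--keep-vm-state",
        action="store_true",
        help="re-use the VM state left behind by a previous run",
    )
    return parser.parse_args(argv)


def reset_state_dir(machine_name: str, tmp_root: str, keep_vm_state: bool) -> str:
    """Return the state directory, wiping old contents unless asked to keep them."""
    path = os.path.join(tmp_root, f"vm-state-{machine_name}")
    if not keep_vm_state:
        # Default: start from a clean slate so stale state cannot leak in.
        shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)
    return path
```

With this shape, shared state becomes something the user asks for explicitly, and a default run neither reuses old disk images nor accumulates them.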

flokli · 2020-06-18T16:07:14Z

@NinjaTrappeur GitHub doesn't allow reopening this PR, as the branch was deleted. But yes, please open a new PR, and cross-link it from here.

picnoir · 2020-06-18T16:26:24Z

👍

See #91046

ali-abrar · 2020-06-29T16:03:12Z

You can run with TMPDIR=/path/to/state-dir and handle the creation/cleanup of the tmpdir yourself. Hopefully that helps with some use cases.
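
As a sketch of that workaround, the wrapper below creates a private directory, points TMPDIR at it for a child process, and removes it afterwards. The helper name `run_with_private_tmpdir` and the `nixos-test-` prefix are made up for the example, and the command to run is left to the caller.

```python
import os
import subprocess
import tempfile
from typing import List


def run_with_private_tmpdir(command: List[str]) -> int:
    """Run `command` with TMPDIR set to a fresh directory, then clean it up.

    `run_with_private_tmpdir` is a hypothetical helper for this sketch.
    """
    with tempfile.TemporaryDirectory(prefix="nixos-test-") as state_root:
        # Copy the current environment and override only TMPDIR.
        env = dict(os.environ, TMPDIR=state_root)
        return subprocess.run(command, env=env).returncode
```

Something like `run_with_private_tmpdir(["./result/bin/nixos-test-driver"])` would then keep each run's state separate; the driver path shown is an assumption about where the built driver lives.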

picnoir · 2020-06-29T16:47:12Z

@ali-abrar, #91046 has been merged. It'll delete the VMs state on the test runner startup.

picnoir requested a review from tfc as a code owner June 17, 2020 09:32

ofborg bot added the 6.topic: nixos label Jun 17, 2020

ofborg bot added 10.rebuild-darwin: 0 10.rebuild-linux: 1-10 10.rebuild-linux: 1 labels Jun 17, 2020

picnoir requested a review from edolstra June 17, 2020 09:51

edolstra closed this Jun 17, 2020

picnoir deleted the nin-randomize-VM-state branch June 17, 2020 10:44

picnoir mentioned this pull request Jun 18, 2020

test-driver.py: delete VM state directory after test run #91046

Merged
