Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test-driver.py: randomize VM tmp state dir name #90663

Closed
wants to merge 1 commit into from

Conversation

picnoir
Copy link
Member

@picnoir picnoir commented Jun 17, 2020

Motivation for this change

The NixOS test VMs state directory is currently hardcoded to
/tmp/vm-state-$machineName.

While this is perfectly fine when building the test from a sandboxed
environment, we can run into some stray state problem across various
runs when we use the test interactively. These stray state-induced
errors can be pretty subtle to spot and debug.

We fix this problem by using a randomized and unique state dir name
for each run. This should prevent any stray state across several runs
while keeping the state of the previous ones for a potential post-run
analysis.

After a run:

~/Code/nixpkgs(nin-randomize-VM-state*) » ll -t /tmp | head -n 5                                                                                                
total 1324
drwx------ 2 ninjatrappeur users    4096 17 juin  11:29 nixos-test-vde-8qvseqbo-vde1.ctl
drwx------ 3 ninjatrappeur users    4096 17 juin  11:28 vm-state-server-o7qiv1q7
drwx------ 3 ninjatrappeur users    4096 17 juin  11:28 vm-state-client-gwl7rtaa
drwx------ 2 ninjatrappeur users    4096 17 juin  11:27 nixos-test-vde-sq094o70-vde1.ctl
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

The NixOS test VMs state directory is currently hardcoded to
/tmp/vm-state-$machineName.

While this is perfectly fine when building the test from a sandboxed
environment, we can run into some stray state problem across various
runs when we use the test interactively. These stray state-induced
errors can be pretty subtle to spot and debug.

We fix this problem by using a randomized and unique state dir name
for each run. This should prevent any stray state across several runs
while keeping the state of the previous ones for a potential post-run
analysis.
@picnoir picnoir requested a review from tfc as a code owner June 17, 2020 09:32
@nlewo
Copy link
Member

nlewo commented Jun 17, 2020

I think a goal of this state file is to be able to restart a VM (when interactively used). In theory, this should work well, but this is not what I'm observing in practice! So, I always have to remove it.
Another approach could be to remove this file (by default) if it already exists. If a user wants to use it, a flag could be set to not remove it.
This would also avoid the /tmp pollution (which is not a real issue I guess).

@mmilata
Copy link
Member

mmilata commented Jun 17, 2020

No opinion on this matter, however if a change is going to be made the documentation in nixos/doc/manual/development/running-nixos-tests-interactively.xml should be updated accordingly. Also the Perl test driver in nixos/lib/test-driver/Machine.pm.

@edolstra
Copy link
Member

The use of a deterministic name is intentional so we can keep state across reboots. This is useful both for interactive use and in some tests to see whether state it kept correctly across reboots (e.g. uid assignments).

@edolstra edolstra closed this Jun 17, 2020
@picnoir picnoir deleted the nin-randomize-VM-state branch June 17, 2020 10:44
@JJJollyjim
Copy link
Member

@edolstra I think a lot of people find this pretty confusing – a lot of tests fail when the state is persisted, and it's easy to chalk that up to a different change and spend a long time debugging.

@gilligan
Copy link
Contributor

gilligan commented Jun 17, 2020

@edolstra well it is obvious that it is intentional right now.

But instead of just stating this and closing the issue you could have at least tried to relate to the problems that multiple people are stating here - which i have also run into when debugging something.

You don't seem to even acknowledge the possibility that there is an issue that people are running into.

The fact that it hasn't been a problem for you doesn't mean it doesn't exist. I would appreciate if in scenarios like this there could at least be some sort of discourse:

What are the issue that people are facing? What is the suggested solution and how does it collide with existing goals? Is there something that can be done about this that you or others are less opposed to but still addresses the issue described?

I think this would give a less hostile and more encouraging impression.

PS: Just to clarify: i am not saying you are wrong, my concern is purely about the communication and not ignoring concerns:

This is intended behavior. Ticket closed.

versus

I see how this can be problematic but the reason for this is that we want these to be deterministic across reboots. This is all gone with your change. How are we supposed to deal with that? Are you saying we should just drop determinism? This would be a problem because <....>

@andir
Copy link
Member

andir commented Jun 17, 2020

I for one would appreciate a --impure (or similar named) flag to enable that feature. In most cases I don't want shared state but sometimes I need it. It should be opt-in. Most impure things are (moving towards) opt-in in the Nix world.

@gilligan
Copy link
Contributor

gilligan commented Jun 17, 2020

@edolstra is a flag something that you could find acceptable? Where the default behavior is the current behavior and --impure would be something similar to what this PR implements? D'OH that was the wrong way round 😅 but you get the idea..

@edolstra
Copy link
Member

Personally I would just add a warning like using existing state directory /tmp/vm-state-bla. That should alert interactive users that they're reusing an existing VM. IMHO that's better than silently filling up the disk with virtual machine images.

BTW, since NixOS deployments should generally be congruent (i.e. not depend on the previous state of the system), if a test fails because the previous contents of the disk, then there might be an issue with the test or the service that it's testing.

@flokli
Copy link
Contributor

flokli commented Jun 17, 2020

BTW, since NixOS deployments should generally be congruent (i.e. not depend on the previous state of the system), if a test fails because the previous contents of the disk, then there might be an issue with the test or the service that it's testing.

@edolstra This won't work. If you terminate an instance during the test, you might end up with a screwed up file system, and NixOS shouldn't try to automatically fix this and break it further.

If have been bitten by some state on the machines in other occurences as well.

IMHO, The sane default here is to put the qcow image in a tmpdir that's cleaned up after invocation - with some flags to alternate behaviour.

@picnoir
Copy link
Member Author

picnoir commented Jun 17, 2020

Personally I would just add a warning like using existing state directory /tmp/vm-state-bla. That should alert interactive users that they're reusing an existing VM. IMHO that's better than silently filling up the disk with virtual machine images.

Stdout gets flooded by the VMs boot sequence when starting up the test driver. I'm not sure I'd notice such a warning. On top of that I'm generally more in enclined to prevent people from using a footgun altogether rather than warn them they're about to use a footgun.

IMHO that's better than silently filling up the disk with virtual machine images.

That's a fair point!

I could add a --impure flag to this PR allowing you to keep the state after the VM run (ie. the current behavior). On a default run, we could erase the VM state altogether to prevent /tmp to get unnecessarily filled.

^ Does that sound like a fair middle-ground to you? Would you be up to merge such PR?

@domenkozar
Copy link
Member

I would name that something more specific, like --keep-vm-state but 👍 for that over debugging time

@picnoir
Copy link
Member Author

picnoir commented Jun 18, 2020

Implemented this solution there: picnoir@052e4f5

@domenkozar good suggestion, I used this together with -K to reflect nix-build's behaviour.

@mmilata good point regarding the doc. Updated. I also added a line to the release note about this behaviour's change.

Could we re-open this PR to talk about these changes?

@flokli
Copy link
Contributor

flokli commented Jun 18, 2020

@NinjaTrappeur github doesn't allow to reopen this PR, as the branch was deleted. But yes, please open a new PR, and cross-link it from here.

@picnoir
Copy link
Member Author

picnoir commented Jun 18, 2020

👍

See #91046

@ali-abrar
Copy link
Contributor

You can run with TMPDIR=/path/to/state-dir and handle the creation/cleanup of the tmpdir yourself. Hopefully that helps with some use cases.

@picnoir
Copy link
Member Author

picnoir commented Jun 29, 2020

@ali-abrar, #91046 has been merged. It'll delete the VMs state on the test runner startup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants