Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test-driver.py: delete VM state directory after test run #91046

Merged
merged 1 commit into from Jun 28, 2020

Conversation

picnoir
Copy link
Member

@picnoir picnoir commented Jun 18, 2020

Motivation for this change

Continuation of #90663, read the discussion there first to get the full context of this PR.

Keeping the VM state test across several run sometimes lead to subtle
and hard to spot errors in practice.

Things done

We delete the VM state which
contains (among other things) the qcow volume.

We also introduce a -K (--keep-vm-state) flag making VM state to
persist after the test run. This flag makes test-driver.py to match
its previous behaviour.

Note: some diffs in the running-nixos-tests-interactively.xml file are coming from the docbook autoformater after running make. I decided to bit the bullet and these diffs, preventing any future reformatting. I can undo this if necessary.

  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

nixos/lib/test-driver/test-driver.py Outdated Show resolved Hide resolved
nixos/lib/test-driver/test-driver.py Outdated Show resolved Hide resolved
@jtojnar
Copy link
Contributor

jtojnar commented Jun 19, 2020

One thing I do not like about the keep flag (with nix-build too) is that I need to make the choice before starting the VM. If I start a long running test I do not expect to fail and then it fails, I need to run it again.

Not saying the current default is better, I was bitten by that too.

@picnoir
Copy link
Member Author

picnoir commented Jun 19, 2020 via email

@danielfullmer
Copy link
Contributor

That's a fair point. Just to be sure I understand what you're suggesting: you'd rather like to delete a potential VM state during the machine startup instead of cleaning that as soon as we shutdown the machine?

That would be my personal preference. Keep in mind that machine startup/shutdown does not always correspond exactly with script invocation/cleanup. We can have additional machine.shutdown() and machine.startup() calls inside the testScript, and I'd definitely still like to retain state between reboots that take place inside the script.

The current PR does this properly by removing state in clean_up() set with atexit.register. If you switched to removing VM state during startup, you'd need to be sure it only happens during the first VM startup in the script.

Keeping the VM state test across several run sometimes lead to subtle
and hard to spot errors in practice. We delete the VM state which
contains (among other things) the qcow volume.

We also introduce a -K (--keep-vm-state) flag making VM state to
persist after the test run. This flag makes test-driver.py to match
its previous behaviour.
@picnoir
Copy link
Member Author

picnoir commented Jun 21, 2020

I updated the script to remove the state on startup instead of doing that during the test teardown.

@danielfullmer you're absolutely right. I took care of doing that during the test startup and not during the machine startup.

I updated the doc accordingly to this behaviors change.

@picnoir
Copy link
Member Author

picnoir commented Jun 23, 2020

@Ma27 @flokli @nlewo @jtojnar What's the next step here?

@Ma27
Copy link
Member

Ma27 commented Jun 28, 2020

It's fine from my PoV. @flokli can we merge?

@flokli
Copy link
Contributor

flokli commented Jun 28, 2020

👍

@flokli flokli merged commit 9e248c9 into NixOS:master Jun 28, 2020
@edolstra
Copy link
Member

Deleting virtual machine images without confirmation seems like the worst possible default imaginable. It's basically an unprompted rm -rf /. Shouldn't it be the other way around (--no-keep-vm-state)?

@flokli
Copy link
Contributor

flokli commented Jun 29, 2020

Deleting virtual machine images without confirmation seems like the worst possible default imaginable.

From the discussions in #90663 (comment), cleaning up, plus a --keep-vm-state flag to override the behaviour was what most people felt reasonable. Cleanup is explicitly done on startup (not teardown), so post-mortem analysis is possible, no matter how the flag is set.

This is called test-driver, I wouldn't consider its artifacts to be worth keeping, at least not by default. If this wants to rely on state from previous invocations, these runs should be part of the testScript itself IMHO.

As described in the linked issue, multiple people ran into annoying behaviour, until they realized it was the qcow file that was kept between restarts. If people want that behaviour, they have a way to opt in to it - and it's documented in the release notes, too.

@picnoir picnoir deleted the nin-delete-vm-state branch June 30, 2020 09:25
@picnoir
Copy link
Member Author

picnoir commented Jun 30, 2020

Deleting virtual machine images without confirmation seems like the worst possible default imaginable.

Citation needed.

It's basically an unprompted rm -rf /

If you mean rm -rf / on the host, it is not. If you mean rm -rf / on the VM machine, that's precisely what this PR is about: removing a temporary (saved in $TMPDIR) stray state that prevent to interactively run a NixOS test with a deterministic starting point. As Flokli mentioned above, those are test VMs, their state shouldn't have to outlive the test runner; except in some cases for debugging purposes, that's why we added the --keep-vm-state escape hatch.

This stray state issue bit me personally twice. Several people publicly stated being bitten by this in these two PRs. Some people also mentioned me being bitten by this at some point in some private side-channels.

I know as a matter of fact some people are wrapping the test runner in a bash script removing any stray state before running the actual VM test, some other people are moving TMPDIR somewhere else to prevent a previous stray state to pollute the VM test. All these things are dirty hack to go around this massive foot gun. As I said earlier, I think the proper fix is to remove this footgun altogether here.

Shouldn't it be the other way around (--no-keep-vm-state)?

The whole point of this PR is to make the precise opposite behavior the default one.

I'm open to discuss this and change my mind. As a matter of fact, I actually incorporated almost all the requested changes as part of this PR.

That being said, you need to back up your affirmations with some practical arguments if you want to move this discussion ahead.

@edef1c
Copy link
Member

edef1c commented Sep 2, 2020

As far as I can tell, this change is entirely inert: os.path.isfile will always return False for the state directory, so this is dead code.
When I fix it…

diff --git a/third_party/nixpkgs/nixos/lib/test-driver/test-driver.py b/third_party/nixpkgs/nixos/lib/test-driver/test-driver.py
index 7b8d5803a..43281d5e8 100644
--- a/third_party/nixpkgs/nixos/lib/test-driver/test-driver.py
+++ b/third_party/nixpkgs/nixos/lib/test-driver/test-driver.py
@@ -778,7 +778,7 @@ class Machine:

     def cleanup_statedir(self) -> None:
         self.log("delete the VM state directory")
-        if os.path.isfile(self.state_dir):
+        if os.path.isdir(self.state_dir):
             shutil.rmtree(self.state_dir)

     def shutdown(self) -> None:

…it breaks everything:

>>> start_all()
starting all VMs
haproxy: starting vm
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nix/store/7lyxayhc23jpy8wljk2k62k5dgrdwzw4-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 874, in start_all
    machine.start()
  File "/nix/store/7lyxayhc23jpy8wljk2k62k5dgrdwzw4-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 713, in start
    self.monitor_socket = create_socket(monitor_path)
  File "/nix/store/7lyxayhc23jpy8wljk2k62k5dgrdwzw4-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 708, in create_socket
    s.bind(path)
FileNotFoundError: [Errno 2] No such file or directory

[Errno 2] No such file or directory

@edef1c
Copy link
Member

edef1c commented Sep 2, 2020

As an aside, a piece of software logging "delete the VM state directory" (ie, in the imperative case) makes me consider whether I—the human—am supposed to delete the directory, and definitely confused me for a bit given that the software certainly didn't seem to be deleting the directory itself. It probably should log "deleted the VM state directory", but only after it has actually successfully done so.
We should probably mention "if you want to keep VM state, pass --keep-vm-state", perhaps logging that anytime the flag isn't set, so that it's mentioned on the first run, and users can avoid losing state they intend to retain on the second run.

@picnoir
Copy link
Member Author

picnoir commented Sep 2, 2020

Thanks for the heads up. I realized this recently and worked on a fix, I totally forgot to upstream that.

I'm still a bit unsure about the fix, I need to spend a few brain cycles to be sure it's not making things worse. I'll try to PR that soon together with the message improvement you're suggesting.

In the meantime, here's the diff I'm currently applying.

diff --git a/nixos/lib/test-driver/test-driver.py b/nixos/lib/test-driver/test-driver.py
index 612bb02aaa9..bc19033277b 100644
--- a/nixos/lib/test-driver/test-driver.py
+++ b/nixos/lib/test-driver/test-driver.py
@@ -162,6 +162,11 @@ class Machine:
                 if match:
                     self.name = match.group(1)
 
+        self.logger = MachineLogAdapter(
+            logger,
+            extra=dict(machine=self.name, colour_code=next(machine_colours_iter)),
+        )
+
         self.script = args.get("startCommand", self.create_startcommand(args))
 
         tmp_dir = os.environ.get("TMPDIR", tempfile.gettempdir())
@@ -171,7 +176,10 @@ class Machine:
             os.makedirs(path, mode=0o700, exist_ok=True)
             return path
 
-        self.state_dir = create_dir("vm-state-{}".format(self.name))
+        self.state_dir = os.path.join(tmp_dir, f"vm-state-{self.name}")
+        if not args["keepVmState"]:
+            self.cleanup_statedir()
+        os.makedirs(self.state_dir, mode=0o700, exist_ok=True)
         self.shared_dir = create_dir("shared-xchg")
 
         self.booted = False
@@ -180,10 +188,6 @@ class Machine:
         self.socket = None
         self.monitor: Optional[socket.socket] = None
         self.allow_reboot = args.get("allowReboot", False)
-        self.logger = MachineLogAdapter(
-            logger,
-            extra=dict(machine=self.name, colour_code=next(machine_colours_iter)),
-        )
 
     @staticmethod
     def create_startcommand(args: Dict[str, str]) -> str:
@@ -714,8 +718,9 @@ class Machine:
         self.logger.info(f"QEMU running (pid {self.pid})")
 
     def cleanup_statedir(self) -> None:
-        self.logger.info("delete the VM state directory")
-        if os.path.isfile(self.state_dir):
+        self.logger.info(self.state_dir)
+        if os.path.isdir(self.state_dir):
+            self.logger.info(f"delete the VM state directory")
             shutil.rmtree(self.state_dir)
 
     def shutdown(self) -> None:
@@ -870,10 +875,10 @@ def main() -> None:
     for nr, vde_socket, _, _ in vde_sockets:
         os.environ["QEMU_VDE_SOCKET_{}".format(nr)] = vde_socket
 
-    machines = [create_machine({"startCommand": s}) for s in vm_scripts]
-    for machine in machines:
-        if not cli_args.keep_vm_state:
-            machine.cleanup_statedir()
+    machines = [
+        create_machine({"startCommand": s, "keepVmState": cli_args.keep_vm_state})
+        for s in vm_scripts
+    ]
     machine_eval = [
         "global {0}; {0} = machines[{1}]".format(m.name, idx)
         for idx, m in enumerate(machines)

@edef1c
Copy link
Member

edef1c commented Sep 2, 2020

Thanks for the quick response, and glad to hear you're aware of the issue.
Unfortunately, the patch doesn't seem to apply cleanly to nixpkgs master nor the version of nixpkgs I'm using, and it looks like the code has diverged enough that porting the patch is a bit of a chore. C'est la vie — I'll see if I can rebase it this week.

+            self.logger.info(f"delete the VM state directory")

Not sure why the f-string, and we should definitely say either "deleted" (past tense, after deleting) or "deleting" (present continuous, printed before deleting), rather than "delete" (imperative case, asking the user to do something).

@flokli
Copy link
Contributor

flokli commented Sep 6, 2020

@NinjaTrappeur did you get a chance to PR the fix in the meantime?

flokli pushed a commit that referenced this pull request Sep 10, 2020
The previous version of the code would only kick in if the state
directory path pointed at a *file*, which never occurs. Making that
codepath actually work reveals an ordering bug, which this patch fixes
as well.

It also replaces the confusing, imperative case log message "delete VM
state directory" with "deleting VM state directory".

Finally, we hint the user about how to prevent this deletion. IE. by
passing the --keep-vm-state flag.

Bug report:
#91046 (comment)

Credit goes to Edef for the rebase on top of a recent nixpkgs commit
and for writing most of this commit message.

Co-authored-by: edef <edef@edef.eu>
flokli pushed a commit that referenced this pull request Sep 10, 2020
The previous version of the code would only kick in if the state
directory path pointed at a *file*, which never occurs. Making that
codepath actually work reveals an ordering bug, which this patch fixes
as well.

It also replaces the confusing, imperative case log message "delete VM
state directory" with "deleting VM state directory".

Finally, we hint the user about how to prevent this deletion. IE. by
passing the --keep-vm-state flag.

Bug report:
#91046 (comment)

Credit goes to Edef for the rebase on top of a recent nixpkgs commit
and for writing most of this commit message.

Co-authored-by: edef <edef@edef.eu>
(cherry picked from commit ecb73fd)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants