NixOS tests: Wait for shell for 10x longer (50m) #49441

Merged
merged 1 commit into from Oct 30, 2018

srhb
Contributor

srhb commented Oct 30, 2018

Motivation for this change

#49384

This increases the timeout for the virtual machine's shell to connect tenfold, to 50 minutes, in order to investigate the failures that frequently plague and block our tested set, especially the tests that include a restart (installer.*).

I have very low confidence that this is the actual issue, so it is mostly intended as a check. I find it more likely that the shell will never connect, even though the machine has finished booting, and that somehow we recently (~last month) introduced a single change that caused this, or caused this to occur much more frequently.
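For readers unfamiliar with this kind of wait, here is a minimal Python sketch of the idea, not the NixOS test driver's actual code. The function name `wait_for_shell` and the use of a plain stream socket are assumptions made for illustration; the only values taken from this PR are the old 5 minute wait and the new wait that is 10 times longer.

```python
import socket

# Values from the PR title: the old wait was 5 minutes, the new one is 10x longer.
OLD_TIMEOUT_S = 5 * 60
NEW_TIMEOUT_S = 10 * OLD_TIMEOUT_S  # 3000 s, i.e. 50 minutes


def wait_for_shell(sock: socket.socket, timeout_s: float) -> bytes:
    """Return the first bytes the guest shell sends, or raise on timeout.

    `sock` is a hypothetical, already-connected stream socket standing in
    for the channel between the test driver and the VM.
    """
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(4096)
    except socket.timeout:
        # Illustrative error text; the wording is an assumption of this sketch.
        raise TimeoutError("timed out waiting for the VM to connect") from None
    finally:
        # Restore blocking mode so later reads are not bound by this timeout.
        sock.settimeout(None)
    if not data:
        raise ConnectionError("the VM closed the connection before the shell was ready")
    return data
```

If the shell never connects at all, a longer timeout only delays the failure. That is why the change is useful as a check: a test that still fails after 50 minutes was probably not just slow.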

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Fits CONTRIBUTING.md.

@grahamc
Member

grahamc commented Oct 30, 2018

Another test to try, independent of this, might be reverting the new backdoor.

@srhb
Contributor Author

srhb commented Oct 30, 2018

> Another test to try, independent of this, might be reverting the new backdoor.

I agree, though I'm waiting for Domen and co. to convince me that this is superstitious. :P

@grahamc
Member

grahamc commented Oct 30, 2018

It seems to be your best lead, so why wait? :)

@srhb
Contributor Author

srhb commented Oct 30, 2018

Because it's a lead based solely on the timing of the commits. I can't reproduce any change in the behaviour locally with any timeout with/without the backdoor commit. Of course, it's really difficult to replicate the conditions on the builder machines realistically, but I'd like to have something slightly more solid than hand-waving. Mind, whether we do the timeout test first or the backdoor test makes little difference to me. I just want more data, and either helps. :)

@xeji
Contributor

xeji commented Oct 30, 2018

Let's give this a try first. I also had a hunch that the failures could be related to the new VM backdoor, but no evidence at all.

xeji merged commit 8bbdee0 into NixOS:master Oct 30, 2018
srhb deleted the debug-hydra-failures branch October 30, 2018 11:11
@xeji
Contributor

xeji commented Oct 31, 2018

@srhb should we backport this to 18.09? Might be easier to see the effects there, and it shouldn't break anything.

@srhb
Contributor Author

srhb commented Oct 31, 2018

@xeji I actually prefer having it on one but not the other. Though maybe it should have been on release-18.09 instead...

@srhb
Contributor Author

srhb commented Nov 2, 2018

Unfortunately/thankfully a lot of other changes happened at about the same time to hopefully unblock the channels. Just got the first indication that this is still a problem, and not really a timeout but some other issue: https://hydra.nixos.org/build/83438035/nixlog/19/tail

rebuild-switch# [   48.906985] systemd[1]: Reached target Multi-User System.
rebuild-switch# [   48.909455] systemd[1]: Startup finished in 26.106s (kernel) + 22.660s (userspace) = 48.767s.
rebuild-switch# [   52.424136] dhcpcd[663]: eth0: no IPv6 Routers available
rebuild-switch# [  907.531101] systemd[1]: Starting Cleanup of Temporary Directories...
rebuild-switch# [  907.549349] systemd[1]: Started Cleanup of Temporary Directories.
error: timed out waiting for the VM to connect
timed out waiting for the VM to connect
cleaning up
killing rebuild-switch (pid 664)
vde_switch: EOF on stdin, cleaning up and exiting
builder for '/nix/store/g5j6jzxilpqjfhwxv85iphaw18znamas-vm-test-run-installer-btrfsSubvols.drv' failed with exit code 4
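
The log excerpt above shows the guest finishing boot at about 49 seconds and still printing console output at about 907 seconds, with no shell connection before the error. A small Python sketch for pulling those timestamps out of such a log follows. The regular expression and the helper name `guest_events` are assumptions based only on the line format visible in this excerpt, not a documented Hydra or NixOS log format.

```python
import re

# Two guest lines and the error line, copied from the excerpt above.
LOG = """\
rebuild-switch# [   48.909455] systemd[1]: Startup finished in 26.106s (kernel) + 22.660s (userspace) = 48.767s.
rebuild-switch# [  907.549349] systemd[1]: Started Cleanup of Temporary Directories.
error: timed out waiting for the VM to connect
"""

# Assumed line shape: "<machine># [<seconds since boot>] <message>".
LINE_RE = re.compile(r"^(?P<machine>\S+)# \[\s*(?P<ts>\d+\.\d+)\] (?P<msg>.*)$")


def guest_events(log: str):
    """Yield (machine, seconds_since_boot, message) for each guest console line."""
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            yield m["machine"], float(m["ts"]), m["msg"]


events = list(guest_events(LOG))
boot_done = next(ts for _, ts, msg in events if "Startup finished" in msg)
last_seen = max(ts for _, ts, _ in events)
idle = last_seen - boot_done
print(f"boot finished at {boot_done:.1f}s, last guest output at {last_seen:.1f}s, "
      f"{idle:.1f}s of guest uptime without a shell connection")
```

A long gap between "Startup finished" and the timeout error, with the guest still logging, suggests the machine booted but the shell never connected, rather than the boot being too slow for the timeout.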

samueldr added a commit to samueldr/nixpkgs that referenced this pull request Jan 12, 2019
See NixOS#49441 for an earlier attempt, which was subsequently reverted. I am
assuming that doubling the time will be sufficient if the machine is
overloaded since so many of the tests already pass at 5 minutes, while
still not holding back failures for needlessly long.
samueldr added a commit that referenced this pull request Jan 12, 2019

(cherry picked from commit b28b37e)
voobscout pushed a commit to voobscout/nixpkgs that referenced this pull request Jan 16, 2019