tests: Wait for shell for twice as long (10m) #53828

Merged
merged 1 commit into from Jan 12, 2019

Conversation

samueldr
Member

See #49441 for an earlier attempt, which was subsequently reverted. I am
assuming that doubling the time will be sufficient if the machine is
overloaded, since so many of the tests already pass at 5 minutes, while
still not delaying genuine failures for needlessly long.

Things done
  • ✔️ Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • ✔️ NixOS
    • 🔲 macOS
    • 🔲 other Linux distributions
  • ✔️ Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • ✔️ Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • 🔲 Tested execution of all binary files (usually in ./result/bin/)
  • 🔲 Determined the impact on package closure size (by running nix path-info -S before and after)
  • ✔️ Assured whether relevant documentation is up to date
  • ✔️ Fits CONTRIBUTING.md.

cc @Ekleog @timokau who were involved in discussions on #nixos-dev

@timokau
Member

timokau commented Jan 12, 2019

I re-ran the uefi boot test 17 times in a row. The system was under load for many of these tries. I didn't produce a single timeout. Yet hydra has timed out twice in a row now. Seems fishy.

Increasing the timeout is worth another try I guess.

@timokau
Member

timokau commented Jan 12, 2019

It would also be nice to have a little bit more detail in the error message when the timeout is reached. At least the actual timeout value.
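To illustrate the suggestion above, here is a minimal Python sketch of a polling wait whose error message includes the configured timeout. This is not the test driver's code; `wait_until`, `WaitTimeout`, and all parameter names are hypothetical and exist only for this example.

```python
import time


class WaitTimeout(Exception):
    """Raised when a condition is not met within the allowed time."""


def wait_until(check, timeout_s, interval_s=1.0, what="condition",
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s seconds elapse.

    Returns the elapsed time in seconds on success. The clock and sleep
    functions are injectable so the helper can be tested without waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        elapsed = clock() - start
        if elapsed >= timeout_s:
            # Report both the time actually spent and the configured limit,
            # so a log reader can tell which timeout value was in effect.
            raise WaitTimeout(
                f"timed out waiting for {what} after {elapsed:.0f}s "
                f"(timeout was {timeout_s:.0f}s)"
            )
        sleep(interval_s)
```

With the 10 minute limit from this PR, a call might look like `wait_until(shell_is_connected, timeout_s=600, what="shell")`, where `shell_is_connected` stands in for whatever check the caller has.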

Contributor

@hedning hedning left a comment

Let's try this out. The random test failures consistently block nixos-unstable, and just make us inattentive to actually failing tests ☹️

@timokau
Member

timokau commented Jan 12, 2019

I ran the test 30 times now, most of the time under load, without reproducing any failure. The 31st attempt failed, but with a different error than usual ("No space left on device" while creating the iso image, although there should've been plenty of space).

I'm getting less optimistic that this will fix anything, but let's try.

@timokau
Member

timokau commented Jan 12, 2019

Never mind, I did actually somehow run out of space, df was just reporting nonsense.

@7c6f434c
Member

Maybe it was out of inodes (df -i)?
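For reference, the same distinction between block usage and inode usage can be checked from a script. The sketch below uses Python's `os.statvfs`; the helper names `space_report` and `out_of_inodes` are made up for this example, and the zero-inode handling is an assumption about filesystems that allocate inodes dynamically.

```python
import os


def space_report(path="/"):
    """Return block and inode figures for the filesystem containing path."""
    st = os.statvfs(path)
    return {
        "bytes_total": st.f_blocks * st.f_frsize,
        "bytes_available": st.f_bavail * st.f_frsize,
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
    }


def out_of_inodes(report):
    """True if the filesystem reports an inode limit and none are free."""
    # Assumption: a filesystem that reports 0 total inodes allocates them
    # dynamically, so it is treated here as having no inode limit.
    return report["inodes_total"] > 0 and report["inodes_free"] == 0
```

A "No space left on device" error with plenty of `bytes_available` but `out_of_inodes(...)` returning `True` would point at inode exhaustion rather than a full disk.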

@timokau
Member

timokau commented Jan 12, 2019

df actually reported a full disk after reboot. I guess it was some btrfs weirdness, I think there are some issues with space reporting.

Anyways, I think that is unrelated.

@Ekleog
Member

Ekleog commented Jan 12, 2019

Well… I'd say let's try (and backport to 18.09 which has also been blocked for ~5 days). The failures appear to be quite localized to packet-epyc-1 too (though not exclusively on it, saw one on lucifer too), so it may be something that only happens when lots of jobs are running in parallel.

Anyway, the only other option I can see (if it's not just slow to connect) is that it's blocking before connection, which should be easier to reproduce locally than a timeout, so…

@Ekleog
Member

Ekleog commented Jan 12, 2019

Actually, something else to note about the underlying issue: release-18.09-small appears unhindered by it. Sure, it doesn't have the installer tests, but release-18.09 also failed with this error on non-installer tests, like https://hydra.nixos.org/build/86991955 , https://hydra.nixos.org/build/86932614 or https://hydra.nixos.org/build/86748255 .

@samueldr
Member Author

Ah, forgot to link this:

That PR intends to add timing data to all tests, with an additional specific check for the "connecting" phase. This should help us find out where the hangup is.
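As a rough idea of what per-phase timing data could look like, here is a small Python sketch. It is not taken from that PR; `PhaseTimer` and the phase names are hypothetical, chosen only to show how a slow "connecting" phase would stand out.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Record how long each named phase of a test run takes."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.durations = {}  # phase name -> seconds

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the phase raised, so that
            # failing or timed-out phases still show up in the data.
            self.durations[name] = self._clock() - start

    def slowest(self):
        """Return the (name, seconds) pair of the longest recorded phase."""
        return max(self.durations.items(), key=lambda kv: kv[1])
```

Wrapping each step in `with timer.phase("connecting"): ...` and printing `timer.durations` at the end would show whether the time is spent before or during the connection.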

@samueldr samueldr merged commit 3b68ddb into NixOS:master Jan 12, 2019
@samueldr
Member Author

[release-18.09 4d5935d] tests: Wait for shell for twice as long (10m)
Date: Fri Jan 11 22:40:19 2019 -0500
1 file changed, 1 insertion(+), 1 deletion(-)
