scalapack: Increase individual test timeout to prevent hydra failures #62128

Merged
merged 1 commit into from May 27, 2019


infinisil
Member

Motivation for this change

The build occasionally failed on Hydra (see https://hydra.nixos.org/job/nixpkgs/trunk/scalapack.x86_64-linux/all?page=1) because individual tests timed out:

95/96 Test #93: xcheevr ..........................***Timeout 1500.05 sec

[...]

95% tests passed, 5 tests failed out of 96

Total Test time (real) = 4836.54 sec

The following tests FAILED:
         44 - xdllt (Timeout)
         45 - xcllt (Timeout)
         91 - xssyevr (Timeout)
         92 - xdsyevr (Timeout)
         93 - xcheevr (Timeout)
Errors while running CTest
make: *** [Makefile:141: test] Error 8
builder for '/nix/store/kszycfkp7aimpxx8j9kr3lgk3iiwb5sj-scalapack-2.0.2.drv' failed with exit code 2

As I found out through http://icl.cs.utk.edu/lapack-forum/viewtopic.php?t=4861, this 1500s timeout is built into CMake itself and can only be overridden globally by passing --timeout to ctest, which is what this PR does, using a value of 10000s.
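For illustration, a minimal sketch of how such a flag could be passed from a package expression. It assumes the package uses stdenv's generic check phase (which runs `make` with the check flags) and that the CMake-generated Makefile's `test` target forwards `ARGS` to ctest; the surrounding attribute set is a hypothetical excerpt, not the actual diff of this PR.

```nix
{
  # Hypothetical excerpt of the arguments passed to stdenv.mkDerivation.
  doCheck = true;

  # The CMake-generated `test` target passes $(ARGS) on to ctest, so this
  # raises the per-test timeout from the 1500s default to 10000s.
  checkFlagsArray = [ "ARGS=--timeout 10000" ];
}
```

Outside of Nix, the equivalent manual invocation would be `ctest --timeout 10000` in the build directory.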

An alternative would be to just disable the tests, which might be desirable, because they take ~1h20 to run, in comparison to the ~10 minutes of the build itself! Honestly this is a huge waste of build power, and I doubt the tests will ever give us a failure anyways.

Ping @costrouc @markuskowa

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@markuskowa
Member

The tests themselves are not the problem; the overcommitment of the Hydra build machines is. The tests run at full CPU load, which is fine when the CPUs assigned to the build job are fully available. When I run the build (including tests) on my local machines, the tests hardly take any longer than the build phase.

The tests themselves ensure that the package is fully compatible with MPI and the underlying BLAS library, and as such I would prefer to leave them on. There are some subtleties with integer sizes in Fortran that would not show up when the code is compiled but lead to a crash once the code is run.

Turning off parallel building might actually defuse the problem. Is it possible to turn off parallel building only for the check phase?

@infinisil
Member Author

infinisil commented May 27, 2019

Turning off parallel building might actually defuse the problem. Is it possible to turn off parallel building only for the check phase?

Yeah that would be possible, although a bit hacky. But we don't actually know how many cores it gets built with, and if this is just 1, then this won't necessarily help anything, I think.

So I feel like this PR is a better solution to this problem, especially considering there are other people, like those in that thread, who need to increase this timeout.
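For reference, the alternative discussed above (turning off parallelism only for the check phase) could be sketched as follows. This assumes the stdenv attribute `enableParallelChecking` is honoured by the generic check phase for this package; whether it would actually reduce the load of the ctest run on Hydra is not established in this thread.

```nix
{
  # Hypothetical excerpt of the arguments passed to stdenv.mkDerivation.
  enableParallelBuilding = true;

  # Assumption: with this set to false, the generic check phase does not
  # pass -j to make, while the build phase stays parallel.
  enableParallelChecking = false;
}
```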

@markuskowa
Member

Yeah that would be possible, although a bit hacky. But we don't actually know how many cores it gets built with, and if this is just 1, then this won't help anything, I think.

It probably would still help on Hydra since jobs are often run with j>10, which results in 10 or more tests running in parallel each consuming 2-4 CPUs.

I am OK with increasing the timeout to resolve the issue.

@markuskowa markuskowa merged commit 8e9cb55 into NixOS:master May 27, 2019
@infinisil infinisil deleted the scalapack-test-timeout branch May 27, 2019 21:02