stage-1: Detect hung tasks and abort boot #236

Merged
merged 4 commits into from Nov 8, 2020

samueldr
Member

@samueldr samueldr commented Nov 5, 2020

This builds on #234


With this PR, the boot is aborted when no tasks can be run for a given amount of time.

The test case I used for this is to build the demo image with a bad fileSystems."/".device.

diff --git a/modules/rootfs.nix b/modules/rootfs.nix
index ff491e2c998f..7bf93a9526d2 100644
--- a/modules/rootfs.nix
+++ b/modules/rootfs.nix
@@ -78,7 +78,7 @@ in
 
   fileSystems = {
     "/" = lib.mkDefault {
-      device = "/dev/disk/by-label/${rootfsLabel}";
+      device = "/dev/disk/by-label/${rootfsLabel}_FAIL";
       fsType = "ext4";
       autoResize = true;
     };

This is representative enough of someone not flashing the rootfs on their device.

The user experience is as follows:

  • While the boot is hanging, a message is shown on the progress display.
  • When it fails, the existing failure handler gets a more complete message.

[Image omitted]

This is not ideal. It would be great to somehow resolve this into better error messages, but all the information is right there for the end user. So while it's not the best UX possible, it's the better option, being totally transparent.

@samueldr samueldr added the 3. topic: stage-1 stage-1, boot, init label Nov 5, 2020
@samueldr
Member Author

samueldr commented Nov 5, 2020

The failure applet might be re-designed a bit soon, so don't worry about its appearance.

And if you worry about it, please provide a better implementation!

@samueldr samueldr added the 4. type: enhancement New feature or request label Nov 5, 2020
@samueldr samueldr marked this pull request as draft November 7, 2020 21:14
@samueldr samueldr marked this pull request as ready for review November 8, 2020 01:12
 - exit after everything happened, just in case
 - exit the progress display
 - allow a custom delay to be set

It's a Files dependency, but with just a bit more user friendliness when
used in an error message.

The way we're handling hung tasks is to have a global timer that is reset
whenever a task is run.

This gives every task the maximum number of chances to have its
dependencies resolved.

A minimum of 60s is given, but in reality chances are that the conditions
for trying to resolve a dependency were already present before the timeout
started counting towards that particular dependency.

Note that a long-running task, when successfully run, does not cause the
timeout to be reached.

E.g. a task is started 10s into the timeout; the loop is not executed
again until the task exits. When it exits, the branch followed is the one
for a task that ran, which resets the timer. So even if the task took 70s
(which gives us 80 seconds in total), a timeout of 60s wouldn't apply here.

Though, please, don't make your tasks take that much time to run!
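
For readers who want the timer behaviour spelled out, here is a minimal sketch of the logic described in the commit message above. It is written in Python for illustration and is not the actual stage-1 code. The names `run_tasks` and `HungTasksError`, the `(name, ready, run)` task shape, and the 0.1s polling interval are all assumptions; only the 60s default and the "reset when a task ran" rule come from the message.

```python
import time


class HungTasksError(Exception):
    """Raised when no task could be run for longer than the allowed delay."""


def run_tasks(tasks, timeout=60.0, clock=time.monotonic, sleep=time.sleep):
    """Run each task once its dependencies are ready (hypothetical sketch).

    `tasks` is a list of (name, ready, run) tuples, where `ready` and `run`
    are zero-argument callables. Returns the task names in execution order.
    """
    pending = list(tasks)
    done = []
    last_progress = clock()  # the single global timer

    while pending:
        # Pick the first pending task whose dependencies are fulfilled.
        task = next((t for t in pending if t[1]()), None)
        if task is not None:
            name, _ready, run = task
            run()  # blocks; the loop does not execute until the task exits
            pending.remove(task)
            done.append(name)
            # Branch for "a task ran": reset the timer after the task exits,
            # so time spent inside a long task never counts as hanging.
            last_progress = clock()
        elif clock() - last_progress >= timeout:
            waiting = ", ".join(t[0] for t in pending)
            raise HungTasksError(
                f"no task could run for {timeout}s; still waiting on: {waiting}"
            )
        else:
            sleep(0.1)  # assumed polling interval
    return done
```

With this shape, a task that starts 10s into the timeout and takes 70s does not trip a 60s timeout, while a task whose dependency never appears (such as a root filesystem label that does not exist) raises after 60s with the names of the tasks still waiting.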
@samueldr
Member Author

samueldr commented Nov 8, 2020

When I tested #234 on device, it was actually tested with this PR.

I will add that this was successfully failing the boot in a useful way when I broke my Pinephone setup!

@samueldr samueldr merged commit 61566f3 into NixOS:master Nov 8, 2020
@samueldr samueldr deleted the feature/stage-1-hung-tasks branch November 8, 2020 01:18