Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a delay before killing the builder process #2201

Closed
wants to merge 1 commit into from

Conversation

nlewo
Copy link
Member

@nlewo nlewo commented Jun 1, 2018

Sometimes, the builder process is killed while the build succeeded. It
seems this is because the signal is sent after file descriptors are
closed (in do_exit) and before the process reaches the terminated
state. This leads to a build failure (failed due to signal 9).

To mitigate this issue, a delay (~10s) is introduced before sending the kill
signal to the build process if it doesn't reach the terminated state.

Fixes #2176

Sometimes, the builder process is killed while the build succeeded. It
seems this is because the signal is sent after file descriptors are
closed (in `do_exit`) and before the process reaches the terminated
state. This leads to a build failure (failed due to signal 9).

To mitigate this issue, a delay (~10s) is introduced before sending the kill
signal to the build process if it doesn't reach the terminated state.

Fixes NixOS#2176
@nlewo
Copy link
Member Author

nlewo commented Jun 1, 2018

I'm not sure about the root cause of the issue #2176 but with this patch, I'm no longer able to reproduce it (30000 iterations).
Moreover, this is my first C++/nix patch... so please consider this PR as a poc:/

@edolstra
Copy link
Member

edolstra commented Jun 4, 2018

Adding arbitrary timeouts doesn't seem like a good idea. For instance, why is 10 seconds an appropriate value? Doesn't this mean that the build can still fail randomly under high load? It would probably be better to fix the shell not to close stdout/stderr before exiting.

@nlewo
Copy link
Member Author

nlewo commented Jun 4, 2018

I agree arbitrary timeouts are not a good idea but I don't know how to "fix" this bug otherwise. Moreover we currently have this timeout but we don't know its value!

I think the problem comes from the way the worker knows the builder finished: we wait for closed file descriptors. But to terminate a process, the Linux kernel seems to first close file descriptors and then, it changes the status of the process.
It is possible the worker process sends the kill signal before these two steps. With this patch, the kernel have 10 seconds to mark the process as terminated.

It is still possible the build fails under high load. But from my basic experiments, I was no longer able to hit this issue with 10sec.

Do you suggest to prevent user build code to close stdout/stderr in order to remove the kill expression? I'll investigate this way.

@dezgeg
Copy link
Contributor

dezgeg commented Jun 4, 2018

Maybe it would be better to switch 21948de to using something non-blocking (like catching SIGCHLD + self-pipe trick or signalfd).

@tomberek
Copy link
Contributor

Ran into this problem when running Nix on AWS Lambda. (https://github.com/tomberek/lambdanix). For what it's worth, this patch fixed it. Not sure why. I was also completely abusing the pseud-terminal interface (AWS does not give you access to one), so this could be other issues interacting.

@stale
Copy link

stale bot commented Feb 13, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Feb 13, 2021
@stale
Copy link

stale bot commented Apr 16, 2022

I closed this issue due to inactivity. → More info

@stale stale bot closed this Apr 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Builder is sometimes unexpectedly killed
4 participants