SIGTERM still has intermittently wrong behavior w/ $CHILD_STATUS.termsig #5224
Comments
Sorry I missed your comment on the other issue; I did not notice we still had some issues. Poking around...
Ok so as far as I can tell we are doing the right thing on the child side; the exit status is 15 + 128 (143) matching what MRI (and other processes) do for a signal-related termination. The issue with the spec for
Your case above appears to be similar in that it's seeing the raw return code rather than the signal status.
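To make the distinction between a raw return code and a signal status concrete, here is a minimal sketch of what a parent observes in the two cases. It assumes a POSIX system with `sh` and `sleep` on the PATH; the helper name `status_of` is hypothetical and not part of any patch discussed here.

```ruby
# Spawn a command, optionally send it a signal, and return its Process::Status.
# `status_of` is an illustrative helper, not an existing API.
def status_of(*command, signal: nil)
  pid = Process.spawn(*command)
  Process.kill(signal, pid) if signal
  _, status = Process.wait2(pid)
  status
end

# Case 1: the child exits normally with code 143 (128 + 15).
# The parent sees a plain exit code and no terminating signal.
exited = status_of("sh", "-c", "exit 143")
puts "exit 143 -> exited?=#{exited.exited?} exitstatus=#{exited.exitstatus.inspect} termsig=#{exited.termsig.inspect}"

# Case 2: the child is actually terminated by SIGTERM (default disposition).
# The parent sees a terminating signal and no exit code.
killed = status_of("sleep", "60", signal: "TERM")
puts "SIGTERM  -> signaled?=#{killed.signaled?} exitstatus=#{killed.exitstatus.inspect} termsig=#{killed.termsig.inspect}"
```

A parent that checks `termsig` only gets a value in the second case, which is the mismatch being described.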
These results are very confusing. Even though it seems like we're doing the right thing in the child, just swapping from CRuby to JRuby in the child (using a CRuby parent) causes the following results:
The 36608 value here appears to be the return code shifted left 8 bits (143 << 8 = 36608), which as far as I can tell from the CRuby source should only happen for failed spawns, and that's still on the parent side. With our child returning a proper error code, I'm baffled why CRuby would have such different results here. What am I missing?
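The arithmetic behind 36608 can be checked with a small decoder. This is a sketch that assumes the common Linux layout of the 16-bit wait status (low 7 bits hold the terminating signal, the next byte up holds the exit code); `decode_wait_status` is a hypothetical name, and other platforms may encode the status differently.

```ruby
# Decode a raw wait status, assuming the Linux/glibc bit layout:
#   low 7 bits  = terminating signal (0 means the process exited normally)
#   bit 0x80    = core dump flag
#   bits 8..15  = exit code (or stop signal for a stopped process)
def decode_wait_status(raw)
  low7 = raw & 0x7f
  if low7.zero?
    { kind: :exited, exitstatus: (raw >> 8) & 0xff }
  elsif low7 == 0x7f
    { kind: :stopped_or_continued, stopsig: (raw >> 8) & 0xff }
  else
    { kind: :signaled, termsig: low7, core_dumped: (raw & 0x80) != 0 }
  end
end

from_thread = decode_wait_status(36608)
puts from_thread[:kind]             # normal exit, not a signal
puts from_thread[:exitstatus]       # 143
puts from_thread[:exitstatus] - 128 # 15, the number of SIGTERM on Linux

real_sigterm = decode_wait_status(15)
puts real_sigterm[:kind]            # terminated by a signal
puts real_sigterm[:termsig]         # 15
```

Under that assumption, 36608 is simply the raw status of a process that exited normally with code 143, while a process killed by SIGTERM would have a raw status of 15.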
FTR, using the native launcher instead of the bash script does not fix the problem:
Also, I don't know what would cause this to fail intermittently for you unless the child process is only sometimes exiting with a signal return code. As far as I can see, that would only happen if some other error were raised or the process terminated normally.
Worth pointing out that JRuby parent with CRuby child seems to work fine, making this even more baffling to me:
Ah-ha...I think I figured it out. The JVM's standard mechanism for exiting basically just calls the system I found this link: https://www.gnu.org/software/libc/manual/html_node/Termination-in-Handler.html With this text:
So I think the right thing to do is re-propagate the signal using equivalent code in our |
Ok, I added |
Whew, ok...I think I've figured out that this is not our fault. Here's an example using no JRuby at all...the subprocess is a Java program that just loops forever until killed or terminated. The resulting exit status is the same as JRuby.
So it seems like the standard exit/termination process for a JVM does not do the right thing, for SIGTERM at the very least.
I've posted a question about this to an internal Red Hat JVM maintainers list for more help.
Thanks a lot for all the help in this! Please keep us posted if you find something that can help. I looked at the logs for the intermittent errors, but I can't even seem to find the error occurring any more. What I see instead are RSpec failures caused by our randomized test order. 😏 So, I'm not sure about the "intermittent" part any more. Either way, we can both see above that there seems to be an issue with this, be it intermittent or not.
Ok so I've posted to a public list because the folks inside Red Hat suggested I do so: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2018-June/028864.html This removes Ruby from the equation altogether, and I think shows pretty poor behavior on the part of Hotspot. One of the Red Hat folks did discover that the I'll keep ya posted! |
It appears the magic piece of my patch is actually calling the raise(3) function to allow the system to also handle the SIGTERM. Without that the exit status is still borked. So perhaps Hotspot JVM is not properly propagating the signal to the system-default handler?
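The effect of re-raising can be illustrated without a JVM. The following is a sketch in plain Ruby for MRI on a POSIX system (it relies on `Kernel#fork`, which JRuby does not provide), not the actual patch. The helper `status_after_term` is hypothetical; it forks a child, lets the given block install a TERM handler, and returns the status the parent sees after sending SIGTERM.

```ruby
# Fork a child, let the block install a TERM handler in it, then send
# SIGTERM and return the Process::Status seen by the parent.
# A pipe is used so the signal is only sent once the handler is installed.
def status_after_term
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    yield          # install the handler under test
    writer.close   # signals "ready" to the parent via EOF
    sleep
  end
  writer.close
  reader.read      # blocks until the child has closed its end
  reader.close
  Process.kill("TERM", pid)
  _, status = Process.wait2(pid)
  status
end

# Handler A: catch TERM and exit with 128 + 15, roughly what the thread
# observes from a JVM child. The parent sees a normal exit.
exit_only = status_after_term do
  Signal.trap("TERM") { exit!(143) }
end
puts "exit only: signaled?=#{exit_only.signaled?} exitstatus=#{exit_only.exitstatus.inspect}"

# Handler B: do any cleanup, restore the OS default disposition, then
# re-send the signal to ourselves (the Ruby analogue of raise(3)).
reraised = status_after_term do
  Signal.trap("TERM") do
    Signal.trap("TERM", "SYSTEM_DEFAULT")
    Process.kill("TERM", Process.pid)
  end
end
puts "re-raised: signaled?=#{reraised.signaled?} termsig=#{reraised.termsig.inspect}"
```

Only the second handler leaves the parent with a status that reports termination by SIGTERM, which matches the observation that the re-raise is the piece that matters.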
Ok so I'm in a bit of an argument about this so far with the Hotspot folks in-the-know. The claim currently is that because the JVM itself already handles TERM to perform a clean shutdown of the VM, these macros are not valid:
Obviously I don't agree with all this, and here's my logic:
I've replied again for clarification. To me the two halves of this process termination state do not match, and I want to understand why.
Thanks @headius for your effort. I guess there are no further updates on this one yet?
@perlun I don't know if you read the whole thread linked by @headius or not, but it seems this didn't go anywhere on the JVM side. Here's the last message by Charlie: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2018-June/028966.html And the last message in the thread, with some ideas on how to work around the JVM: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2018-June/028967.html
Yeah this seems unlikely to be fixed at the JVM level, and at the very least this is a behavior we're not causing in any way, so there may be nothing we can do here. We are accurately reporting what the system codes report for the subprocess; it's the (Hotspot) subprocess that's mucking about with those codes. So this is, to me, a WONTFIX. I would like to see the JVM do it right, but I don't have the clout to make that happen.
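For callers who cannot change the child, one possible parent-side accommodation is to treat an exit code of 128 + N as "probably signal N", following the shell convention. This is only a sketch under that assumption: `effective_termsig` and `run_and_wait` are hypothetical helpers, not JRuby or CRuby APIs, and a program that legitimately exits with such a code would be misread.

```ruby
# Return the signal number a child most likely died from, or nil.
# Falls back to the 128 + N convention when the OS reports a normal exit.
def effective_termsig(status)
  return status.termsig if status.signaled?

  code = status.exitstatus
  return nil if code.nil? || code <= 128

  candidate = code - 128
  # Only accept values that are real signal numbers on this platform.
  Signal.list.value?(candidate) ? candidate : nil
end

# Illustrative helper: run a command, optionally signal it, return its status.
def run_and_wait(*command, signal: nil)
  pid = Process.spawn(*command)
  Process.kill(signal, pid) if signal
  _, status = Process.wait2(pid)
  status
end

puts effective_termsig(run_and_wait("sh", "-c", "exit 143")).inspect        # exit code path
puts effective_termsig(run_and_wait("sleep", "60", signal: "TERM")).inspect # real signal path
puts effective_termsig(run_and_wait("sh", "-c", "exit 1")).inspect          # ordinary failure
```

The heuristic is deliberately conservative: ordinary failure codes at or below 128 are never reinterpreted as signals.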
FWIW there may be a way to work around this using https://github.com/jruby/jruby-launcher, which boots the JVM directly with JNI and could be tweaked to un-break the signal handling for TERM by registering its own handler that re-raises it. We are not planning to work on that at this time.
Thanks for the effort anyway @headius. For better or worse, I'm not using JRuby any more (moved over to plain Java instead), so this is not critical for me to resolve any more. Anyhow, good to get the details about the problem documented here for future use - thanks for documenting things the way you've done. 👍
Environment
Provide at least:
jruby -v and command line (flags, JRUBY_OPTS, etc): 9.2.0 on JDK 8. I think the problem can also happen on other Java versions, but this particular occurrence was on Java 8.
uname -a: Linux (Travis)
Expected Behavior
This is a followup to #5134, where some of this was corrected but unfortunately not all.
This method:
...should work and handle SIGTERM correctly, which it does on MRI all the time.
Actual Behavior
Here is what I get. But note: only sometimes. 🤔