SIGTERM still has intermittently wrong behavior w/ $CHILD_STATUS.termsig #5224

perlun opened this issue Jun 21, 2018 · 19 comments

perlun commented Jun 21, 2018

Environment


  • JRuby version (jruby -v) and command line (flags, JRUBY_OPTS, etc.): 9.2.0, JDK 8. I think the problem can happen on other Java versions as well, but this particular occurrence was on Java 8.
  • Operating system and platform (e.g. uname -a): Linux (Travis)

Expected Behavior

This is a follow-up to #5134, where some of this was corrected, but unfortunately not all of it.

This method:

def run_command(command_line)
  result = system(command_line)

  if result.nil?
    raise 'Command execution failed'
  elsif result == false
    if $CHILD_STATUS.termsig == Signal.list['TERM']
      # This is fine; the process self-terminated as expected
    else
      raise "Command execution failed; return status was #{$CHILD_STATUS}"
    end
  end
end

...should work and handle SIGTERM correctly, which it does on MRI, every time.
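For reference, here is a minimal way to see the expected MRI behavior for a TERM-killed child (an untested sketch; note that $CHILD_STATUS is the require 'English' alias for $?):

require 'English'

# Run a child that sends itself SIGTERM and then waits for delivery.
result = system(%q{ruby -e "Process.kill(:TERM, Process.pid); sleep"})

result                                        # => false on MRI
$CHILD_STATUS.signaled?                       # => true  (killed by a signal)
$CHILD_STATUS.termsig == Signal.list['TERM']  # => true  (15)
$CHILD_STATUS.exitstatus                      # => nil   (not a normal exit)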

Actual Behavior

Here is what I get. But note: only sometimes. 🤔

Command execution failed; return status was pid 7066 exit 143
/home/travis/build/ecraft/uxfactory-server/Rakefile:24:in `run_command'
/home/travis/build/ecraft/uxfactory-server/Rakefile:40:in `block in (root)'
/home/travis/build/ecraft/uxfactory-server/vendor/bundle/jruby/2.5.0/gems/rake-12.3.1/exe/rake:27:in `<main>'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/cli/exec.rb:1:in `(root)'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/cli/exec.rb:74:in `kernel_load'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/cli/exec.rb:28:in `run'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/cli.rb:424:in `exec'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/vendor/thor/lib/thor/invocation.rb:126:in `invoke_command'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/vendor/thor/lib/thor.rb:387:in `dispatch'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/cli.rb:27:in `dispatch'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/vendor/thor/lib/thor/base.rb:466:in `block in start'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/cli.rb:18:in `start'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/exe/bundle:30:in `<main>'
/home/travis/.rvm/gems/jruby-9.2.0.0/gems/bundler-1.16.2/lib/bundler/friendly_errors.rb:124:in `<eval>'
/home/travis/.rvm/gems/jruby-9.2.0.0/bin/jruby_executable_hooks:15:in `<main>'

headius commented Jun 25, 2018

Sorry, I missed your comment on the other issue; I did not notice we still had some issues. Poking around...


headius commented Jun 25, 2018

Ok so as far as I can tell we are doing the right thing on the child side; the exit status is 15 + 128 (143) matching what MRI (and other processes) do for a signal-related termination.

The issue with the spec for exitstatus seems to be in the parent; if the child did not terminate normally, exitstatus should return nil, but we're still returning the raw status code.

1)
Process::Status#exitstatus for a child that raised SignalException returns nil FAILED
Expected 143
 to equal nil

Your case above appears to be similar in that it's seeing the raw return code rather than the signal status.
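For comparison, the semantics that spec checks can be reproduced on MRI with something like this (untested sketch):

pid = Process.spawn('sleep 60')
Process.kill(:TERM, pid)
_, status = Process.wait2(pid)

status.exited?     # => false -- terminated by a signal, not a normal exit
status.signaled?   # => true
status.termsig     # => 15    -- SIGTERM
status.exitstatus  # => nil   -- only set when exited? is true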


headius commented Jun 25, 2018

These results are very confusing. Even though it seems like we're doing the right thing in the child, just swapping from CRuby to JRuby in the child (using a CRuby parent) causes the following results:

[] ~/projects/jruby $ rvm ruby-2.5 do ruby sig.rb ; echo $?
Terminated: 15
143

[] ~/projects/jruby $ jruby sig.rb ; echo $?
143

[] ~/projects/jruby $ rvm ruby-2.5 do ruby -e "system 'ruby sig.rb'; \$?.tap {|s| p s.to_i, s.stopsig, s.termsig, s.exitstatus }"
15
nil
15
nil

[] ~/projects/jruby $ rvm ruby-2.5 do ruby -e "system 'jruby sig.rb'; \$?.tap {|s| p s.to_i, s.stopsig, s.termsig, s.exitstatus }"
36608
nil
nil
143

The 36608 value here appears to be the return code shifted left 8 bits, which as far as I can tell from CRuby source should only happen for failed spawns, and that's still on the parent side. With our child returning a proper error code I'm baffled why CRuby would have such different results here. What am I missing?
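For reference, under the usual POSIX wait(2) encoding the raw status word decodes like this (a sketch; the exact layout is platform-specific):

status = 36608
(status >> 8) & 0xff  # => 143 -- WEXITSTATUS: exit code of a child that exited normally
status & 0x7f         # => 0   -- WTERMSIG: zero, i.e. not killed by a signal

# versus the CRuby child above, where to_i was 15:
15 & 0x7f             # => 15  -- WTERMSIG == SIGTERM, so termsig is 15 and exitstatus is nil

In other words, the parent sees the JRuby child as a process that exited normally with code 143, not as one killed by SIGTERM.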


headius commented Jun 25, 2018

FTR, using the native launcher instead of the bash script does not fix the problem:

[] ~/projects/jruby $ gem install jruby-launcher
Building native extensions. This could take a while...
Successfully installed jruby-launcher-1.1.5-java
1 gem installed

[] ~/projects/jruby $ rvm ruby-2.5 do ruby -e "system 'jruby sig.rb'; \$?.tap {|s| p s.to_i, s.stopsig, s.termsig, s.exitstatus }"
36608
nil
nil
143


headius commented Jun 25, 2018

Also, I don't know what would cause this to intermittently fail for you, unless the child process is only sometimes exiting with a signal return code. As far as I can see, that would only happen if some other error were raised or the process terminated normally.


headius commented Jun 25, 2018

Worth pointing out that a JRuby parent with a CRuby child seems to work fine, making this even more baffling to me:

[] ~/projects/jruby $ jruby -e "system 'rvm ruby-2.5 do ruby sig.rb'; \$?.tap {|s| p s.to_i, s.stopsig, s.termsig, s.exitstatus }"
15
nil
15
nil


headius commented Jun 25, 2018

Ah-ha...I think I figured it out. The JVM's standard mechanism for exiting basically just calls the system exit call, which apparently does not set process statuses correctly?

I found this link: https://www.gnu.org/software/libc/manual/html_node/Termination-in-Handler.html

With this text:

/* Now reraise the signal.  We reactivate the signal’s
   default handling, which is to terminate the process.
   We could just call exit or abort,
   but reraising the signal sets the return status
   from the process correctly. */
signal (sig, SIG_DFL);
raise (sig);

So I think the right thing to do is re-propagate the signal using equivalent code in our Main class so the remaining bits (whatever they are) get set right.
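Roughly the same pattern expressed in Ruby would look something like this (untested sketch): handle TERM, do any cleanup, then restore the default disposition and re-send the signal so the wait status reflects a signal death rather than a plain exit.

Signal.trap('TERM') do
  # ... cleanup ...
  Signal.trap('TERM', 'DEFAULT')     # restore the system-default handler
  Process.kill('TERM', Process.pid)  # re-deliver TERM so termination is signal-based
end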


headius commented Jun 25, 2018

Ok, I added raise to jnr/jnr-posix and I'm calling it...but it doesn't seem to be helping. I am confused about what the remaining exit logic should do; in CRuby it seems to always return either 0 or 1, but then the exit status of the process shows up properly as the masked signal. I am missing something.

headius added a commit to jnr/jnr-posix that referenced this issue Jun 25, 2018

headius commented Jun 25, 2018

Whew, ok...I think I've figured out that this is not our fault. Here's an example using no JRuby at all...the subprocess is a Java program that just loops forever until killed or terminated. The resulting exit status is the same as JRuby.

[] ~/projects/jruby $ rvm ruby-2.5 do ruby -e "system 'java Blah'; \$?.tap {|s| p s.to_i, s.stopsig, s.termsig, s.exitstatus }"
36608
nil
nil
143

[] ~/projects/jruby $ cat Blah.java
public class Blah {
    public static void main(String[] args) {
        while (true) { }
    }
}

So it seems like the standard exit/termination process for a JVM does not do the right thing for SIGTERM at the very least.


headius commented Jun 25, 2018

I've posted a question about this to an internal Red Hat JVM maintainers list for more help.


perlun commented Jun 26, 2018

@headius wrote:

I've posted a question about this to an internal Red Hat JVM maintainers list for more help.

Thanks a lot for all the help in this! Please keep us posted if you find something that can help.

I looked at the logs for the intermittent errors, but I can't even seem to find the error occurring any more. Rather, I see rspec errors that fail because we randomize the order. 😏 So I'm not sure about the “intermittent” part any more.

Either way, we can both see above that there seems to be an issue with this, be it intermittent or not.


headius commented Jun 26, 2018

Ok so I've posted to a public list because the folks inside Red Hat suggested I do so: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2018-June/028864.html

This removes Ruby from the equation altogether, and I think shows pretty poor behavior on the part of Hotspot.

One of the Red Hat folks did discover that the -XX:+ReduceSignalUsage JVM flag appears to make things behave better. However, that alone does not appear to be enough; something in my local patch is also necessary. With my local patch and that flag, the specs pass and exit statuses appear to work properly.
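For anyone who wants to experiment, passing that flag through to the JVM from the jruby command line should look something like this (untested here):

jruby -J-XX:+ReduceSignalUsage sig.rb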

I'll keep ya posted!


headius commented Jun 26, 2018

It appears the magic piece of my patch is actually calling the raise(3) function to allow the system to also handle the SIGTERM. Without that the exit status is still borked.

So perhaps Hotspot JVM is not properly propagating the signal to the system-default handler?


headius commented Jun 28, 2018

Ok, so I'm in a bit of an argument about this so far with the Hotspot folks in the know.

The claim currently is that because the JVM itself already handles TERM to perform a clean shutdown of the VM, these macros are not valid:

  • Because the termination signal was handled by Hotspot, it was not a termination due to an unhandled signal. So exited? is true and termsig is zero.
  • stopsig is nonsense but that's OK because exited? was true indicating a normal termination.
  • The command-line exit result of 143 is correct (as we know). And the exitstatus of 143 is also correct because the JVM did explicitly call exit(143).

Obviously I don't agree with all this, and here's my logic:

  • The exit result at command line (143) was non-zero, and equal to 128+N where N is a signal. That means termination was at least because of that signal, and at most that termination was not normal.
  • Because the termination was due to a signal, either termsig or stopsig should reflect that value.
  • If the termination is considered abnormal, then exited? should also be false.

I've replied again for clarification. To me the two halves of this process termination state are not matching, and I want to understand why.


perlun commented Aug 13, 2018

Thanks @headius for your effort. I guess there are no further updates yet on this one?


dentarg commented May 3, 2019

@perlun I don't know whether you read the whole thread linked by @headius, but it seems this didn't go anywhere on the JVM side.

Here's the last message by Charlie: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2018-June/028966.html

And the last message in the thread, with some ideas on how to work around the JVM behavior: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2018-June/028967.html


headius commented May 14, 2019

Yeah this seems unlikely to be fixed at the JVM level, and at the very least this is a behavior we're not causing in any way, so there may be nothing we can do here. We are accurately reporting what the system codes report for the subprocess; it's the (Hotspot) subprocess that's mucking about with those codes.

So this is, to me, a WONTFIX. I would like to see the JVM do it right, but I don't have the clout to make that happen.

headius closed this as completed May 14, 2019
headius added this to the Won't Fix milestone May 14, 2019

headius commented May 14, 2019

FWIW there may be a way to work around this using https://github.com/jruby/jruby-launcher, which boots the JVM directly with JNI and could be tweaked to un-break the signal handling for TERM by registering its own handler that re-raises it. We are not planning to work on that at this time.
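In the meantime, a purely application-level workaround (an untested sketch, not anything JRuby provides) is to also accept an exit status of 128 + SIGTERM from JVM children, since the JVM reports the TERM-triggered shutdown as a plain exit(143):

require 'English'

TERM_EXIT_STATUS = 128 + Signal.list['TERM'] # => 143

def terminated_by_term?(status)
  # Real signal termination (e.g. an MRI child)...
  status.termsig == Signal.list['TERM'] ||
    # ...or a JVM child that caught TERM, ran its shutdown hooks, and called exit(143).
    status.exitstatus == TERM_EXIT_STATUS
end

# e.g. in the run_command above: terminated_by_term?($CHILD_STATUS)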


perlun commented May 16, 2019

Thanks for the effort anyway, @headius. For better or worse, I'm not using JRuby any more (I moved over to plain Java instead), so this is no longer critical for me to resolve. Anyhow, it's good to have the details about the problem documented here for future reference - thanks for writing it all up. 👍
