StackOverflowError on repeated health checks #5228

egomolka · 2018-06-22T22:09:12Z

Hi:
I set up a middleware health check whick eventually triggers a stack overflow on JRuby.
The same health check works flawlessly on MRI Ruby.
I followed a template from ThisData.com: https://thisdata.com/blog/making-a-rails-health-check-that-doesnt-hit-the-database/

This health check is designed to have a very light touch, and doesn't access the database. I have tried two approaches; one matches the "ThisData" approach exactly, while the other makes a minor change in the health check class (the health check class file is attached).

I have used JMeter to have a single thread access the /health_check URL 1500 times, with a delay of 0.5 after each attempt.
I have also used a 1 second delay. The stack overflow typically occurs consistently shortly after 1100 accesses.

My production servers are on FreeBSD. I use HAProxy to perform the health checks there, and have used intervals up to 60 seconds. Regardless of the interval, the Java process eventually crashes, roughly in proportion to the check interval, with a signal 6 error:

Jun 18 04:28:31 appserver2 kernel: pid 25483 (java), uid 80: exited on signal 6

JRuby version: JRuby 9.1.15.0/JRuby 9.2.0.0 (tested on both with identical results)
JAVA_OPTS="-Xmn1024m -ms3072m -mx3072m" (I have tried others)
Rails: 4.2.10

Production environment

FreeBSD: 11.1-RELEASE-p10
Java version on FreeBSD:
openjdk version "1.8.0_162"
OpenJDK Runtime Environment (build 1.8.0_162-b12)
OpenJDK 64-Bit Server VM (build 25.162-b12, mixed mode

Development environment

Debian: Linux bdrld-debian 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
Java version on Debian:
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

I have attached a zip file containing the following:

JMeter test plan
JMeter results for JRuby
JMeter results for MRI Ruby
Ruby class file for the health check
Exception details when I try to access the health check page after JMeter starts reporting errors
Exception details when I try to access the login page after JMeter starts reporting errors

health_check_failure.zip
health_check_failure.zip

The text was updated successfully, but these errors were encountered:

headius · 2018-06-25T18:07:02Z

Well it sounds like something is causing it to build a deeper and deeper stack as it goes and eventually at 1100 or so hits it's too deep to execute.

Looking at the stack trace, I see a lot of this:

$_dot_rbenv.versions.jruby_minus_9_dot_2_dot_0_dot_0.lib.ruby.gems.shared.gems.rack_minus_1_dot_6_dot_10.lib.rack.body_proxy_dot_rb.invokeOther7:respond_to?(/home/edg/.rbenv/versions/jruby-9.2.0.0/lib/ruby/gems/shared/gems/rack-1.6.10/lib/rack/body_proxy.rb:9)
home.edg.$_dot_rbenv.versions.jruby_minus_9_dot_2_dot_0_dot_0.lib.ruby.gems.shared.gems.rack_minus_1_dot_6_dot_10.lib.rack.body_proxy_dot_rb.RUBY$method$respond_to?$0(/home/edg/.rbenv/versions/jruby-9.2.0.0/lib/ruby/gems/shared/gems/rack-1.6.10/lib/rack/body_proxy.rb:9)

From this is looks like a respond_to? in Rack is calling back to itself infinitely. The impl of that method is here:

    def respond_to?(*args)
      return false if args.first.to_s =~ /^to_ary$/
      super or @body.respond_to?(*args)
    end

It would seem to me that the @body here is getting wrapped and re-wrapped with proxies repeatedly until it's simply too deep. But I'm not sure how that would happen. Does something about the health check cause a new proxy to be added every time? Is this middleware being added over and over again perhaps?

headius · 2018-06-25T18:10:14Z

A few ideas for investigating this:

pp the @Body and confirm that the chain is growing.
Track creations of the BodyProxy and see if more are being created than should be. Throw a caller dump in there to show the stack where it's being created too much.

Note that the rack code is not significantly different in recent versions.

egomolka · 2018-06-25T22:59:21Z

I looked at a dump the other day, and saw only one copy of the MiddlewareHealthCheck object; however,
I ran more stress tests against MRI Ruby today, and was able to trigger an error on the 2'727th access. I ran the test twice, and got the same error at exactly the same point.
Given that I have succeeded in replicating the problem on MRI Ruby. I am closing the issue.
Thanks for your comments. I'll bear them in mind as I dig deeper into this.

headius · 2018-06-25T23:02:20Z

Thank you for verifying on MRI...I'm glad it did not turn out to be a bug in JRuby.

egomolka closed this as completed Jun 25, 2018

enebo added this to the Invalid or Duplicate milestone Nov 1, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StackOverflowError on repeated health checks #5228

StackOverflowError on repeated health checks #5228

egomolka commented Jun 22, 2018

headius commented Jun 25, 2018

headius commented Jun 25, 2018

egomolka commented Jun 25, 2018

headius commented Jun 25, 2018

StackOverflowError on repeated health checks #5228

StackOverflowError on repeated health checks #5228

Comments

egomolka commented Jun 22, 2018

Production environment

Development environment

headius commented Jun 25, 2018

headius commented Jun 25, 2018

egomolka commented Jun 25, 2018

headius commented Jun 25, 2018