Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StackOverflowError on repeated health checks #5228

Closed
egomolka opened this issue Jun 22, 2018 · 4 comments
Closed

StackOverflowError on repeated health checks #5228

egomolka opened this issue Jun 22, 2018 · 4 comments

Comments

@egomolka
Copy link

Hi:
I set up a middleware health check whick eventually triggers a stack overflow on JRuby.
The same health check works flawlessly on MRI Ruby.
I followed a template from ThisData.com: https://thisdata.com/blog/making-a-rails-health-check-that-doesnt-hit-the-database/

This health check is designed to have a very light touch, and doesn't access the database. I have tried two approaches; one matches the "ThisData" approach exactly, while the other makes a minor change in the health check class (the health check class file is attached).

I have used JMeter to have a single thread access the /health_check URL 1500 times, with a delay of 0.5 after each attempt.
I have also used a 1 second delay. The stack overflow typically occurs consistently shortly after 1100 accesses.

My production servers are on FreeBSD. I use HAProxy to perform the health checks there, and have used intervals up to 60 seconds. Regardless of the interval, the Java process eventually crashes, roughly in proportion to the check interval, with a signal 6 error:

Jun 18 04:28:31 appserver2 kernel: pid 25483 (java), uid 80: exited on signal 6

JRuby version: JRuby 9.1.15.0/JRuby 9.2.0.0 (tested on both with identical results)
JAVA_OPTS="-Xmn1024m -ms3072m -mx3072m" (I have tried others)
Rails: 4.2.10

Production environment

FreeBSD: 11.1-RELEASE-p10
Java version on FreeBSD:
openjdk version "1.8.0_162"
OpenJDK Runtime Environment (build 1.8.0_162-b12)
OpenJDK 64-Bit Server VM (build 25.162-b12, mixed mode

Development environment

Debian: Linux bdrld-debian 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
Java version on Debian:
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

I have attached a zip file containing the following:

  • JMeter test plan
  • JMeter results for JRuby
  • JMeter results for MRI Ruby
  • Ruby class file for the health check
  • Exception details when I try to access the health check page after JMeter starts reporting errors
  • Exception details when I try to access the login page after JMeter starts reporting errors

health_check_failure.zip
health_check_failure.zip

@headius
Copy link
Member

headius commented Jun 25, 2018

Well it sounds like something is causing it to build a deeper and deeper stack as it goes and eventually at 1100 or so hits it's too deep to execute.

Looking at the stack trace, I see a lot of this:

$_dot_rbenv.versions.jruby_minus_9_dot_2_dot_0_dot_0.lib.ruby.gems.shared.gems.rack_minus_1_dot_6_dot_10.lib.rack.body_proxy_dot_rb.invokeOther7:respond_to?(/home/edg/.rbenv/versions/jruby-9.2.0.0/lib/ruby/gems/shared/gems/rack-1.6.10/lib/rack/body_proxy.rb:9)
home.edg.$_dot_rbenv.versions.jruby_minus_9_dot_2_dot_0_dot_0.lib.ruby.gems.shared.gems.rack_minus_1_dot_6_dot_10.lib.rack.body_proxy_dot_rb.RUBY$method$respond_to?$0(/home/edg/.rbenv/versions/jruby-9.2.0.0/lib/ruby/gems/shared/gems/rack-1.6.10/lib/rack/body_proxy.rb:9)

From this is looks like a respond_to? in Rack is calling back to itself infinitely. The impl of that method is here:

    def respond_to?(*args)
      return false if args.first.to_s =~ /^to_ary$/
      super or @body.respond_to?(*args)
    end

It would seem to me that the @body here is getting wrapped and re-wrapped with proxies repeatedly until it's simply too deep. But I'm not sure how that would happen. Does something about the health check cause a new proxy to be added every time? Is this middleware being added over and over again perhaps?

@headius
Copy link
Member

headius commented Jun 25, 2018

A few ideas for investigating this:

  • pp the @Body and confirm that the chain is growing.
  • Track creations of the BodyProxy and see if more are being created than should be. Throw a caller dump in there to show the stack where it's being created too much.

Note that the rack code is not significantly different in recent versions.

@egomolka
Copy link
Author

I looked at a dump the other day, and saw only one copy of the MiddlewareHealthCheck object; however,
I ran more stress tests against MRI Ruby today, and was able to trigger an error on the 2'727th access. I ran the test twice, and got the same error at exactly the same point.
Given that I have succeeded in replicating the problem on MRI Ruby. I am closing the issue.
Thanks for your comments. I'll bear them in mind as I dig deeper into this.

@headius
Copy link
Member

headius commented Jun 25, 2018

Thank you for verifying on MRI...I'm glad it did not turn out to be a bug in JRuby.

@enebo enebo added this to the Invalid or Duplicate milestone Nov 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants