OOM due to unbounded rescuePCs growth #4865

andrewdotn · 2017-11-23T23:46:10Z

Environment

Provide at least:

JRuby version (jruby -v) and command line (flags, JRUBY_OPTS, etc)

  $ jruby -v
  jruby 9.1.14.0 (2.3.3) 2017-11-08 2176f24 Java HotSpot(TM) 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [darwin-x86_64]

Operating system and platform (e.g. uname -a)

Observed on both CentOS 7 and Mac OS 10.13.1

Actual Behavior

The following short program, extracted from a much larger application, runs out of memory due to unbounded growth in StartupInterpreterEngine.rescuePCs.

$ cat foo.rb 
require_relative 'bar'

run(Queue.new)
$ cat bar.rb 
def run(work_queue)
  while true
    begin
      work = work_queue.pop(true)
    rescue ThreadError => e
      next
    end
  end
end
$ time JRUBY_OPTS="-w -J-Xmx32m" jruby foo.rb
Error: Your application used more memory than the safety cap of 32M.
Specify -J-Xmx####M to increase it (#### = cap size in MB).
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3210)
	at java.util.Arrays.copyOf(Arrays.java:3181)
	at java.util.Vector.grow(Vector.java:266)
	at java.util.Vector.ensureCapacityHelper(Vector.java:246)
	at java.util.Vector.addElement(Vector.java:620)
	at java.util.Stack.push(Stack.java:67)
	at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:101)
	at org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:84)
	at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:179)
	at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:165)
	at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:200)
	at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:318)
	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:155)
	at foo.invokeOther3:run(foo.rb:3)
	at foo.RUBY$script(foo.rb:3)
	at java.lang.invoke.LambdaForm$DMH/989110044.invokeStatic_L7_L(LambdaForm$DMH)
	at java.lang.invoke.LambdaForm$BMH/1182461167.reinvoke(LambdaForm$BMH)
	at java.lang.invoke.LambdaForm$MH/198761306.invoker(LambdaForm$MH)
	at java.lang.invoke.LambdaForm$MH/1058025095.invokeExact_MT(LambdaForm$MH)
	at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
	at org.jruby.ir.Compiler$1.load(Compiler.java:95)
	at org.jruby.Ruby.runScript(Ruby.java:828)
	at org.jruby.Ruby.runNormally(Ruby.java:747)
	at org.jruby.Ruby.runNormally(Ruby.java:765)
	at org.jruby.Ruby.runFromMain(Ruby.java:578)
	at org.jruby.Main.doRunFromMain(Main.java:417)
	at org.jruby.Main.internalRun(Main.java:305)
	at org.jruby.Main.run(Main.java:232)
	at org.jruby.Main.main(Main.java:204)

real	0m41.864s
user	0m51.344s
sys	0m0.495s
Returned 1.

enebo · 2017-11-24T15:02:04Z

Wow, this one is pretty odd. It does not happen without the second file? I really expected this to be a case where the next somehow misses a push matching our pop, but it looks like it works OK if called from a lexical parent and fails only if it is not.

andrewdotn · 2017-11-24T20:37:57Z

It doesn’t seem related to lexical scope, but to compiling. If I disable compilation, a second file isn’t needed:

$ cat foo.rb
while true
  begin
    raise "oops"
  rescue => e
    next
  end
end
$ jruby -J-Djruby.ir.debug -J-Djruby.compile.mode=OFF foo.rb 2>&1 | grep RPCs=1000
[main] INFO Interpreter : I: {15} toggle_backtrace(;true); <#RPCs=1000>
...

enebo · 2017-11-24T20:51:11Z

@andrewdotn ah yeah we compile main file by default. Thanks for that update.

andrewdotn · 2017-11-24T23:21:46Z

I also see leaks with

while true
  while true
    begin
      raise "oops"
    rescue => e
      break
    end
  end
end

and

while true
  begin
    raise "oops"
  rescue => e
    redo
  end
end

I’ve been trying to make IRBuilder add an extra ExceptionRegionEndMarkerInstr before the jump in these cases. That stops the leaks on the simple examples here, but it causes real applications to break later in CFG, and I haven’t dug into that yet.

headius · 2017-11-27T15:54:00Z

This seems like a basic VM-level issue in our IR interpreter. The logic as compiled into JVM bytecode works just fine, because the management of try/catch regions is handled properly by the JVM. Any branch out of the "try" area automatically leaves that area, while for us it must be done manually at each exit point.

enebo · 2017-11-27T17:30:08Z

At first glance, we create begin/end exc ranges, which is great for building our CFG, but then we try to leverage these as actual usable instrs in our startup interp. Their design is not really intended to pair up (e.g. start called, then end called), as they are really for marking regions.

This is really only a problem in these cases that are infinite. It is also mildly ironic that we do not compile methods like this because they never exit so it is sort of an unhappy combination.
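The effect of leaving a region through a jump that skips the end marker can be reproduced in a toy model. This is only a sketch: the instruction names, the `PROGRAM` layout, and `run_naive` are made up for illustration and are not JRuby's IR or interpreter code.

```ruby
# Toy model only: hypothetical instruction names, not JRuby's real IR.
# The program mimics a loop body that enters a rescue region and then
# leaves it with a jump (like `next`) instead of reaching the end marker.
PROGRAM = [
  [:start_region, :handler], # 0: entering `begin` pushes a rescue target
  [:jump, 0],                # 1: branch out of the region, back to the loop head
  [:end_region]              # 2: would pop, but the jump above never reaches it
].freeze

# Executes `steps` instructions and returns the rescue stack depth.
def run_naive(program, steps)
  rescue_pcs = []
  pc = 0
  steps.times do
    op, arg = program[pc]
    case op
    when :start_region
      rescue_pcs.push(arg)
      pc += 1
    when :end_region
      rescue_pcs.pop
      pc += 1
    when :jump
      pc = arg
    end
  end
  rescue_pcs.size
end

# Two instructions per loop iteration, one push each, and no pops.
puts run_naive(PROGRAM, 2000)
```

In this model the stack depth equals the number of loop iterations, which is the same shape of growth the issue reports for rescuePCs.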

enebo · 2017-11-27T18:00:48Z

More notes. I tried pushing a new start exc only if it was different than top of stack (based on idea that lexically you could never raise to the same place twice) but this had issues with us prematurely popping too many elements. I cannot just ignore empty stack because I might have prematurely nuked a rescue ip we need.

@subbuss suggested another hack would be for start/end to both record the label they pair with and make sure to only pop if the push label matches.

subbuss · 2017-11-27T18:20:11Z

Deleted my previous comment because I don't think that simplified logic is right ... but yes, this is a problem of retrofitting the startup interpreter on top of the IR meant for CFG construction. The hacky soln should work for now. Will let @enebo test it. If there is a cleaner non-hacky soln that we can think of, we can use it instead.

The comment in this commit hopefully spells out how the solution works but one tl;dr is we cannot guarantee start and end exception region instructions will occur in the presence of branches or jumps. The workaround is to clean up the stack whenever we decide to add a new exception region. This had a small side-benefit of removing a hack we must have added when seeing a jump-like instr just doing a pop and hoping for the best.
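A minimal sketch of the "clean up the stack when adding a new region" idea, assuming each region can be identified by some key. The helper name `push_with_cleanup` and the symbol keys are hypothetical, and the real interpreter is written in Java, so this only models the behavior described above.

```ruby
# Hypothetical helper, not JRuby's code: push a region's rescue target,
# first discarding any stale copy of it and everything stacked above it.
def push_with_cleanup(stack, target)
  existing = stack.index(target)
  stack.slice!(existing..-1) if existing
  stack.push(target)
end

# A loop that re-enters the same region without ever popping.
stack = []
1000.times { push_with_cleanup(stack, :handler) }
puts stack.size

# Jumping back to an outer region start also drops the stale inner entry.
nested = []
push_with_cleanup(nested, :outer)
push_with_cleanup(nested, :inner)
push_with_cleanup(nested, :outer)
p nested
```

With this rule a missing end marker no longer matters for loops, because re-entering a region replaces its old entry instead of stacking a new one on top.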

enebo · 2017-11-27T20:26:36Z

@andrewdotn since you were digging in last week I take it you can test this and give it the gold seal of approval :)

Looking at this problem in retrospect it seems obvious our solution was pretty lacking but we would only be able to observe this in an endless loop function which was not the main program file (which we compile by default). Even a huge finite loop I bet took a long time to OOM.

headius · 2017-11-27T22:54:59Z

@andrewdotn Can you provide a test or spec that exercises this? Probably as part of either spec/compiler/general_spec.rb or something in test/jruby.

enebo · 2017-11-27T23:49:09Z

I don't think this is amenable to testing. Even with 32m heap it takes like 40 seconds to OOME

headius · 2017-11-28T00:33:18Z

@enebo We can write a test that checks the stack size, though. Maybe it should be in Java.

enebo · 2017-11-28T01:05:40Z

@headius it is a local variable.

headius · 2017-11-28T02:11:48Z

@enebo Ahh I didn't notice it was not part of the interpreter's instance state.

I tweaked the given script a bit to reduce allocation and eliminate backtrace generation, and it only takes about 5s to OOM with -Xmx32m.

args = [Exception, "foo", []]
while true
  begin
    raise *args
  rescue java::lang::OutOfMemoryError
    exit 1
  rescue Exception
    next
  end
end

It's still a bit clunky for a test.

andrewdotn · 2017-11-28T04:17:02Z

Wow, thanks for the amazing turnaround time on this! I can confirm that your fix works. When I was debugging this, I added a check inside StartupInterpreterEngine.interpret() that would panic if rescuePCs grew larger than 100 elements. I think there could be a Java unit test along those lines, e.g., by checking that after 100 loops rescuePCs was a sane size. It would need rescuePCs to be exposed for testing.

If I was adding a test for my own code here, and I wasn’t worried about performance, I’d add a Consumer<Integer> rescuePCsListenerForTesting parameter, along with an overload that would keep the existing signature by defaulting the new parameter to null. If non-null, the consumer would be passed the size of rescuePCs on every push, so that the test could verify it didn’t grow excessively during the loop. I know that would be inelegant but it would be fast and it would work.
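The listener idea can be sketched as follows, transposed to Ruby to match the scripts in this thread (the Java version would take a `Consumer<Integer>`). Everything here is an assumption for illustration: `toy_interpret`, its `prune:` switch, and the `:handler` key stand in for the real interpreter loop and are not JRuby APIs.

```ruby
# Toy stand-in for the interpreter loop; all names here are hypothetical.
# `on_push`, when given, receives the stack depth after every push.
def toy_interpret(iterations, prune: true, on_push: nil)
  rescue_pcs = []
  iterations.times do
    if prune && (existing = rescue_pcs.index(:handler))
      rescue_pcs.slice!(existing..-1) # drop the stale entry before re-pushing
    end
    rescue_pcs.push(:handler)
    on_push&.call(rescue_pcs.size)
  end
end

# Test side: record the deepest stack seen during 100 loop iterations.
max_depth = 0
toy_interpret(100, on_push: ->(depth) { max_depth = [max_depth, depth].max })
puts max_depth <= 10 ? "stack stayed bounded" : "stack grew to #{max_depth}"
```

Defaulting the listener to nil keeps existing callers unchanged, which mirrors the overload described above.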

…understand in original fix is that GEB and exception region for exceptions raised in ensures is that they all push the same label and nest. The fix is simple: we now capture the instr itself, since it is unique, and continue using the pruning technique. The original solution's only mistake was not realizing we would nest regions to the same destination.
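To illustrate why the pruning key matters when nested regions share a destination, here is a small sketch. The `Region` struct, `push_pruned`, and the two comparison lambdas are hypothetical stand-ins; they model the described behavior and are not taken from the JRuby source.

```ruby
# Hypothetical stand-in for a region-start instruction carrying its label.
Region = Struct.new(:label)

# Prune-then-push, with the notion of "same region" supplied by the caller.
def push_pruned(stack, region, same)
  existing = stack.index { |r| same.call(r, region) }
  stack.slice!(existing..-1) if existing
  stack.push(region)
end

by_label    = ->(a, b) { a.label == b.label } # keyed on the rescue destination
by_identity = ->(a, b) { a.equal?(b) }        # keyed on the instruction object

# Two distinct nested regions that rescue to the same destination.
outer = Region.new(:shared_target)
inner = Region.new(:shared_target)

label_stack = []
push_pruned(label_stack, outer, by_label)
push_pruned(label_stack, inner, by_label)    # wrongly prunes the outer entry

identity_stack = []
push_pruned(identity_stack, outer, by_identity)
push_pruned(identity_stack, inner, by_identity)

puts "by label: #{label_stack.size}, by identity: #{identity_stack.size}"
```

Keyed by label, entering the inner region discards the outer entry, so a later pop for the inner region would leave nothing for the outer one. Keyed by object identity, both entries survive.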

enebo · 2017-11-28T16:26:27Z

Ok, so I found another issue, but my confidence is quite a bit higher now. The reason I see no value in testing the stack size for unbounded growth is that the algorithm on insertion looks to see if the entry is already on the stack and, if so, potentially reduces the size of the stack (it only grows if the entry is not present). So there should never be unbounded growth from that alone. Also, we only bother to add these stack elements when we traverse a lexical section of code, so I am super confident the stack will never hold more than n entries for n lexical nestings.

enebo added the ir label Nov 24, 2017

enebo added this to the JRuby 9.1.15.0 milestone Nov 24, 2017

enebo closed this as completed in 98d1074 Nov 27, 2017

enebo added a commit that referenced this issue Nov 27, 2017

Jump no longer needs extra boolean parameter. This is further cleanup…

54f890d

… from fixing #4865.

enebo added a commit that referenced this issue Nov 27, 2017

Jump no longer needs extra boolean parameter. This is further cleanup…

545ca62

… from fixing #4865.

enebo mentioned this issue Jan 5, 2018

Regression in 9.1.15.0 with some ensure blocks being executed twice #4895

Closed
