Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maximum stack depth appears to have worsened in 9k #3810

Closed
headius opened this issue Apr 19, 2016 · 12 comments
Closed

Maximum stack depth appears to have worsened in 9k #3810

headius opened this issue Apr 19, 2016 · 12 comments

Comments

@headius
Copy link
Member

headius commented Apr 19, 2016

Tested against JRuby 9.1, but JRuby 9.0.x likely also suffer from this.

I ran some numbers to test stack depth for #3741. We do poorly in JRuby 9.1 and should try to improve this.

All on Java 8u60, with the following code:

def foo(a)
  if a > 0
    foo(a - 1)
  else
    puts :success
  end
end
foo(N)

Here's interpreter. In 9.1 it is only the simple interpreter.

JRuby 1.7.24 JRuby 9.1
Default flags 650 700
with -Xss2048k 1350 1250

And with normal jit settings:

JRuby 1.7.24 JRuby 9.1
Default flags 1450 800
with -Xss2048k 3400 1700

This difference in 9.1 is a bit of a worry. The interpreter seems about the same (I'd hope it would be better, even though this is a simple AST). The compiler is significantly worse. A general reason for this may be additional frames around our IR interpreted and compiled paths that could be collapsed. The compiler may suffer from invokedynamic "lambda forms" bloating the stack for early calls.

@headius
Copy link
Member Author

headius commented Aug 22, 2016

With a simpler recursive method, current numbers for 9.1.3.0 and 1.7:

def foo(a)
  puts a
  foo(a + 1)
end

foo(1)

9.1.3.0 int: 1715-1740
1.7.25 int: 1757
9.1.3.0 jit: 1862
1.7.25 jit: 3498

The reduction in depth in the JIT is likely due to the way we're using call sites. In order to reduce the amount of bytecode in any one method body, the lazy construction and use of a CallSite is done in a synthetic method for every method call. This adds one frame compared to 1.7.

Here's the relevant traces for 9k and 1.7, interpreter and JIT:

9.1.3.0 int stack:

    at org.jruby.internal.runtime.methods.InterpretedIRMethod.INTERPRET_METHOD(InterpretedIRMethod.java:157)
    at org.jruby.internal.runtime.methods.InterpretedIRMethod.call(InterpretedIRMethod.java:148)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:197)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:161)
    at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:315)
    at org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:159)
    at org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:84)
    at org.jruby.internal.runtime.methods.InterpretedIRMethod.INTERPRET_METHOD(InterpretedIRMethod.java:157)

1.7.25 int stack:

    at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
    at org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:182)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
    at org.jruby.ast.FCallOneArgNode.interpret(FCallOneArgNode.java:36)
    at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
    at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
    at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)

Despite having a "flat" interpreter, we use more frames in the current interpreter. Many of these frames could be boiled away, perhaps at the cost of more duplication.

Here's JIT:

9.1.3.0 jit stack:

    at blah.invokeOther5:foo(blah.rb:3)
    at blah.RUBY$method$foo$0(blah.rb:3)
    at org.jruby.internal.runtime.methods.CompiledIRMethod.invokeExact(CompiledIRMethod.java:241)
    at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:106)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:197)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:161)
    at blah.invokeOther5:foo(blah.rb:3)

1.7.25 jit stack:

    at blah.method__0$RUBY$foo(blah.rb:3)
    at blah$method__0$RUBY$foo.call(blah$method__0$RUBY$foo)
    at blah$method__0$RUBY$foo.call(blah$method__0$RUBY$foo)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
    at blah.method__0$RUBY$foo(blah.rb:3)

You can see the invokeOther5:foo synthetic method in the 9k track as well as the fact that we don't generate invokers for Ruby scripts anymore; we use CompiledIRMethod plus method handles.

And this brings us to a bigger realization...if we run these numbers again in the same process, they improve.

9.1.3.0 jit, second run: 1951

They also improve even more if we run with invokedynamic and JIT:

9.1.3.0 jit + indy, first run: 1484
9.1.3.0 jit + indy, second run: 1558

And it continues to improve as the JVM JIT picks up more code.

So there's a few things we still can do to improve early stack depth in 9k.

@headius
Copy link
Member Author

headius commented Aug 22, 2016

Other notes:

  • The interpreter consumes a rather large stack frame right now, for all the register machine bits and bobs. This appears to be balanced out by the reduction in stack depth to do the actual interpretation.
  • JIT frames for Ruby methods are somewhat wider than they used to be, passing in more pieces of the method's metadata. This certainly has an effect.
  • The many sieve-like paths to get through the interpreter likely aggravate the use of a larger frame, since they all pass many of the same local variables through.

@headius headius modified the milestones: JRuby 9.1.4.0, JRuby 9.1.3.0 Aug 22, 2016
@headius
Copy link
Member Author

headius commented Aug 22, 2016

More work to come on this in 9.1.4.0.

@enebo enebo modified the milestones: JRuby 9.1.7.0, JRuby 9.1.6.0 Nov 7, 2016
headius added a commit to headius/jruby that referenced this issue Dec 13, 2016
@headius headius modified the milestones: JRuby 9.2.0.0, JRuby 9.1.7.0 Jan 4, 2017
@headius
Copy link
Member Author

headius commented Jan 4, 2017

Punting remaining work to 9.2. There are several small improvements in 9.1.7.0 that should help this, but we still have a lot of generic paths through code that result in deeper-than-necessary stacks.

@headius headius modified the milestones: JRuby 9.2.0.0, JRuby 9.2.1.0 May 15, 2018
@headius
Copy link
Member Author

headius commented Sep 19, 2018

Update since last visited.

Here's numbers on current versions (bin/jruby is 9.2.1.0, which we hope to release in the next couple weeks).

[] ~/projects/jruby $ rvm jruby-1.7.27 do ruby test_stack.rb | tail -5
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full StackOverflowError stack trace
3762
3763
3764
3765
37663766
[] ~/projects/jruby $ rvm jruby-1.7.27 do ruby --dev test_stack.rb | tail -5
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full StackOverflowError stack trace
1555
1556
1557
1558
1559

[] ~/projects/jruby $ rvm jruby-9.1.17.0 do ruby test_stack.rb | tail -5
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace
2524
2525
2526
2527
2528

[] ~/projects/jruby $ rvm jruby-9.1.17.0 do ruby --dev test_stack.rb | tail -5
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace
1468
1469
1470
1471
1472

[] ~/projects/jruby $ bin/jruby test_stack.rb | tail -5
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace
2365
2366
2367
2368
2369

[] ~/projects/jruby $ bin/jruby --dev test_stack.rb | tail -5
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace
1462
1463
1464
1465
1466

Compiled stack depth still shows degradation from 9.1 to 9.2. Interpreted may or may not show any degradation; these numbers vary depending on when then JVM JIT kicks in.

Both 9.1 and 9.2 are still much worse than 1.7.

Now look at stack with indy:

[] ~/projects/jruby $ bin/jruby -Xcompile.invokedynamic test_stack.rb | tail -5
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace
2176
2177
2178
2179
2180

[] ~/projects/jruby $ bin/jruby -Xcompile.invokedynamic test_stack.rb | tail -5
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace
2325
2326
2327
2328
2329

These numbers vary up and down by as much as 400 frames but 9.2 seems similar to 9.1.

The non-indy numbers could be improved by eliminating the synthetic method wrapping every call, but the logic for instantiating and invoking the call site would then have to live in every piece of emitted JVM bytecode, making more methods too big to load or optimize.

An all-indy approach with lighter-weight binding may be an option as well, since it should introduce no more than a couple frames, which all would be force inlined some time after the first call.

@headius
Copy link
Member Author

headius commented Sep 19, 2018

A comparison of the stack frames for 9.2 and 1.7 shows that 9.2 appears to have the same count of stack frames:

1.7:
	at test_stack.method__0$RUBY$foo(test_stack.rb:3)
	at test_stack$method__0$RUBY$foo.call(test_stack$method__0$RUBY$foo)
	at test_stack$method__0$RUBY$foo.call(test_stack$method__0$RUBY$foo)
	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
	at test_stack.method__0$RUBY$foo(test_stack.rb:3)

9.2:
	at test_stack.RUBY$method$foo$0(test_stack.rb:3)
	at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:166)
	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
	at test_stack.invokeOther2:foo(test_stack.rb:3)
	at test_stack.RUBY$method$foo$0(test_stack.rb:3)

However this is misleading since every call from CompiledIRMethod back into Ruby code is done via method handles. This may be an additional hidden consumer of stack space.

@headius headius modified the milestones: JRuby 9.2.1.0, JRuby 9.3.0.0 Sep 19, 2018
@headius
Copy link
Member Author

headius commented Mar 16, 2021

Latest numbers on Java 8. Numbers on higher JDKs are similar or slightly lower.

[] ~/projects/jruby $ jruby blah.rb | tail
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace
2412
2413
2414
2415
2416
2417
2418
2419
2420
2421

[] ~/projects/jruby $ jruby --dev blah.rb | tail
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485

Comparing with 1.7.27:

[] ~/projects/jruby $ rvm use jruby-1.7.27 do jruby blah.rb | tail
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full StackOverflowError stack trace
3165
3166
3167
3168
3169
3170
3171
3172
3173
31743174
[] ~/projects/jruby $ rvm use jruby-1.7.27 do jruby --dev blah.rb | tail
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full StackOverflowError stack trace
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559

The compiled dispatch appears to have degraded more significantly, which may be due to the use of synthetic call site methods in the bytecode. This reduces the bytecode size of jitted Ruby methods, but does add a frame to every dynamic call (on top of the frames added by the CallSite classes and the JavaMethod overloads). We may need to duplicate some logic to avoid excessively deepening the call stack in the name of code reuse.

Invokedynamic may help here once common call stacks get inlined, but until then it will aggravate the situation by requiring extra frames for LambdaForm until they coalesce.

@headius
Copy link
Member Author

headius commented Mar 16, 2021

This gets a bit misleading when looking at the actual traces, because LambdaForm frames are normally hidden from stack traces to avoid bloating them up.

1.7.27 jit mode:

	at blah.method__0$RUBY$foo(blah.rb:3)
	at blah$method__0$RUBY$foo.call(blah$method__0$RUBY$foo)
	at blah$method__0$RUBY$foo.call(blah$method__0$RUBY$foo)
	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
	at blah.method__0$RUBY$foo(blah.rb:3)

JRuby 9.3 jit mode:

	at blah.RUBY$method$foo$0(blah.rb:3)
	at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:198)
	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:172)
	at blah.invokeOther2:foo(blah.rb:3)
	at blah.RUBY$method$foo$0(blah.rb:3)

At a glance these look like they consume the same number of frames, but if we enable the extra LambdaForm frames it tells a different story:

	at blah.RUBY$method$foo$0(blah.rb:3)
	at java.lang.invoke.LambdaForm$DMH/1985836631.invokeStatic_L7_L(LambdaForm$DMH:1000028)
	at java.lang.invoke.LambdaForm$MH/1499136125.invokeExact_MT(LambdaForm$MH:1000027)
	at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:198)
	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:172)
	at blah.invokeOther2:foo(blah.rb:3)
	at blah.RUBY$method$foo$0(blah.rb:3)

In order to reduce the number of class loaded by the JIT, JRuby 9x began using method handles rather than generated stubs to bind compiled Ruby methods into a DynamicMethod object (in this case a CompiledIRMethod). Since the handle never inlines (called directly) it never eliminates these lambda frames, increasing stack depth.

We might be able to eliminate one frame by wrapping the handles in a LambdaMetaFactory interface implementation, which gives it a solid root for inlining.

@headius
Copy link
Member Author

headius commented Mar 16, 2021

Using LambdaMetaFactory does indeed seem to help the stack depth a ton:

	at blah.RUBY$method$foo$0(blah.rb:3)
	at blah$$Lambda$914/0x0000000800658040.call(<Unknown>:1000011)
	at org.jruby.dist/org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:242)
	at org.jruby.dist/org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:172)
	at blah.invokeOther2:foo(blah.rb:3)
	at blah.RUBY$method$foo$0(blah.rb:3)
3127
3128
3129
3130
3131
3132
3133
3134
3135
3136

There's a good chance this will be faster as well.

Note that invokedynamic eliminates most of these frames fairly quickly, since it can bind directly to back to the foo method... but it does not do this right away (foo needs to jit and do some recursive inlining). Here is the trace running this test once:

	at blah.RUBY$method$foo$0(blah.rb:3)
	at java.base/java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(DirectMethodHandle$Holder:1000022)
	at java.base/java.lang.invoke.LambdaForm$MH/0x000000080065dc40.invoke(LambdaForm$MH:1000052)
	at java.base/java.lang.invoke.DelegatingMethodHandle$Holder.delegate(DelegatingMethodHandle$Holder:1000019)
	at java.base/java.lang.invoke.LambdaForm$MH/0x000000080065b840.guard(LambdaForm$MH:1000082)
	at java.base/java.lang.invoke.DelegatingMethodHandle$Holder.delegate(DelegatingMethodHandle$Holder:1000019)
	at java.base/java.lang.invoke.LambdaForm$MH/0x000000080065b840.guard(LambdaForm$MH:1000082)
	at java.base/java.lang.invoke.Invokers$Holder.linkToCallSite(Invokers$Holder:1000019)
	at blah.RUBY$method$foo$0(blah.rb:3)

Stack depth maxes out at 1458. If we run it a few times, though I have seen stack depths as high as 35k, and the frames above melt away into the inlined code.

@headius
Copy link
Member Author

headius commented Mar 16, 2021

Our favorite method-dispatch benchmark shows a significant improvement dispatching through the lambda.

fib(35), jit, no indy:

BEFORE

0.532918
0.545872
0.534518
0.525865
0.526281
0.5266350000000001
0.525626

AFTER

0.447714
0.46713200000000005
0.469614
0.44882700000000003
0.449781
0.46717000000000003
0.462058
0.459892

headius added a commit to headius/jruby that referenced this issue Mar 16, 2021
This change modifies CompiledIRMethod to use a lambda-implemented
interface rather than calling through MethodHandle when the target
method handles are provided with a Lookup that can see the
compiled method class. This improves stack utilization and call
performance.

This mechanism currently only works for "def" that occurs in
compiled bytecode, since the compiled class is not visible to the
LambdaMetaFactory otherwise. Jitted methods will need to provide
their Lookup when being bound into a MixedModeIRMethod in order to
duplicate this logic.

See jruby#3810 for the initial motivation for this work.
headius added a commit to headius/jruby that referenced this issue May 17, 2021
This change modifies CompiledIRMethod to use a lambda-implemented
interface rather than calling through MethodHandle when the target
method handles are provided with a Lookup that can see the
compiled method class. This improves stack utilization and call
performance.

This mechanism currently only works for "def" that occurs in
compiled bytecode, since the compiled class is not visible to the
LambdaMetaFactory otherwise. Jitted methods will need to provide
their Lookup when being bound into a MixedModeIRMethod in order to
duplicate this logic.

See jruby#3810 for the initial motivation for this work.
@headius
Copy link
Member Author

headius commented May 24, 2021

#6621 turned out to be tricky to do across the board, since it depends heavily on having a Lookup object originating within the class where the method handle came from. I am punting additional work here to 9.4 since we can do a larger rework of how jitted methods get bound.

@headius headius modified the milestones: JRuby 9.3.0.0, JRuby 9.4.0.0 May 24, 2021
@headius
Copy link
Member Author

headius commented Nov 16, 2022

Latest results seem to show that 9k is doing just fine on stack size compared to 1.7. In the intervening years, we have reduced the complexity of IR interpretation, improved the JIT, added more specialized call paths to avoid cascading overloads, and many other improvements. In addition, the JIT code we produce has gotten better about keeping stack usage low.

Here's JRuby 9,4, 9.3, and 1.7 in JIT and interpreted modes. The 9.x releases are as good or better than the 1.7 release.

[] ~/work/jruby $ jruby -v -e 'def foo(n); begin; foo(n+1); rescue Object; n; end; end; 5.times { puts foo(1) }'
jruby 9.4.0.0-SNAPSHOT (3.1.0) 2022-11-16 1e9a159ab0 OpenJDK 64-Bit Server VM 11.0.16.1+1 on 11.0.16.1+1 +jit [x86_64-linux]
2239
2888
2888
3881
3881

[] ~/work/jruby $ rvm jruby-9.3.8.0 do jruby -v -e 'def foo(n); begin; foo(n+1); rescue Object; n; end; end; 5.times { puts foo(1) }'
jruby 9.3.8.0 (2.6.8) 2022-09-13 98d69c9461 OpenJDK 64-Bit Server VM 11.0.16.1+1 on 11.0.16.1+1 +jit [x86_64-linux]
2360
2957
2957
4006
4006

[] ~/work/jruby $ rvm jruby-1.7.27 do jruby -v -e 'def foo(n); begin; foo(n+1); rescue Object; n; end; end; 5.times { puts foo(1) }'
jruby 1.7.27 (1.9.3p551) 2017-05-11 8cdb01a on OpenJDK 64-Bit Server VM 11.0.16.1+1 +jit [linux-amd64]
failed to load native console support: Unsupported platform: x86_64-linux
io/console on JRuby shells out to stty for most operations
2342
2436
2436
2436
2436

[] ~/work/jruby $ jruby -v -X-C -e 'def foo(n); begin; foo(n+1); rescue Object; n; end; end; 5.times { puts foo(1) }'
jruby 9.4.0.0-SNAPSHOT (3.1.0) 2022-11-16 1e9a159ab0 OpenJDK 64-Bit Server VM 11.0.16.1+1 on 11.0.16.1+1 +jit [x86_64-linux]
1685
1723
1571
1571
1672

[] ~/work/jruby $ rvm jruby-9.3.8.0 do jruby -v -X-C -e 'def foo(n); begin; foo(n+1); rescue Object; n; end; end; 5.times { puts foo(1) }'
jruby 9.3.8.0 (2.6.8) 2022-09-13 98d69c9461 OpenJDK 64-Bit Server VM 11.0.16.1+1 on 11.0.16.1+1 +jit [x86_64-linux]
1791
1825
1654
1654
1655

[] ~/work/jruby $ rvm jruby-1.7.27 do jruby -v -X-C -e 'def foo(n); begin; foo(n+1); rescue Object; n; end; end; 5.times { puts foo(1) }'
jruby 1.7.27 (1.9.3p551) 2017-05-11 8cdb01a on OpenJDK 64-Bit Server VM 11.0.16.1+1 +jit [linux-amd64]
failed to load native console support: Unsupported platform: x86_64-linux
io/console on JRuby shells out to stty for most operations
1342
1410
1410
2560

Note also how JIT kicking in improves stack depth in both JIT and interpreted modes:

[] ~/work/jruby $ jruby -v -e 'def foo(n); begin; foo(n+1); rescue Object; n; end; end; 20.times { puts foo(1) }'
jruby 9.4.0.0-SNAPSHOT (3.1.0) 2022-11-16 1e9a159ab0 OpenJDK 64-Bit Server VM 11.0.16.1+1 on 11.0.16.1+1 +jit [x86_64-linux]
2187
2888
2888
3881
3881
3881
3881
3881
3881
3881
4967
4967
7305
7305
7305
7305
7305
7305
7305
7305

[] ~/work/jruby $ jruby -v -X-C -e 'def foo(n); begin; foo(n+1); rescue Object; n; end; end; 20.times { puts foo(1) }'
jruby 9.4.0.0-SNAPSHOT (3.1.0) 2022-11-16 1e9a159ab0 OpenJDK 64-Bit Server VM 11.0.16.1+1 on 11.0.16.1+1 +jit [x86_64-linux]
OpenJDK 64-Bit Server VM warning: Potentially dangerous stack overflow in ReservedStackAccess annotated method java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(I)I [1]
1701
1748
1591
1591
1591
1591
1723
2178
2178
2178
2178
2178
2178
2178
2178
2821
2821
2821
2821
2821

@headius headius closed this as completed Nov 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants