Maximum stack depth appears to have worsened in 9k #3810
With a simpler recursive method, current numbers for 9.1.3.0 and 1.7:
def foo(a)
  puts a
  foo(a + 1)
end
foo(1)
9.1.3.0 int: 1715-1740
The reduction in depth in the JIT is likely due to the way we're using call sites. In order to reduce the amount of bytecode in any one method body, the lazy construction and use of a CallSite is done in a synthetic method for every method call. This adds one frame compared to 1.7.
Here's the relevant traces for 9k and 1.7, interpreter and JIT:
9.1.3.0 int stack:
1.7.25 int stack:
Despite having a "flat" interpreter, we use more frames in the current interpreter. Many of these frames could be boiled away, perhaps at the cost of more duplication. Here's JIT:
9.1.3.0 jit stack:
1.7.25 jit stack:
You can see the
And this brings us to a bigger realization: if we run these numbers again in the same process, they improve.
9.1.3.0 jit, second run: 1951
They also improve even more if we run with invokedynamic and JIT:
9.1.3.0 jit + indy, first run: 1484
And it continues to improve as the JVM JIT picks up more code. So there are a few things we can still do to improve early stack depth in 9k.
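To see the "numbers improve on later runs in the same process" effect without reading the last printed value by hand, the recursion can be wrapped so that it counts calls and rescues the overflow. This is only a minimal sketch: the helper names `recurse`, `measure_depth`, and `measure_runs` are hypothetical and not part of JRuby, and the counting array adds a little per-frame work, so absolute depths will not match the `foo` numbers above.

```ruby
# Hypothetical helpers for illustration; not part of JRuby.

# Recurse until the stack is exhausted, counting each call in a shared array.
def recurse(counter)
  counter[0] += 1
  recurse(counter)
end

# Return how many nested calls were made before SystemStackError was raised.
def measure_depth
  counter = [0]
  begin
    recurse(counter)
  rescue SystemStackError
    # Stack exhausted; counter[0] holds the number of calls entered.
  end
  counter[0]
end

# Measure several times in the same process so JIT warm-up effects show up.
def measure_runs(runs)
  Array.new(runs) { measure_depth }
end

depths = measure_runs(5)
depths.each_with_index { |depth, i| puts "run #{i + 1}: #{depth}" }
puts "spread between runs: #{depths.max - depths.min}"
```

Running this under different JRuby versions and flags (interpreter only, JIT, JIT plus invokedynamic) should give comparable per-run figures, with the spread giving a rough sense of run-to-run variation.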
Other notes:
More work to come on this in 9.1.4.0.
Improves stack depth for jruby#3810.
Punting remaining work to 9.2. There are several small improvements in 9.1.7.0 that should help this, but we still have a lot of generic paths through code that result in deeper-than-necessary stacks.
Update since last visited. Here's numbers on current versions (bin/jruby is 9.2.1.0, which we hope to release in the next couple weeks).
Compiled stack depth still shows degradation from 9.1 to 9.2. Interpreted may or may not show any degradation; these numbers vary depending on when the JVM JIT kicks in. Both 9.1 and 9.2 are still much worse than 1.7. Now look at stack with indy:
These numbers vary up and down by as much as 400 frames, but 9.2 seems similar to 9.1. The non-indy numbers could be improved by eliminating the synthetic method wrapping every call, but the logic for instantiating and invoking the call site would then have to live in every piece of emitted JVM bytecode, making more methods too big to load or optimize. An all-indy approach with lighter-weight binding may be an option as well, since it should introduce no more than a couple of frames, all of which would be force-inlined some time after the first call.
A comparison of the stack frames for 9.2 and 1.7 shows that 9.2 appears to have the same count of stack frames:
However, this is misleading, since every call from CompiledIRMethod back into Ruby code is done via method handles. This may be an additional hidden consumer of stack space.
Latest numbers on Java 8. Numbers on higher JDKs are similar or slightly lower.
Comparing with 1.7.27:
The compiled dispatch appears to have degraded more significantly, which may be due to the use of synthetic call site methods in the bytecode. This reduces the bytecode size of jitted Ruby methods, but does add a frame to every dynamic call (on top of the frames added by the CallSite classes and the JavaMethod overloads). We may need to duplicate some logic to avoid excessively deepening the call stack in the name of code reuse. Invokedynamic may help here once common call stacks get inlined, but until then it will aggravate the situation by requiring extra frames for LambdaForm until they coalesce.
This gets a bit misleading when looking at the actual traces, because LambdaForm frames are normally hidden from stack traces to avoid bloating them up. 1.7.27 jit mode:
JRuby 9.3 jit mode:
At a glance these look like they consume the same number of frames, but if we enable the extra LambdaForm frames it tells a different story:
In order to reduce the number of classes loaded by the JIT, JRuby 9.x began using method handles rather than generated stubs to bind compiled Ruby methods into a DynamicMethod object (in this case a CompiledIRMethod). Because the handle is called directly, it never inlines and so never eliminates these lambda frames, increasing stack depth. We might be able to eliminate one frame by wrapping the handles in a LambdaMetafactory interface implementation, which would give them a solid root for inlining.
Using LambdaMetafactory does indeed seem to help the stack depth a ton:
There's a good chance this will be faster as well. Note that invokedynamic eliminates most of these frames fairly quickly, since it can bind directly back to the
Stack depth maxes out at 1458. If we run it a few times, though, I have seen stack depths as high as 35k, and the frames above melt away into the inlined code.
Our favorite method-dispatch benchmark shows a significant improvement dispatching through the lambda. fib(35), jit, no indy:
BEFORE
AFTER
This change modifies CompiledIRMethod to use a lambda-implemented interface rather than calling through MethodHandle when the target method handles are provided with a Lookup that can see the compiled method class. This improves stack utilization and call performance. This mechanism currently only works for "def" that occurs in compiled bytecode, since the compiled class is not visible to the LambdaMetafactory otherwise. Jitted methods will need to provide their Lookup when being bound into a MixedModeIRMethod in order to duplicate this logic. See jruby#3810 for the initial motivation for this work.
#6621 turned out to be tricky to do across the board, since it depends heavily on having a Lookup object originating within the class where the method handle came from. I am punting additional work here to 9.4 since we can do a larger rework of how jitted methods get bound.
Latest results seem to show that 9k is doing just fine on stack size compared to 1.7. In the intervening years, we have reduced the complexity of IR interpretation, improved the JIT, added more specialized call paths to avoid cascading overloads, and made many other improvements. In addition, the JIT code we produce has gotten better about keeping stack usage low. Here's JRuby 9.4, 9.3, and 1.7 in JIT and interpreted modes. The 9.x releases are as good as or better than the 1.7 release.
Note also how JIT kicking in improves stack depth in both JIT and interpreted modes:
Tested against JRuby 9.1, but JRuby 9.0.x likely also suffers from this.
I ran some numbers to test stack depth for #3741. We do poorly in JRuby 9.1 and should try to improve this.
All on Java 8u60, with the following code:
Here's interpreter. In 9.1 it is only the simple interpreter.
And with normal jit settings:
This difference in 9.1 is a bit of a worry. The interpreter seems about the same (I'd hope it would be better, even though this is a simple AST). The compiler is significantly worse. A general reason for this may be additional frames around our IR interpreted and compiled paths that could be collapsed. The compiler may suffer from invokedynamic "lambda forms" bloating the stack for early calls.