yield is slower than MRI in 9k #2442

Here's the bench and results: https://gist.github.com/headius/a302a0dd772d55f8213c

It seems the new runtime's logic for yield is slower than what we had in 1.7, perhaps because we've tried to unify all block invocation paths a bit better. There's probably a lot of low-hanging fruit in both yield and Proc#call paths, both in the interpreter and in the JIT.
Analyzing the results a bit more... 9k performs nearly identically for both yield and call, which leads me to think the same thing is slowing both of them down. call is a bit slower than yield in 1.7, but much of that logic has been cleaned up, so I suspect a fix for one will be a fix for both in 9k. I started by running a sampled JVM-level profile. Interpreter results were not helpful, since most of the overhead showed up in the main interpreter loop (as can happen with sampling). Compiled results, however, were more interesting:
The only way there could be that many samples inside the CompiledIRBlockBody constructor is if we're creating it anew for every instantiation of a block. And that's not surprising, since I never implemented compiled block body caching in either JIT mode. Let's start with that.
Heh:
Looks like that was the lion's share of the performance hit:
However yield performance still isn't back up to where it was in 1.7 (40-50% fewer ips). The new sampled profile doesn't give us much to go on, though:
In general, the bodies of jitted Ruby code are not useful in a sampled profile, because they inline a lot of irreducible overhead and do all the routing of data from call to call. The only interesting line here is CompiledIRBlockBody.commonYieldPath, code run for all calls to JIT-compiled blocks. Unfortunately, in a sampled profile this may just be normal runtime overhead, so we need to go deeper. The next thing to try will be an allocation profile.
FWIW, LambdaForm in the previous profile is probably not useful because it's irreducible overhead of the JVM's invokedynamic plumbing. Might be worth pointing out to the Hotspot devs, but they already know LambdaForms show up in our profiles a lot.
For allocation profiling, I use the built-in "hprof" extension for the JVM, using the following command line:
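Representative form (the exact options may have differed; heap=sites requests allocation-site tracking, and the depth value here is illustrative):

```
jruby -J-agentlib:hprof=heap=sites,depth=10 bench2.rb
```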
Depth here indicates how many levels of stack should be used to uniquely identify allocation traces. Running this against the full benchmark (not just the yield portion, since we'd like to improve Proc#call as well) yields the following top allocations:
From this last entry down, these are mostly one-time allocations to support booting JRuby and making the script runnable. As for the top entries, we'll go down the list... The Fixnum allocations here are likely from benchmark/ips itself, counting iterations and such. The allocation trace also shows we're using the specialized Fixnum operator logic.
The next two items represent dynamic scopes, where we store variables that must be visible on the heap (i.e. visible to multiple lexical bodies of code). Dummy scopes are used when a heap reference must be maintained to a lexically-higher scope but none of the referrer's own variables need to be on the heap (see the short Ruby illustration at the end of this comment). The dummy scopes being allocated show something very interesting:
Proc construction is triggering the allocation of a dummy scope at initialization time, rather than on-demand. This should only affect Proc#call performance, since normally a non-"reified" block never needs to become a full Proc. We can confirm that by running just the yield benchmark under allocation profiling.
In this profile, Binding is likely irreducible, since we have to allocate a binding for the first call to every block. However, if we're also allocating one for calls n+1 and beyond, we may be able to reduce that overhead by reusing the binding. That is a bit trickier and requires some cooperation between the interpreter and the whole IR runtime, so we'll skip that line for now. The dummy scopes are essentially unchanged. This is worth looking into.
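Here's the promised illustration of when a dummy scope arises (my own example, not from the issue):

```ruby
def outer
  captured = 1
  # This block reads `captured` from the enclosing scope, so it must keep a
  # heap reference up the scope chain alive. But the block defines no
  # heap-visible variables of its own, so all it needs is a dummy scope
  # whose sole job is to point at the parent scope.
  1.times { captured + 1 }
end
```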
In the above profile, RubyProc is showing up as being allocated about as many times as Binding. Since we're measuring yield for that run, this indicates we're creating a Proc object even in cases that should never need a full Proc. The trace for these allocations shows them happening at "unknown line" in our original script, which likely means they come from either the literal blocks in the benchmark code or the methods that receive them. A deeper allocation trace, with depth=20, shows where in the benchmark this proc creation is triggered.
Line 26 of the modified benchmark, in context:
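Roughly, the shape was as follows (a reconstruction; the names are my own, and the real gist differs in detail):

```ruby
def call_yield
  yield
end

x.report("yield") do |n|
  i = 0
  while i < n
    call_yield { }   # the line in question: a literal block at the call site
    i += 1
  end
end
```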
So the literal block is triggering a full RubyProc object every time. This is excessive, so we'll explore it.
The LambdaForm lines in between our script line (bench2.rb:26) indicate that the Proc object is not being created directly by the script body, but by an invokedynamic call it makes. We need to look at the bytecode. We can dump the bytecode for a snippet of Ruby code by passing --bytecode to JRuby. Here's a command line and the relevant bits of the output for a simple literal block:
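The command takes roughly this form (the snippet is a stand-in of mine; the dump output is not reproduced here):

```
jruby --bytecode -e 'def foo; yield; end; foo { }'
```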
Only one invokedynamic shows up in this small snippet of code.
The last line here is our yield, and it appears after the Proc construction. The yieldSpecific method just unwraps any RubyProc object it receives, so creating that extra layer is obviously unnecessary here. I'll look into that next.
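The unwrap amounts to something like this (a simplified sketch, not the actual JRuby source):

```java
import org.jruby.RubyProc;
import org.jruby.runtime.Block;
import org.jruby.runtime.ThreadContext;
import org.jruby.runtime.builtin.IRubyObject;

// Sketch: the yield path only needs the underlying Block, so a RubyProc
// created solely to be handed to it is a wasted allocation.
class YieldSketch {
    static IRubyObject yieldTo(ThreadContext context, IRubyObject blockArg) {
        Block block = (blockArg instanceof RubyProc)
                ? ((RubyProc) blockArg).getBlock()  // unwrap back to the raw Block
                : Block.NULL_BLOCK;
        return block.yieldSpecific(context);
    }
}
```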
In order to see why we're creating this proc, we need to go up to the next level: IR instructions. We can do this using JRuby's IR printing support.
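One way to dump the IR, assuming the ir.print property (an assumption on my part):

```
jruby -Xir.print=true -e 'def foo; yield; end; foo { }'
```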
The logic for ReceiveClosureInstr in the interpreter is as follows:

```java
case RECV_CLOSURE:
    result = IRRuntimeHelpers.newProc(context.runtime, block);
    setResult(temp, currDynScope, instr.getResult(), result);
    return;
```

So this instruction is currently expected to create a Proc object for every passed-in block, literal or otherwise. In order to eliminate this extra overhead, we must make sure that all paths that receive a block can consume it directly, without reifying it into a Proc unless one is actually needed.
After some discussion with @subbuss and @enebo, we realized that in order to eliminate this proc creation we need to make the IR understand the implicit block argument passed into every body of code. This necessitated creating two new instructions: LoadImplicitClosure and LoadFrameClosure.
By making the IR aware of this implicit block variable (either passed on the JVM stack or stored in our artificial frame), we never have to instantiate a Proc unless you're actually going to need a Proc. That's a good thing.
Previously, because we only had one yield instruction and one way to prepare any incoming block, the lowest common denominator was to always turn passed blocks into a Proc object before sending them on to a yield. This is more overhead than in 1.7, and resulted in our yield performance being worse than it should be. See #2442.

In order to fix this I have made the following changes:

* ReceiveClosure, the instruction which reifies an implicit block into a Proc, has been renamed to ReifyClosure. It is now used only for block args, as in `def foo(&b)`.
* LoadImplicitClosure and LoadFrameClosure are added to retrieve a block from either the JVM stack or our frame stack. Method bodies and singleton class bodies use the stack, and all other scopes use the frame. This allows us to load the block used for yielding in an efficient way regardless of which scope we are in.
* IRScope temporarily aggregates two variables: one for the "yield" closure, the one used for yielding (which may come from either the JVM stack or our frame); and one for the "implicit" closure, which is the one passed directly to this scope on the JVM stack. This additional closure is needed to represent two-block scopes, like a block that receives a block (see the sketch after this comment).
* ProcessModuleBody now has an operand to load a block passed to the module/class body. For normal module and class bodies this is always NullBlock (a new operand for Block.NULL_BLOCK), but singleton class bodies (`class << obj`) inherit the surrounding scope's yield closure (not the implicit closure, since that would capture the nearest block's block rather than the outer method's block).

This results in a substantial reduction of allocation and a decent performance improvement:

```
BEFORE
          yield      6.901M (±19.0%) i/s -     32.351M
     block.call      6.509M (±10.6%) i/s -     31.999M
method dispatch     18.988M (±16.8%) i/s -     90.797M

AFTER
          yield      8.002M (±22.4%) i/s -     36.745M
     block.call      6.399M (±12.8%) i/s -     31.314M
method dispatch     18.939M (±10.5%) i/s -     92.585M
```
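To illustrate the two-block case (my own example, not from the commit): inside a block that itself receives a block, the implicit closure and the yield closure are different blocks.

```ruby
def outer                        # invoked as: outer { :method_block }
  runner = proc do |&cb|
    cb.call   # invokes the proc's own (implicit) closure
    yield     # invokes the enclosing method's (frame) closure
  end
  runner.call { :proc_block }    # the proc receives its own block here
end

outer { :method_block }
```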
Ok, after a good day of working on the new yield logic, I think we're calling it done for now. The new logic is largely as described above:
All places that receive an incoming block and deal with it now use this logic, which means we're no longer creating a Proc every time you pass a block to a method, and we're only creating one for &b if you actually consume it. Good stuff. The performance gains from these changes aren't as big as the ones we got from caching the block body, but they're still good:

BEFORE
AFTER
Now we're getting back in the neighborhood of JRuby 1.7 again. A final allocation trace reveals that there's not much more we can eliminate:
Of these only a couple are worth further investigation:
We'll look into BacktraceElement first.
The changes in 7e01a86 eliminate the need for us to clone a BacktraceElement object whenever we capture a binding. Instead of storing the object (which must be cloned so its moment-in-time line numbers are preserved), we just store its contents as normal fields on Binding (sketched below). That makes Binding bigger, but we're allocating fewer objects. This gave a small improvement, from roughly 8M to 8.8M i/s on my machine. I continued on to the RubyString being allocated for every Time instance, and it turns out to be totally unnecessary: it's just the "TZ" key used to look up the current system timezone. We can eliminate this by caching the string in an internal variable of the Time object, avoiding the allocation altogether.
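Schematically, that Binding change looks like this (a simplified sketch; the field names are illustrative, not the actual JRuby source):

```java
// Instead of Binding holding a cloned BacktraceElement to freeze its
// moment-in-time position, the same data lives directly on Binding as
// plain fields: a slightly bigger object, but no clone per capture.
class Binding {
    // was: BacktraceElement backtrace;  // required one clone per capture
    private final String method;    // frame's method name at capture time
    private final String filename;  // frame's file at capture time
    private final int line;         // frame's line at capture time

    Binding(String method, String filename, int line) {
        this.method = method;
        this.filename = filename;
        this.line = line;
    }
}
```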
Finally, we have fixed the last allocation item we easily can, with fixes to RubyTime in 2dd881b to avoid this wasted "TZ" string object. I did some analysis on the NoVarsDynamicScope, and I believe it is being allocated for the body of the benchmark, since there are nested scopes that need to be able to access each other. We will be doing analysis in the future to remove more of these temporary/useless scopes, but I'm going to consider that out of scope for this bug. With the final numbers, I think we can call this bug fixed. We're definitely a lot faster than MRI now, and additional iteration on yield/call perf will be coming with other optimizations post-release.