
yield is slower than MRI in 9k #2442

Closed
headius opened this issue Jan 8, 2015 · 12 comments

@headius
Member

headius commented Jan 8, 2015

Here's the bench and results: https://gist.github.com/headius/a302a0dd772d55f8213c

It seems the new runtime's logic for yield is slower than what we had in 1.7, perhaps because we've tried to unify all block invocation paths a bit better. There's probably a lot of low-hanging fruit in both yield and Proc#call paths, both in the interpreter and in the JIT.
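For context, the three dispatch paths under comparison look roughly like this (a hedged sketch; the actual benchmark is in the gist linked above, and these method names are made up for illustration):

```ruby
# Hypothetical reconstruction of the three paths being measured;
# the real benchmark lives in the gist linked above.
def via_yield
  yield            # implicit block, dispatched through the yield path
end

def via_call(&blk)
  blk.call         # reified Proc, dispatched through Proc#call
end

def plain_dispatch
  :ok              # ordinary method dispatch, as a baseline
end

via_yield { :ok }        # => :ok
via_call { :ok }         # => :ok
plain_dispatch           # => :ok
```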

@headius
Member Author

headius commented Jan 8, 2015

Analyzing the results a bit more...

9k performs nearly identically for both yield and call, which leads me to think it's the same thing slowing both of them down. call is a bit slower than yield in 1.7 but much of that logic has been cleaned up. I suspect a fix for one will be a fix for both in 9k.

I started by running a sampled JVM-level profile (with --sample) of just the yield benchmark in both interpreted and jitted+indy modes.

Interpreter results were not helpful, due to most of the overhead showing up in the main interpreter loop (as can happen with sampling).

Compiled results, however, were more interesting:

 28.9%   476  +     2    org.jruby.runtime.CompiledIRBlockBody.<init>
 22.9%   378  +     0    java.lang.invoke.LambdaForm$DMH.invokeStatic_L7_L
  5.4%    83  +     7    Users.headius.projects.jruby.lib.ruby.gems.shared.gems.benchmark_minus_ips_minus_2_dot_1_dot_0.lib.benchmark.ips.job.__script__
  4.2%    68  +     2    org.jruby.runtime.CompiledIRBlockBody.commonYieldPath
  3.5%    56  +     2    org.jruby.RubyTime$1.allocate

The only way there could be that many samples inside CompiledIRBlockBody.<init> is if we're creating it anew for every instantiation of a block. And that's not surprising, since I never implemented compiled block body caching in either JIT mode. Let's start with that.
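The fix amounts to standard memoization: build the expensive body object once per call site and reuse it. A sketch of the pattern in Ruby (the real cache lives in JRuby's Java JIT code; CallSite here is an illustrative stand-in):

```ruby
# Hypothetical sketch: construct the (expensive) block body once per
# call site and reuse it, instead of rebuilding it on every block creation.
class CallSite
  attr_reader :builds

  def initialize
    @body = nil
    @builds = 0
  end

  def block_body
    @body ||= begin
      @builds += 1     # count how many times we actually construct
      Object.new       # stand-in for the compiled block body
    end
  end
end

site = CallSite.new
1000.times { site.block_body }
site.builds  # => 1, not 1000
```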

@headius
Member Author

headius commented Jan 8, 2015

Heh:

{ // FIXME: block body should be cached

@headius
Member Author

headius commented Jan 8, 2015

Looks like that was the lion's share of the performance hit:

$ jruby -Xcompile.invokedynamic=true bench_yield.rb 
Calculating -------------------------------------
               yield   111.603k i/100ms
          block.call   137.092k i/100ms
     method dispatch   165.031k i/100ms
-------------------------------------------------
               yield      6.484M (±11.8%) i/s -     31.807M
          block.call      6.309M (±11.5%) i/s -     31.120M
     method dispatch     17.259M (±16.4%) i/s -     83.506M

However yield performance still isn't back up to where it was in 1.7 (40-50% fewer ips). The new sampled profile doesn't give us much to go on, though:

28.4%   550  +     0    java.lang.invoke.LambdaForm$DMH.invokeStatic_L7_L
 11.3%   213  +     5    org.jruby.runtime.CompiledIRBlockBody.commonYieldPath
 10.4%   194  +     8    bench_yield.\=bench_yield\,rb_CLOSURE_1_CLOSURE_2__bench_yield\,rb_CLOSURE_1_1
 10.4%   196  +     5    bench_yield.\=bench_yield\,rb_CLOSURE_1_CLOSURE_4__bench_yield\,rb_CLOSURE_1_3
  7.3%   138  +     4    Users.headius.projects.jruby.lib.ruby.gems.shared.gems.benchmark_minus_ips_minus_2_dot_1_dot_0.lib.benchmark.ips.job.__script__

In general the bodies of the jitted Ruby code are not useful in a sampled profile because they'll inline a lot of irreducible overhead and do all the routing of data from call to call. The only interesting line here is CompiledIRBlockBody.commonYieldPath, code run for all calls to JIT-compiled blocks. Unfortunately in a sampled profile, this may just be normal runtime overhead, so we need to go deeper.

Next thing to try will be an allocation profile.

@headius
Member Author

headius commented Jan 8, 2015

FWIW LambdaForm in the previous profile is probably not useful because it's irreducible overhead of the JVM's invokedynamic plumbing. Might be worth pointing out to Hotspot devs, but they already know LambdaForms show up in our profiles a lot.

@headius
Member Author

headius commented Jan 8, 2015

For allocation profiling, I use the builtin "hprof" extension for the JVM, using the following command line:

jruby -J-Xrunhprof:depth=10 ...

Depth here indicates how many levels of stack should be used to uniquely identify allocation traces.

Running this against the full benchmark (not just the yield portion, since we'd like to improve Proc#call as well), yields the following top allocations:

          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
    1 24.26% 24.26%  27530800 688270  27530800 688270 409098 org.jruby.RubyFixnum
    2  5.55% 29.81%   6300512 196891   6300832 196901 368913 org.jruby.runtime.scope.DummyDynamicScope
    3  5.54% 35.35%   6290368 196574   6290368 196574 407985 org.jruby.runtime.scope.NoVarsDynamicScope
    4  4.22% 39.57%   4788144 99753   4788144 99753 408463 org.jruby.runtime.Binding
    5  4.11% 43.68%   4662240 97130   4662240 97130 406489 org.jruby.runtime.Binding
    6  3.56% 47.23%   4036512 84094   4036512 84094 406657 org.jruby.RubyTime
    7  3.52% 50.75%   3990120 99753   3990120 99753 408464 org.jruby.runtime.Block
    8  3.52% 54.27%   3990080 99752   3990080 99752 408700 org.jruby.RubyProc
    9  3.42% 57.69%   3885160 97129   3885160 97129 406673 org.jruby.RubyProc
   10  3.41% 61.10%   3872840 96821   3872840 96821 407986 org.jruby.runtime.Block
   11  2.96% 64.07%   3364120 84103   3364160 84104 358603 org.jruby.RubyString
   12  2.89% 66.95%   3277840 81946   3277840 81946 406822 org.jruby.RubyFixnum
   13  2.11% 69.06%   2394072 99753   2394072 99753 408462 org.jruby.runtime.backtrace.BacktraceElement
   14  2.05% 71.12%   2331120 97130   2331120 97130 406488 org.jruby.runtime.backtrace.BacktraceElement
   15  1.78% 72.90%   2018256 84094   2018256 84094 406656 org.joda.time.DateTime
   16  0.56% 73.46%    639552 12092    639552 12092 300000 char[]
   17  0.45% 73.91%    514152 9980    514152  9980 300000 java.lang.Object[]
   18  0.45% 74.36%    508896   62    508896    62 379976 byte[]

From this last entry down are mostly one-time-only allocations to support booting JRuby and making the script runnable. As for the top entries, we'll go down the list...

The Fixnum allocations here are likely from benchmark/ips itself, counting iterations and such. The allocation trace also shows we're using the specialized Fixnum operator logic for a + 1, so there's probably not much we can do to change this.

TRACE 409098:
        org.jruby.RubyBasicObject.<init>(RubyBasicObject.java:245)
        org.jruby.RubyObject.<init>(RubyObject.java:100)
        org.jruby.RubyNumeric.<init>(RubyNumeric.java:110)
        org.jruby.RubyInteger.<init>(RubyInteger.java:88)
        org.jruby.RubyFixnum.<init>(RubyFixnum.java:119)
        org.jruby.RubyFixnum.newFixnum(RubyFixnum.java:211)
        org.jruby.RubyFixnum.op_plus_one(RubyFixnum.java:402)
        org.jruby.runtime.invokedynamic.MathLinker.fixnum_op_plus_one(MathLinker.java:273)
        java.lang.invoke.LambdaForm$DMH.325040804.invokeStatic_L3_L(LambdaForm$DMH:Unknown line)
        java.lang.invoke.LambdaForm$MH.2055281021.delegate(LambdaForm$MH:Unknown line)

The next two items represent dynamic scopes where we store variables that must be visible on the heap (i.e. visible to multiple lexical bodies of code). Dummy scopes are used when a heap reference must be maintained to a lexically-higher scope but none of the referrer's variables need to be on the heap. The dummy scopes being allocated show something very interesting:

TRACE 368913:
        org.jruby.runtime.DynamicScope.<init>(DynamicScope.java:57)
        org.jruby.runtime.scope.NoVarsDynamicScope.<init>(NoVarsDynamicScope.java:16)
        org.jruby.runtime.scope.DummyDynamicScope.<init>(DummyDynamicScope.java:19)
        org.jruby.runtime.DynamicScope.newDummyScope(DynamicScope.java:91)
        org.jruby.runtime.Binding.getDummyScope(Binding.java:162)
        org.jruby.RubyProc.setup(RubyProc.java:182)
        org.jruby.RubyProc.newProc(RubyProc.java:110)
        org.jruby.RubyProc.newProc(RubyProc.java:105)
        org.jruby.Ruby.newProc(Ruby.java:3398)
        org.jruby.ir.runtime.IRRuntimeHelpers.newProc(IRRuntimeHelpers.java:435)

Proc construction is triggering the allocation of a dummy scope at initialization time, rather than on-demand. This should only affect Proc#call performance, since normally a non-"reified" block never needs to become a full Proc. We can confirm that by running just the yield benchmark under allocation profiling.
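The eager-vs-on-demand difference can be sketched in Ruby (Binding2, dummy_scope, and allocated? are illustrative names, not JRuby's actual API):

```ruby
# Hypothetical sketch of the eager-vs-lazy allocation trade-off.
class Binding2   # illustrative stand-in, not JRuby's Binding
  def initialize
    @dummy_scope = nil        # lazy: nothing allocated at setup time
  end

  def dummy_scope
    @dummy_scope ||= Object.new   # allocate only on first use
  end

  def allocated?
    !@dummy_scope.nil?
  end
end

b = Binding2.new
b.allocated?    # => false; a yield-only path never pays for the scope
b.dummy_scope
b.allocated?    # => true
```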

          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
    1  8.35%  8.35%   4530672 94389   4530672 94389 406044 org.jruby.runtime.Binding
    2  6.96% 15.31%   3775520 94388   3775520 94388 406228 org.jruby.RubyProc
    3  6.93% 22.24%   3763240 94081   3763240 94081 407541 org.jruby.runtime.Block
    4  5.57% 27.81%   3020640 94395   3020960 94405 368898 org.jruby.runtime.scope.DummyDynamicScope
    5  5.55% 33.35%   3010592 94081   3010592 94081 407540 org.jruby.runtime.scope.NoVarsDynamicScope
    6  4.17% 37.53%   2265336 94389   2265336 94389 406043 org.jruby.runtime.backtrace.BacktraceElement
    7  3.82% 41.35%   2072400 51810   2072400 51810 408020 org.jruby.RubyFixnum
    8  1.31% 42.66%    713136 14857    713136 14857 406212 org.jruby.RubyTime
    9  1.17% 43.84%    636624 12064    636624 12064 300000 char[]
   10  1.10% 44.93%    594640 14866    594680 14867 358598 org.jruby.RubyString
   11  1.05% 45.98%    570960 14274    570960 14274 406377 org.jruby.RubyFixnum

In this profile, Binding is likely irreducible, since we will have to allocate a binding for every first call to the block. However, if we're allocating one for the n+1 calls too, we may be able to reduce that overhead by reusing the binding. This is a bit trickier and requires some cooperation with the interpreter and the whole IR runtime, so we'll skip that line for now.

The dummy scopes are essentially unchanged. This is worth looking into.

@headius
Member Author

headius commented Jan 8, 2015

In the above profile, RubyProc is showing up as being allocated about as many times as Binding. Since we're measuring yield for that run, it indicates we're creating a Proc object even for cases that should not need to become a full Proc. The trace for these allocations shows them happening at "unknown line" in our original script, which likely means they're from either the literal blocks in the benchmark code or the methods that receive them.
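At the Ruby level, the distinction is simple: a method that only yields never semantically needs a Proc object, while a method that captures the block with &b does:

```ruby
def only_yields
  yield 1          # no Proc object is semantically required here
end

def captures(&blk)
  blk              # &blk forces reification into a real Proc
end

only_yields { |x| x + 1 }        # => 2
captures { }.is_a?(Proc)         # => true
```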

A deeper allocation trace, with depth=20, shows where in the benchmark triggers this proc creation.

TRACE 591559:
        org.jruby.RubyBasicObject.<init>(RubyBasicObject.java:236)
        org.jruby.RubyObject.<init>(RubyObject.java:93)
        org.jruby.RubyProc.<init>(RubyProc.java:67)
        org.jruby.RubyProc.<init>(RubyProc.java:73)
        org.jruby.RubyProc.newProc(RubyProc.java:109)
        org.jruby.RubyProc.newProc(RubyProc.java:105)
        org.jruby.Ruby.newProc(Ruby.java:3398)
        org.jruby.ir.runtime.IRRuntimeHelpers.newProc(IRRuntimeHelpers.java:435)
        bench2.call_5(bench2.rb:Unknown line)
        java.lang.invoke.LambdaForm$DMH.925858445.invokeStatic_L5_L(LambdaForm$DMH:Unknown line)
        java.lang.invoke.LambdaForm$BMH.361571968.reinvoke(LambdaForm$BMH:Unknown line)
        java.lang.invoke.LambdaForm$MH.1159190947.delegate(LambdaForm$MH:Unknown line)
        java.lang.invoke.LambdaForm$MH.921760190.guard(LambdaForm$MH:Unknown line)
        java.lang.invoke.LambdaForm$MH.1159190947.delegate(LambdaForm$MH:Unknown line)
        java.lang.invoke.LambdaForm$MH.921760190.guard(LambdaForm$MH:Unknown line)
        java.lang.invoke.LambdaForm$MH.1663619914.linkToCallSite(LambdaForm$MH:Unknown line)
        bench2.\=bench2\,rb_CLOSURE_1_CLOSURE_2__bench2\,rb_CLOSURE_1_1(bench2.rb:26)
        java.lang.invoke.LambdaForm$DMH.804564176.invokeStatic_L7_L(LambdaForm$DMH:Unknown line)
        java.lang.invoke.LambdaForm$MH.454325163.invokeExact_MT(LambdaForm$MH:Unknown line)
        org.jruby.runtime.CompiledIRBlockBody.commonYieldPath(CompiledIRBlockBody.java:66)

Line 26 of the modified benchmark, in context:

    24  Benchmark.ips do |ips|
    25    ips.report("yield") do
    26      a.() {}
    27    end
    28  end

So the literal block is triggering a full RubyProc object every time. This is excessive, so we'll explore it.

@headius
Member Author

headius commented Jan 8, 2015

The LambdaForm lines in between our script line (bench2.rb:26) indicate that the Proc object is not being created directly by the script body, but by an invokedynamic call it makes. We need to look at bytecode.

We can dump bytecode for a snippet of Ruby code by passing --bytecode to JRuby. Here's a command line and the relevant bits of the output for a simple literal block:

$ jruby -Xcompile.invokedynamic=true --bytecode -e "x { }"
...
    INVOKEVIRTUAL org/jruby/runtime/ThreadContext.currentBinding (Lorg/jruby/runtime/builtin/IRubyObject;Lorg/jruby/runtime/DynamicScope;)Lorg/jruby/runtime/Binding;
    INVOKESPECIAL org/jruby/runtime/Block.<init> (Lorg/jruby/runtime/BlockBody;Lorg/jruby/runtime/Binding;)V
    INVOKESTATIC org/jruby/ir/runtime/IRRuntimeHelpers.getBlockFromObject (Lorg/jruby/runtime/ThreadContext;Ljava/lang/Object;)Lorg/jruby/runtime/Block;
    INVOKEDYNAMIC invokeSelf:x(Lorg/jruby/runtime/ThreadContext;Lorg/jruby/runtime/builtin/IRubyObject;Lorg/jruby/runtime/builtin/IRubyObject;Lorg/jruby/runtime/Block;)Lorg/jruby/runtime/builtin/IRubyObject; [
      // handle kind 0x6 : INVOKESTATIC
      org/jruby/ir/targets/SelfInvokeSite.bootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite;
      // arguments: none
    ]
...

The only invokedynamic that shows up in this small snippet of code is the call to x. That call leads to a "call_5" method in the same script, which in this case is the "call" method the benchmark defines to yield back to the block. The missing line numbers here indicate that this is likely happening before we reach any logic for the actual yield...preamble code for the block entering the method. Another bytecode dump shows the contents of a simple yielding method.

$ jruby -Xcompile.invokedynamic=true --bytecode -e "def x; yield; end"
...
  public static x_0(Lorg/jruby/runtime/ThreadContext;Lorg/jruby/parser/StaticScope;Lorg/jruby/runtime/builtin/IRubyObject;Lorg/jruby/runtime/Block;Lorg/jruby/RubyModule;)Lorg/jruby/runtime/builtin/IRubyObject;
   L0
    NOP
    ALOAD 0
    INVOKEDYNAMIC nil(Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/runtime/builtin/IRubyObject; [
      // handle kind 0x6 : INVOKESTATIC
      org/jruby/ir/targets/Bootstrap.contextValue(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite;
      // arguments: none
    ]
    ASTORE 6
    ALOAD 6
    ASTORE 7
    NOP
    ALOAD 0
    INVOKEDYNAMIC runtime(Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/Ruby; [
      // handle kind 0x6 : INVOKESTATIC
      org/jruby/ir/targets/Bootstrap.contextValue(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite;
      // arguments: none
    ]
    ALOAD 3
    INVOKESTATIC org/jruby/ir/runtime/IRRuntimeHelpers.newProc (Lorg/jruby/Ruby;Lorg/jruby/runtime/Block;)Lorg/jruby/runtime/builtin/IRubyObject;
    ASTORE 7
    ALOAD 0
    INVOKEDYNAMIC checkpoint(Lorg/jruby/runtime/ThreadContext;)V [
      // handle kind 0x6 : INVOKESTATIC
      org/jruby/runtime/invokedynamic/InvokeDynamicSupport.checkpointBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite;
      // arguments: none
    ]
   L1
    LINENUMBER 1 L1
    ALOAD 0
    ALOAD 7
    INVOKESTATIC org/jruby/ir/runtime/IRRuntimeHelpers.yieldSpecific (Lorg/jruby/runtime/ThreadContext;Ljava/lang/Object;)Lorg/jruby/runtime/builtin/IRubyObject;
...

The last line here is our yield, and it appears after LINENUMBER 1 L1, so we know it's part of the actual Ruby code body. Our newProc call shows up here, loading the raw block from a local variable at index 3 (the 4th argument to the method) and turning it into a proc before using it to yield.

The yieldSpecific method just unwraps any RubyProc object it receives, so creating that extra layer is obviously unnecessary here. I'll look into that next.

@headius
Member Author

headius commented Jan 8, 2015

In order to see why we're creating this proc, we need to go to the next level up: IR instructions. We can do this using -Xir.compiler.debug=true which will dump out reams of output for all optimization stages of all Ruby code it loads. For a simple snippet of code, our code should be displayed last.

$ jruby -Xir.compiler.debug=true -e "def x; yield; end"
...
Linearized instructions for JIT:
  0 %v_0 = nil
  1 %t_%block_1 = copy(%v_0)
  2 check_arity req: 0, opt: 0, *r: -1, kw: false, **r: -1
  3 %t_%block_1 = recv_closure
  4 thread_poll
  5 line_num(0)
  6 %v_0 = yield(%t_%block_1, UNWRAP(%undefined))
  7 return(%v_0)

The thread_poll instruction corresponds to the checkpoint invokedynamic call, and the line_num instruction ends up as the LINENUMBER directive in the bytecode. So the code immediately after turns into yieldSpecific, and the code immediately preceding causes our RubyProc construction: recv_closure.

The logic for ReceiveClosureInstr (recv_closure) lives in the interpreter, since it's a heavily-hit runtime instruction. Interpreter.receiveArg leads us to this code:

        case RECV_CLOSURE:
            result = IRRuntimeHelpers.newProc(context.runtime, block);
            setResult(temp, currDynScope, instr.getResult(), result);
            return;

So this instruction is currently expected to create a Proc object for every passed-in block, literal or otherwise. In order to eliminate this extra overhead, we must make sure that all recv_closure consumers are happy with a plain block instead of a proc.
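The shape of the fix can be sketched in Ruby (the real change lives in JRuby's Java interpreter; these names are illustrative): pass the raw block through untouched, and reify only at the one consumer that actually needs a Proc.

```ruby
# Hypothetical sketch of the interpreter change: reification moves from
# the universal receive path to the one instruction that needs it.
Block = Struct.new(:body)          # stand-in for a raw runtime block

def load_implicit_closure(block)
  block                            # pass the raw block through; no allocation
end

def reify_closure(block)
  proc { block.body }              # only &b-style consumers pay for a Proc
end

raw = Block.new(:payload)
load_implicit_closure(raw).equal?(raw)   # => true; same object, no Proc
reify_closure(raw).is_a?(Proc)           # => true; only when required
```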

@headius
Member Author

headius commented Jan 8, 2015

After some discussion with @subbuss and @enebo, we realized that in order to eliminate this proc creation we need to make the IR understand the implicit block argument passed into every body of code. This necessitated creating two new instructions:

  • LoadImplicitClosureInstr, for loading the block passed along the JVM stack to method bodies.
  • LoadFrameClosureInstr, for loading the block stored on-heap in our frame stack (for closure and metaclass bodies).

The recv_closure instruction is then only needed for reifying an implicit block into a Proc object, as in def foo(&b), so we'll rename it to reify_closure and restrict its use to that case.

By making the IR know about this implicit block variable (either passed on the JVM stack or stored in our artificial frame), we never have to instantiate a Proc unless one is actually needed. That's a good thing.
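The two-closure distinction matters for nested scopes. In Ruby terms, a yield inside a block must resolve to the enclosing method's block (the frame closure), not the block's own implicit closure:

```ruby
def m(items)
  # The each-block below has its own implicit closure slot, but the
  # yield inside it must see m's block (the "frame"/yield closure).
  items.each { |x| yield x }
end

out = []
m([1, 2, 3]) { |x| out << x * 10 }
out  # => [10, 20, 30]
```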

headius added a commit that referenced this issue Jan 9, 2015
Previously, because we only had one yield instruction and one
way to prepare any incoming block, the lowest-common-denominator
was to always turn passed blocks into a Proc object before sending
it on to a yield. This is more overhead than in 1.7, and resulted
in our yield performance being worse than it should be. See #2442.

In order to fix this I have made the following changes.

* ReceiveClosure, the instruction which reifies an implicit block
  into a Proc, has been renamed to ReifyClosure. It is now used
  only for block args, as in def foo(&b).
* LoadImplicitClosure and LoadFrameClosure are added to retrieve
  a block from either the JVM stack or from our frame stack.
  Method bodies and singleton class bodies use the stack, and all
  other scopes use frame. This allows us to load the block used
  for yielding in an efficient way regardless of which scope we
  are in.
* IRScope temporarily aggregates two variables: one for the
  "yield" closure, the one used for yielding (which may come from
  either JVM stack or our frame); and one for the "implicit"
  closure, which is the one passed directly to this scope on the
  JVM stack. This additional closure is needed to represent
  two-block scopes like a block that receives a block.
* ProcessModuleBody now has an operand to load a block passed to
  the module/class body. For normal module and class bodies this
  is always NullBlock (new operand for Block.NULL_BLOCK), but
  singleton class bodies (class << obj) inherit the surrounding
  scope's yield closure (not the implicit closure, since that
  would capture nearest block's block rather than outer method's
  block).

Performance improves substantially with this change:
headius added a commit that referenced this issue Jan 9, 2015
Previously, because we only had one yield instruction and one
way to prepare any incoming block, the lowest-common-denominator
was to always turn passed blocks into a Proc object before sending
it on to a yield. This is more overhead than in 1.7, and resulted
in our yield performance being worse than it should be. See #2442.

In order to fix this I have made the following changes.

* ReceiveClosure, the instruction which reifies an implicit block
  into a Proc, has been renamed to ReifyClosure. It is now used
  only for block args, as in def foo(&b).
* LoadImplicitClosure and LoadFrameClosure are added to retrieve
  a block from either the JVM stack or from our frame stack.
  Method bodies and singleton class bodies use the stack, and all
  other scopes use frame. This allows us to load the block used
  for yielding in an efficient way regardless of which scope we
  are in.
* IRScope temporarily aggregates two variables: one for the
  "yield" closure, the one used for yielding (which may come from
  either JVM stack or our frame); and one for the "implicit"
  closure, which is the one passed directly to this scope on the
  JVM stack. This additional closure is needed to represent
  two-block scopes like a block that receives a block.
* ProcessModuleBody now has an operand to load a block passed to
  the module/class body. For normal module and class bodies this
  is always NullBlock (new operand for Block.NULL_BLOCK), but
  singleton class bodies (class << obj) inherit the surrounding
  scope's yield closure (not the implicit closure, since that
  would capture nearest block's block rather than outer method's
  block).

This results in a substantial reduction of allocation, and a
decent performance improvement:

BEFORE

               yield      6.901M (±19.0%) i/s -     32.351M
          block.call      6.509M (±10.6%) i/s -     31.999M
     method dispatch     18.988M (±16.8%) i/s -     90.797M

AFTER

               yield      8.002M (±22.4%) i/s -     36.745M
          block.call      6.399M (±12.8%) i/s -     31.314M
     method dispatch     18.939M (±10.5%) i/s -     92.585M
@headius
Member Author

headius commented Jan 9, 2015

Ok, after a good day of working on the new yield logic, I think we're calling it done for now. The new logic is largely as described above:

  • Each scope has two possible places to get a block from:
      • The literal block passed directly to a given scope
      • The yieldable block passed via some other structure (like a call frame)

All places that receive and handle an incoming block now use this logic, which means we're no longer creating a Proc every time you pass a block to a method, and we're only creating one for &b if you actually consume it. Good stuff.

Performance numbers with those changes aren't as big as we got from caching the block body, but they're still good:

BEFORE

               yield      6.901M (±19.0%) i/s -     32.351M
          block.call      6.509M (±10.6%) i/s -     31.999M
     method dispatch     18.988M (±16.8%) i/s -     90.797M

AFTER

               yield      8.002M (±22.4%) i/s -     36.745M
          block.call      6.399M (±12.8%) i/s -     31.314M
     method dispatch     18.939M (±10.5%) i/s -     92.585M

Now we're getting back in the neighborhood of JRuby 1.7 again.

A final allocation trace reveals that there's not much more we can eliminate:

          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
    1  9.54%  9.54%   4347264 90568   4347264 90568 596407 org.jruby.runtime.Binding
    2  7.95% 17.48%   3622720 90568   3622720 90568 596408 org.jruby.runtime.Block
    3  6.36% 23.84%   2898176 90568   2898176 90568 596405 org.jruby.runtime.scope.NoVarsDynamicScope
    4  4.77% 28.61%   2173632 90568   2173632 90568 596406 org.jruby.runtime.backtrace.BacktraceElement
    5  3.80% 32.41%   1733160 43329   1733160 43329 598437 org.jruby.RubyFixnum
    6  1.39% 33.80%    631360 11990    631360 11990 300000 char[]
    7  1.22% 35.02%    556416 11592    556416 11592 590794 org.jruby.RubyTime
    8  1.13% 36.14%    513728 9963    513728  9963 300000 java.lang.Object[]
    9  1.03% 37.17%    467856   57    467856    57 476224 byte[]
   10  1.02% 38.19%    463680 11592    463680 11592 590792 org.jruby.RubyString
   11  0.96% 39.14%    436400 10910    436400 10910 591259 org.jruby.RubyFixnum
   12  0.63% 39.78%    287920 11975    287920 11975 300000 java.lang.String
   13  0.61% 40.39%    278208 11592    278208 11592 590793 org.joda.time.DateTime

Of these only a couple are worth further investigation:

  • DateTime, RubyString, and RubyTime are being created together, as evidenced by the identical allocation counts. RubyString is worth looking into, since we shouldn't need a String inside a Time object.
  • The RubyFixnum objects are being allocated within the benchmark library. If we pass -Xjit.threshold=0 we can force those methods to JIT sooner and show up in the allocation trace. A good portion of the benchmark's overhead is the library itself.
  • BacktraceElement is used to represent a backtrace line for a body of code that JRuby is interpreting. Bodies that compile down to bytecode don't use BacktraceElement, so something's odd here.
  • NoVarsDynamicScope, Block, and Binding are all allocated to represent the binding context of the benchmark's block. We may be able to eliminate the scope, since it's obviously not needed in such a simple case.

We'll look into BacktraceElement first.

@headius
Member Author

headius commented Jan 9, 2015

The changes in 7e01a86 eliminate the need for us to clone a BacktraceElement object whenever we capture a binding. Instead of storing the object (which must be cloned so its moment-in-time line numbers are preserved), we just store its contents as normal fields on Binding. That makes Binding bigger, but we're allocating fewer objects.
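The same trade-off can be sketched in Ruby: instead of cloning a small snapshot object per capture, copy its scalar fields directly into the capturing object (InlineBinding is an illustrative name, not JRuby's class):

```ruby
# Hypothetical sketch: store the snapshot's scalar fields inline
# rather than cloning a separate snapshot object on every capture.
class InlineBinding          # illustrative, not JRuby's Binding
  attr_reader :file, :line

  def initialize(file, line)
    @file = file             # moment-in-time values stored as plain fields
    @line = line             # no extra BacktraceElement-style object needed
  end
end

b = InlineBinding.new("bench.rb", 26)
b.line  # => 26; one object allocated instead of two
```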

This gave a small improvement, from about 8M to 8.8M i/s on my machine.

I continued on to RubyString being allocated for every Time instance, and it turns out it's totally unnecessary: it's just the 'TZ' key used to look up the current system timezone.

We can eliminate this by caching the string in an internal variable of the Time object, avoiding the allocation altogether.
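The fix follows the usual string-hoisting pattern, sketched here in Ruby (TZ_KEY and current_zone are illustrative names; the real change is in RubyTime's Java code):

```ruby
# Hypothetical sketch: hoist the lookup key into a frozen constant so
# no fresh "TZ" String is allocated per Time construction.
TZ_KEY = "TZ".freeze

def current_zone(env = ENV)
  env[TZ_KEY]        # reuses the single frozen key on every call
end

current_zone({ TZ_KEY => "UTC" })  # => "UTC"
```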

@headius
Member Author

headius commented Jan 12, 2015

Finally, we have fixed the last allocation item we easily can, with fixes to RubyTime in 2dd881b to avoid this wasted "TZ" string object.

I did some analysis on the NoVarsDynamicScope, and I believe it is being allocated for the body of the benchmark, since there are nested scopes that need to be able to access each other. We will be doing analysis in the future to remove more of these temporary/useless scopes, but I'm going to consider that out of scope for this bug.

With the final numbers, I think we can call this bug fixed. We're definitely a lot faster than MRI now, and additional iteration on yield/call perf will be coming with other optimizations post-release.

@headius headius closed this as completed Jan 12, 2015
@headius headius added this to the JRuby 9.0.0.0-pre1 milestone Jan 12, 2015
@headius headius self-assigned this Jan 12, 2015