Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fast-ruby benchmark comparing #cover to #range is faster on MRI #3841

Closed
rjnienaber opened this issue May 3, 2016 · 13 comments
Closed

fast-ruby benchmark comparing #cover to #range is faster on MRI #3841

rjnienaber opened this issue May 3, 2016 · 13 comments

Comments

@rjnienaber
Copy link

I ran all the benchmarks from the fast-ruby project and the benchmark that compares #cover to #range seems to be faster on MRI versus JRuby.

Environment

$ uname -a
Darwin ldn131mac.local 14.5.0 Darwin Kernel Version 14.5.0: Wed Jul 29 02:26:53 PDT 2015; root:xnu-2782.40.9~1/RELEASE_X86_64 x86_64
$ gem list

*** LOCAL GEMS ***

benchmark-ips (2.3.0)
bundler (1.12.1)
rake (10.4.2)

Expected Behavior

$ rvm ruby-2.3.1 do bundle exec ruby -v code/range/cover-vs-include.rb
ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin14]
Calculating -------------------------------------
        range#cover?   102.770k i/100ms
      range#include?     8.913k i/100ms
       range#member?     9.004k i/100ms
       plain compare   118.650k i/100ms
-------------------------------------------------
        range#cover?      2.590M (± 4.3%) i/s -     12.949M
      range#include?     94.705k (± 8.9%) i/s -    472.389k
       range#member?     98.532k (± 3.0%) i/s -    495.220k
       plain compare      3.482M (± 5.0%) i/s -     17.442M

Comparison:
       plain compare:  3482364.0 i/s
        range#cover?:  2590352.2 i/s - 1.34x slower
       range#member?:    98532.1 i/s - 35.34x slower
      range#include?:    94705.2 i/s - 36.77x slower

Actual Behavior

$ rvm jruby-9.0.5.0 do bundle exec ruby -v code/range/cover-vs-include.rb
jruby 9.0.5.0 (2.2.3) 2016-01-26 7bee00d Java HotSpot(TM) 64-Bit Server VM 25.51-b03 on 1.8.0_51-b16 +jit [darwin-x86_64]
Calculating -------------------------------------
        range#cover?    47.056k i/100ms
      range#include?     3.079k i/100ms
       range#member?     3.858k i/100ms
       plain compare    75.894k i/100ms
-------------------------------------------------
        range#cover?    967.179k (± 4.6%) i/s -      4.847M
      range#include?     41.754k (± 3.9%) i/s -    209.372k
       range#member?     42.324k (± 3.0%) i/s -    212.190k
       plain compare      1.384M (± 3.4%) i/s -      6.906M

Comparison:
       plain compare:  1384071.4 i/s
        range#cover?:   967179.2 i/s - 1.43x slower
       range#member?:    42323.9 i/s - 32.70x slower
      range#include?:    41754.5 i/s - 33.15x slower
@chrisseaton
Copy link
Contributor

Wow isn't it fascinating though that the methods are almost the same ratio compared to each other, on both implementations, even though JRuby is slower overall.

@headius
Copy link
Member

headius commented May 3, 2016

How peculiar! I know that MRI's Range class has had a lot of optimization hackery over the past several years, so we probably still have inefficient versions of these methods. I'll have a looksee.

@headius
Copy link
Member

headius commented May 3, 2016

JRuby baseline without indy (indy has a bug that crashes include? in #3842).

Warming up --------------------------------------
        range#cover?    52.262k i/100ms
      range#include?     3.172k i/100ms
       range#member?     3.933k i/100ms
       plain compare    75.148k i/100ms
Calculating -------------------------------------
        range#cover?      1.199M (± 7.8%) i/s -      5.958M in   5.002106s
      range#include?     43.908k (± 6.7%) i/s -    218.868k in   5.009148s
       range#member?     43.427k (± 5.7%) i/s -    216.315k in   4.998438s
       plain compare      1.733M (± 6.2%) i/s -      8.642M in   5.006489s

MRI baseline

Warming up --------------------------------------
        range#cover?    86.262k i/100ms
      range#include?     6.360k i/100ms
       range#member?     6.420k i/100ms
       plain compare   108.916k i/100ms
Calculating -------------------------------------
        range#cover?      1.942M (± 6.6%) i/s -      9.748M in   5.043134s
      range#include?     68.182k (± 6.5%) i/s -    343.440k in   5.064089s
       range#member?     69.317k (± 4.0%) i/s -    346.680k in   5.009369s
       plain compare      2.682M (± 6.2%) i/s -     13.397M in   5.017090s

Ok, so a few immediate observations about the benchmark:

  • The cost of constructing the range is included in every iteration.

This is a rather large component of the overall time for at least the cover? benchmark. Here's the cover? benchmark without the range construction included on both MRI and JRuby:

JRuby:
range#cover?      1.722M (± 6.0%) i/s -      8.616M in   5.023941s

MRI:
range#cover?      3.253M (± 6.7%) i/s -     16.311M in   5.038503s

So about 50% better results in both cases. JRuby's still behind, though.

  • Range#include? in JRuby requires a frame, so it can call super.

This is very likely where most of the performance is going for at least include?.

  • In both JRuby and MRI, include? and member? are the same method, and will have the same performance characteristics.

Fixing one will fix both.

@headius
Copy link
Member

headius commented May 3, 2016

Another observation:

  • JRuby has traditionally not handled block dispatch as fast as MRI. The benchmark's dispatch of the report block skews the measurement of these methods.

This probably won't change things a great deal for the include? bench, but I'm sure the `cover? bench would produce better numbers with local iterations (a numeric loop) rather than block iterations.

@headius
Copy link
Member

headius commented May 3, 2016

A sampled profile (--sample) shows some peculiar results:

     Compiled + native   Method                        
 16.5%   325  +     2    org.jruby.java.invokers.InstanceMethodInvoker.call
 14.1%   279  +     0    Users.headius.projects.jruby.lib.ruby.stdlib.date.RUBY$method$\=\^=\_$0
 11.7%   231  +     0    org.jruby.RubyClass.finvoke
 10.0%   198  +     0    org.jruby.ir.runtime.IRRuntimeHelpers.isEQQ
  6.3%   125  +     0    org.jruby.internal.runtime.methods.AttrReaderMethod.call
  5.7%   113  +     0    org.jruby.RubyRange$INVOKER$i$1$0$cover_p.call

At first I was baffled to see InstanceMethodInvoker here because that's used for Java integration. Then I noticed the second result was a jitted Ruby method in Date.

I think a big part of why the cover? benchmark is slower in JRuby is that we implement <=> in Ruby, and MRI implements it in C. So JRuby's numbers here are comparing – at least in part – JRuby running pure Ruby code versus MRI C code.

Of course we'd like to run Ruby code as fast as C, but we're not quite there yet :-)

I will see if anything can be done with Date#<=> to improve perf other than rewriting it in Java.

@headius
Copy link
Member

headius commented May 3, 2016

Ok, so here's a baseline for the cover? bench with invokedynamic, which is my usual perf mode:

Warming up --------------------------------------
        range#cover?    71.816k i/100ms
Calculating -------------------------------------
        range#cover?      2.079M (± 7.5%) i/s -     10.342M in   5.005975s

So we're about 6s off MRI.

I made the following changes:

  • Override nonzero_p in RubyFixnum so it doesn't redispatch to zero? in RubyNumeric. (risk: user can't override Fixnum#zero? and have it be reflected in Fixnum#nonzero?)
  • Refactor Date#<=> to not use a case/when and rare cases moved to a separate method. (risk: need to namespace secondary method better)
  • Provide a non-overloaded call path for org.joda.time.DateTime.compareTo, used in our Date#<=> impl. (risk: none)

With those changes, here's updated indy numbers:

Warming up --------------------------------------
        range#cover?    85.554k i/100ms
Calculating -------------------------------------
        range#cover?      2.901M (± 9.5%) i/s -     14.373M in   5.005599s
Warming up --------------------------------------
        range#cover?   101.780k i/100ms
Calculating -------------------------------------
        range#cover?      2.959M (± 8.0%) i/s -     14.758M in   5.024526s

I showed a second iteration because it was a fair improvement over the first. With these numbers, we're 2/3 of the way to MRI without rewriting Date#<=> in Java!

@headius
Copy link
Member

headius commented May 3, 2016

Additional discovery: We are still not using invokedynamic to bind Java methods, so they go through more machinery and do not optimize/inline as well as they should. This affects the DateTime#compareTo call in the code below:

  def <=> (other)
    case other
    when Numeric
      ajd <=> other
    when Date
      # The method compareTo doesn't compare the sub milliseconds so after compare the two dates
      # then we have to compare the sub milliseconds to make sure that both are exactly equal.
      @dt.compareTo(other.dt).nonzero? || @sub_millis <=> other.sub_millis
    else
      begin
        l, r = other.coerce(self)
        l <=> r
      rescue NoMethodError
        nil
      end
    end
  end

So some portion of the remaining gap with MRI is in that Java call not optimizing.

Here also you can see the case/when and "unusual" cases that I factored out. Here's the modified version:

  class org::joda::time::DateTime
    java_alias :compareDT, :compareTo, [org.joda.time.ReadableInstant]
  end
  def <=> (other)
    if other.kind_of?(Date)
      # The method compareTo doesn't compare the sub milliseconds so after compare the two dates
      # then we have to compare the sub milliseconds to make sure that both are exactly equal.
      @dt.compareDT(other.dt).nonzero? || @sub_millis <=> other.sub_millis
    else
       __internal_cmp(other)
    end
  end

  private def __internal_cmp(other)
    if other.kind_of? Numeric
      ajd <=> other
    else
      begin
        l, r = other.coerce(self)
        l <=> r
      rescue NoMethodError
        nil
      end
    end
  end

This avoids case/when (because === requires a frame) and prioritizes what I believe to be the most common case, comparing two Date instances.

Here's the full diff:

diff --git a/core/src/main/java/org/jruby/RubyFixnum.java b/core/src/main/java/org/jruby/RubyFixnum.java
index 202a047..f71470c 100644
--- a/core/src/main/java/org/jruby/RubyFixnum.java
+++ b/core/src/main/java/org/jruby/RubyFixnum.java
@@ -1224,6 +1224,11 @@ public class RubyFixnum extends RubyInteger implements Constantizable {
         return input.getRuntime().newFixnum(input.unmarshalInt());
     }

+    @Override
+    public IRubyObject nonzero_p(ThreadContext context) {
+        return value == 0 ? context.nil : this;
+    }
+
     /*  ================
      *  Singleton Methods
      *  ================
diff --git a/lib/ruby/stdlib/date.rb b/lib/ruby/stdlib/date.rb
index 75cc0c6..ad36703 100644
--- a/lib/ruby/stdlib/date.rb
+++ b/lib/ruby/stdlib/date.rb
@@ -1402,14 +1402,22 @@ class Date
   # two DateTime instances.  When comparing a DateTime instance
   # with a Date instance, the time of the latter will be
   # considered as falling on midnight UTC.
+  class org::joda::time::DateTime
+    java_alias :compareDT, :compareTo, [org.joda.time.ReadableInstant]
+  end
   def <=> (other)
-    case other
-    when Numeric
-      ajd <=> other
-    when Date
+    if other.kind_of?(Date)
       # The method compareTo doesn't compare the sub milliseconds so after compare the two dates
       # then we have to compare the sub milliseconds to make sure that both are exactly equal.
-      @dt.compareTo(other.dt).nonzero? || @sub_millis <=> other.sub_millis
+      @dt.compareDT(other.dt).nonzero? || @sub_millis <=> other.sub_millis
+    else
+       __internal_cmp(other)
+    end
+  end
+
+  private def __internal_cmp(other)
+    if other.kind_of? Numeric
+      ajd <=> other
     else
       begin
         l, r = other.coerce(self)

@headius
Copy link
Member

headius commented May 3, 2016

With a temporary fix in place for #3842, here's updated numbers (including all above patches):

Warming up --------------------------------------
        range#cover?    84.511k i/100ms
      range#include?     3.307k i/100ms
       plain compare   101.362k i/100ms
Calculating -------------------------------------
        range#cover?      2.634M (± 9.8%) i/s -     13.015M in   4.992766s
      range#include?     52.290k (± 7.7%) i/s -    261.253k in   5.028989s
       plain compare      2.833M (± 8.1%) i/s -     14.089M in   5.007766s

This result for include? is slightly better, but the framing cost is still hurting us here.

Note that Range#cover? is nearly as fast as straight-up comparing values, so we're mostly seeing Date#<=> overhead in both of those cases.

@headius headius modified the milestones: JRuby 9.1.2.0, JRuby 9.1.1.0 May 11, 2016
@enebo enebo modified the milestones: JRuby 9.1.2.0, JRuby 9.1.3.0 May 23, 2016
@headius
Copy link
Member

headius commented Aug 22, 2016

Updated numbers for MRI 2.3, JRuby 9.1.3.0, and 9.1.3.0 with invokedynamic with only the date.rb patch in place:

[] ~/projects/jruby $ ruby23 bench_range_cover_include.rb 
Warming up --------------------------------------
        range#cover?   141.560k i/100ms
      range#include?     6.732k i/100ms
       range#member?     6.834k i/100ms
       plain compare   127.636k i/100ms
Calculating -------------------------------------
        range#cover?      3.251M (± 5.9%) i/s -     16.279M in   5.026191s
      range#include?     71.086k (± 4.7%) i/s -    356.796k in   5.030571s
       range#member?     71.025k (± 4.9%) i/s -    355.368k in   5.016301s
       plain compare      2.724M (± 7.1%) i/s -     13.657M in   5.039868s

Comparison:
        range#cover?:  3251053.1 i/s
       plain compare:  2724190.0 i/s - 1.19x slower
      range#include?:    71086.0 i/s - 45.73x slower
       range#member?:    71024.5 i/s - 45.77x slower

[] ~/projects/jruby $ jruby bench_range_cover_include.rb 
Warming up --------------------------------------
        range#cover?    69.549k i/100ms
      range#include?     3.226k i/100ms
       range#member?     4.052k i/100ms
       plain compare    91.638k i/100ms
Calculating -------------------------------------
        range#cover?      1.922M (± 7.6%) i/s -      9.598M in   5.025586s
      range#include?     45.076k (± 5.6%) i/s -    225.820k in   5.026104s
       range#member?     47.281k (± 5.4%) i/s -    239.068k in   5.072282s
       plain compare      1.939M (± 5.8%) i/s -      9.714M in   5.027413s

Comparison:
       plain compare:  1939188.9 i/s
        range#cover?:  1922161.8 i/s - same-ish: difference falls within error
       range#member?:    47281.1 i/s - 41.01x slower
      range#include?:    45076.0 i/s - 43.02x slower


[] ~/projects/jruby $ jruby -Xcompile.invokedynamic bench_range_cover_include.rb 
Warming up --------------------------------------
        range#cover?    89.255k i/100ms
      range#include?     3.237k i/100ms
       range#member?     4.552k i/100ms
       plain compare   111.917k i/100ms
Calculating -------------------------------------
        range#cover?      2.642M (± 8.2%) i/s -     13.120M in   5.004717s
      range#include?     50.663k (± 6.7%) i/s -    252.486k in   5.007506s
       range#member?     50.849k (± 4.6%) i/s -    254.912k in   5.024146s
       plain compare      2.718M (± 5.4%) i/s -     13.542M in   4.999904s

Comparison:
       plain compare:  2717562.8 i/s
        range#cover?:  2642457.4 i/s - same-ish: difference falls within error
       range#member?:    50848.9 i/s - 53.44x slower
      range#include?:    50663.0 i/s - 53.64x slower

There's more to do here but we've improved. 9.1.3.0 also introduces some call site caching into Java code that should make the Fixnum#nonzero? patch unnecessary.

@headius
Copy link
Member

headius commented Aug 22, 2016

More to do here in 9.1.4.0.

@headius
Copy link
Member

headius commented Nov 8, 2016

We'll do a perf push along with Ruby 2.4 support in JRuby 9.2.

@headius
Copy link
Member

headius commented May 15, 2018

We are comfortably faster than MRI in both jit modes, and the ratio for Comparable methods is similar to MRI. Calling this fixed.

@headius headius closed this as completed May 15, 2018
@headius
Copy link
Member

headius commented May 15, 2018

[] ~/projects/jruby $ jruby blah.rb
Warming up --------------------------------------
        range#cover?   189.161k i/100ms
      range#include?    24.419k i/100ms
       range#member?    26.571k i/100ms
       plain compare   220.324k i/100ms
Calculating -------------------------------------
        range#cover?     15.581M (± 8.2%) i/s -     77.178M in   5.000788s
      range#include?    318.586k (± 4.8%) i/s -      1.612M in   5.071723s
       range#member?    326.998k (± 3.1%) i/s -      1.647M in   5.042850s
       plain compare     15.053M (± 4.0%) i/s -     75.130M in   4.999913s

Comparison:
        range#cover?: 15580904.8 i/s
       plain compare: 15053493.5 i/s - same-ish: difference falls within error
       range#member?:   326998.2 i/s - 47.65x  slower
      range#include?:   318585.7 i/s - 48.91x  slower


[] ~/projects/jruby $ jruby -Xcompile.invokedynamic blah.rb
Warming up --------------------------------------
        range#cover?   195.523k i/100ms
      range#include?    25.040k i/100ms
       range#member?    28.219k i/100ms
       plain compare   232.007k i/100ms
Calculating -------------------------------------
        range#cover?     23.426M (± 7.9%) i/s -    115.750M in   4.985871s
      range#include?    331.167k (± 4.0%) i/s -      1.653M in   4.999496s
       range#member?    331.556k (± 5.0%) i/s -      1.665M in   5.035502s
       plain compare     22.685M (± 5.5%) i/s -    112.987M in   4.999446s

Comparison:
        range#cover?: 23426514.0 i/s
       plain compare: 22684630.2 i/s - same-ish: difference falls within error
       range#member?:   331555.6 i/s - 70.66x  slower
      range#include?:   331166.8 i/s - 70.74x  slower


[] ~/projects/jruby $ rvm ruby-2.5 do ruby blah.rb
Warming up --------------------------------------
        range#cover?   164.899k i/100ms
      range#include?     9.412k i/100ms
       range#member?     9.425k i/100ms
       plain compare   200.710k i/100ms
Calculating -------------------------------------
        range#cover?      2.834M (± 2.7%) i/s -     14.181M in   5.007089s
      range#include?     97.263k (± 2.8%) i/s -    489.424k in   5.035875s
       range#member?     98.124k (± 3.2%) i/s -    499.525k in   5.096090s
       plain compare      3.741M (± 2.8%) i/s -     18.867M in   5.046857s

Comparison:
       plain compare:  3741367.2 i/s
        range#cover?:  2834296.8 i/s - 1.32x  slower
       range#member?:    98124.1 i/s - 38.13x  slower
      range#include?:    97262.7 i/s - 38.47x  slower

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants