
Oat gem slower at steady-state with invokedynamic than without it #4596

Closed

ivoanjo opened this issue May 10, 2017 · 7 comments

Comments

ivoanjo (Contributor) commented May 10, 2017

Environment

Rubies:

  • jruby 9.1.8.0 (2.3.1) 2017-03-06 90fc7ab Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]

Linux: Linux maruchan 4.10.0-20-generic #22-Ubuntu SMP Thu Apr 20 09:22:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Expected Behavior

Performance on the benchmark below should not regress when running JRuby with -Xcompile.invokedynamic=true.

I was under the impression that after warmup, running JRuby with invokedynamic should never be slower than without it, but that does not seem to be the case on this benchmark.

require 'benchmark/ips'

require 'oat'
require 'oat/adapters/hal'

class Foo
  attr_reader :foo, :bar, :baz

  def initialize(**values)
    values.each do |(key, value)|
      respond_to?(key) ? instance_variable_set(:"@#{key}", value) : raise
    end
  end
end

class FooSerializer < Oat::Serializer
  adapter Oat::Adapters::HAL

  schema do
    property :foo, item.foo
    property :bar, item.bar
    property :baz, item.baz
  end
end

puts "Running with #{RUBY_DESCRIPTION} (JRUBY_OPTS: #{ENV['JRUBY_OPTS']})"

FOO = Foo.new(foo: 1, bar: 2, baz: 'baz')

Benchmark.ips do |benchmark|
  benchmark.time = 120
  benchmark.warmup = 120

  benchmark.report('FooSerializer') { FooSerializer.new(FOO).to_hash }

  benchmark.compare!
end

Actual Behavior

  • Without invokedynamic: FooSerializer 251.627k (±23.9%) i/s - 28.264M in 120.011879s
  • With invokedynamic: FooSerializer 33.785k (± 6.9%) i/s - 4.032M in 120.030513s

I've also tested with the latest master, and while the numbers go up slightly in either case, the general relationship between the two results stays the same.

enebo added this to the JRuby 9.1.9.0 milestone May 10, 2017
headius (Member) commented May 10, 2017

Yeah something's definitely wrong here.

headius (Member) commented May 10, 2017

It looks like something's getting endlessly invalidated, which is very costly. Most places where we use indy should already be failing over to a less-optimized path rather than constantly invalidating, so I'll poke around a bit.

headius (Member) commented May 10, 2017

Ahh I see. The indy logic in 9k does not take into consideration code that calls method_missing forever, and continues to try to cache it.

headius (Member) commented May 10, 2017

Ok, elaborating on this a bit more.

At any indy call site, we wrap the cached method in a SwitchPoint obtained from the original class. This acts as an "off" switch for the cache, flipped when the original class is modified. We create that SwitchPoint lazily, only when it is first requested.

In the case of method_missing, we request that SwitchPoint and then never use it. This is normally not a big deal; another method_missing call against the same class will get the same SwitchPoint.

However, the oat library is constantly generating new classes. That is in itself a bad practice, since it also hurts other Ruby implementations that cache methods based on the last-seen class. In JRuby it has the additional wrinkle that because it's a new class every time, it's a new SwitchPoint every time. So we go through all the effort of setting up the class and the SwitchPoint only to throw them away and start again on the next iteration.
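
To make that concrete, here is a toy sketch in plain Ruby. It is an assumption about the shape of the problem, not oat's actual implementation: a schema-style block is evaluated against a brand-new context object on every call, giving that object a singleton method forces a fresh singleton class into existence each time, and the DSL word `property` is not a real method, so every use of it is a method_missing dispatch.

def run_schema(&block)
  context = Object.new
  # defining a method directly on `context` materializes its singleton class
  def context.method_missing(name, *args)
    [:property_recorded, name, args]
  end
  puts context.singleton_class    # a different anonymous class on every call
  context.instance_exec(&block)   # `self` inside the block is `context`
end

3.times do
  # each iteration sees a brand-new class plus a method_missing dispatch for
  # `property`, so nothing keyed on "the receiver's class" can ever be reused
  p run_schema { property :foo, 42 }   # => [:property_recorded, :foo, [42]]
end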

This is the main source of the performance hit. It's marginally indy-related, since call sites and SwitchPoints and so on are involved, but it's more a symptom of class churn in oat.

The flaw in the JRuby logic is that if we keep failing to cache a method because we're supposed to dispatch through method_missing, we never give up trying to cache at that call site.

The simple fix will be to "fail" the call site after some number of method_missing calls, as we do after some number of invalidated, previously-cached calls. This will cause it to use the non-caching path and avoid the overhead.

I'm not sure what a more comprehensive fix would be. The classes entering your benchmark's schema block are always new and always using method_missing. There's not really anything to cache, as a result.
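
For illustration only, here is a toy model of that "fail the call site" idea in plain Ruby. The real logic lives in JRuby's Java call-site code, and the threshold value below is made up: after enough method_missing outcomes, the site stops trying to set up caching (and SwitchPoints) and just dispatches the slow, simple way.

class ToyCallSite
  FAIL_THRESHOLD = 8   # hypothetical number, not JRuby's actual constant

  def initialize(method_name)
    @method_name = method_name
    @misses = 0
    @failed = false
  end

  def call(receiver, *args)
    # once failed, skip all caching machinery and use plain dynamic dispatch
    return receiver.__send__(@method_name, *args) if @failed

    if receiver.respond_to?(@method_name)
      # (a real call site would install a guarded, cached fast path here)
      receiver.__send__(@method_name, *args)
    else
      @misses += 1
      @failed = true if @misses >= FAIL_THRESHOLD
      receiver.__send__(:method_missing, @method_name, *args)
    end
  end
end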

headius (Member) commented May 10, 2017

Ok, so a quick fix that immediately fails the method_missing-based call site (which, btw, is the "property" calls inside the "schema" block in your benchmark) returns perf to more like what we'd expect:

[] ~/projects/jruby $ jruby -Xcompile.invokedynamic blah.rb
Running with jruby 9.1.9.0-SNAPSHOT (2.3.3) 2017-05-10 941cfe8 Java HotSpot(TM) 64-Bit Server VM 25.112-b16 on 1.8.0_112-b16 +indy +jit [darwin-x86_64] (JRUBY_OPTS: )
Warming up --------------------------------------
       FooSerializer    12.525k i/100ms
Calculating -------------------------------------
       FooSerializer    200.413k (±18.0%) i/s -     19.890M in 120.242456s

[] ~/projects/jruby $ jruby blah.rb
Running with jruby 9.1.9.0-SNAPSHOT (2.3.3) 2017-05-10 941cfe8 Java HotSpot(TM) 64-Bit Server VM 25.112-b16 on 1.8.0_112-b16 +jit [darwin-x86_64] (JRUBY_OPTS: )
Warming up --------------------------------------
       FooSerializer    10.318k i/100ms
Calculating -------------------------------------
       FooSerializer    152.951k (±16.0%) i/s -     16.230M in 120.297981s

I am not familiar with this library. If this pattern you are benchmarking is typical of hot-path runtime execution, then the library author may want to reconsider creating so many singleton objects and doing method_missing dispatch against them, because that's going to be a system-wide perf hit on MRI and locally rather poor on JRuby even with this fix.
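
As a purely hypothetical illustration of that suggestion (this is not oat's API), a DSL like this could capture the schema once, at class-definition time, into plain data, so that serializing an object only makes real, cacheable method calls on stable classes instead of re-evaluating a block via method_missing per instance:

class StaticSerializer
  def self.property(name)
    (@properties ||= []) << name
  end

  def self.properties
    @properties || []
  end

  def initialize(item)
    @item = item
  end

  def to_hash
    self.class.properties.each_with_object({}) do |name, hash|
      hash[name] = @item.public_send(name)   # a normal, cacheable call
    end
  end
end

class FooStaticSerializer < StaticSerializer
  property :foo
  property :bar
  property :baz
end

# FooStaticSerializer.new(FOO).to_hash  # => {:foo=>1, :bar=>2, :baz=>"baz"}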

I will put a more thoughtful fix into 9.1.9.0 for this.

headius added a commit to headius/jruby that referenced this issue May 10, 2017
Fixes jruby#4596.

JRuby currently does not cache anything for method_missing calls
(i.e. calls to methods that do not exist on the target class's
method table). Instead, after discovering that the method is missing, it
proceeds down a slow lookup path, dispatching to method_missing
with a symbolic method name.

However this logic still attempts to acquire a SwitchPoint from
the target class, since that SP must be acquired before the
initial method lookup to ensure proper invalidation ordering.

Normally, this isn't a problem. If the class is still relatively
static, the SwitchPoint will be created once, and then the cost
of calling method_missing is the same as for non-indy. However,
the code in this issue churns through new types very rapidly.
This led to a large number of unused SwitchPoints being created,
one for each class. And since the
method_missing path never cached anything, it never updated call
site thresholds, resulting in call sites that would continue this
logic forever.

Normal calls, if they fail repeatedly, will eventually fail the
entire site and revert to non-indy method dispatch (currently a
simple monomorphic cache).

This patch lifts the threshold-bumping out of the caching logic
so it can also be used by method_missing dispatches. This allows
repeated method_missing calls against different classes to also
fail the site, reverting it to non-indy behavior. With this patch,
the benchmark in jruby#4596 runs faster with indy than without, as
expected.
headius (Member) commented May 10, 2017

I have pushed #4602 which contains a reasonable fix for this issue.

I will again point out, however, that the pattern being benchmarked will never perform "well" due to the large number of singleton classes being created.

ivoanjo (Contributor, Author) commented May 11, 2017

Wow @headius, thanks for the lightning-fast answer and awesomely detailed analysis!

I started looking at this while benchmarking a test application with invokedynamic enabled, and in that configuration the serialization of responses (this was an HTTP JSON HAL API) was spending a noticeable amount of time in this part of the code.

With invokedynamic off, the impact became rather small compared to other parts of request processing, but as I improve those parts I think I'll get to the point where this becomes a lot more noticeable.

I'll report this to the oat developers and perhaps I'll have some time to pick it up and try to fix it myself.

> I am not familiar with this library. If this pattern you are benchmarking is typical of hot-path runtime execution, then the library author may want to reconsider creating so many singleton objects and doing method_missing dispatch against them, because that's going to be a system-wide perf hit on MRI and locally rather poor on JRuby even with this fix.

Just one question regarding your comment: are the singleton objects (and the constant generation of new classes) a result of the calls to instance_eval? I don't see any other place that may be causing this in this test case.
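
A minimal illustration of why that suspicion is plausible (whether instance_eval is actually the culprit in oat is not confirmed here): instance_eval runs its block with `self` switched to the receiver, and every individual object has its own singleton class, so anything cached "per class" sees a different class for each object the block is evaluated against.

a = Object.new
b = Object.new

a.instance_eval { puts singleton_class }      # => #<Class:#<Object:0x...>>
b.instance_eval { puts singleton_class }      # a different anonymous class

puts a.singleton_class == b.singleton_class   # => false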
