
Oat gem slower at steady-state with invokedynamic than without it #4596

Closed

ivoanjo opened this issue May 10, 2017 · 7 comments

Comments

ivoanjo (Contributor) commented May 10, 2017

Environment

Rubies:

  • jruby 9.1.8.0 (2.3.1) 2017-03-06 90fc7ab Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]

Linux: Linux maruchan 4.10.0-20-generic #22-Ubuntu SMP Thu Apr 20 09:22:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Expected Behavior

Performance on the benchmark below should not regress when running JRuby with -Xcompile.invokedynamic=true.

I was under the impression that after warmup, running JRuby with invokedynamic should never be slower than without it, but that does not seem to be the case on this benchmark.

require 'benchmark/ips'

require 'oat'
require 'oat/adapters/hal'

class Foo
  attr_reader :foo, :bar, :baz

  def initialize(**values)
    values.each do |(key, value)|
      respond_to?(key) ? instance_variable_set(:"@#{key}", value) : raise
    end
  end
end

class FooSerializer < Oat::Serializer
  adapter Oat::Adapters::HAL

  schema do
    property :foo, item.foo
    property :bar, item.bar
    property :baz, item.baz
  end
end

puts "Running with #{RUBY_DESCRIPTION} (JRUBY_OPTS: #{ENV['JRUBY_OPTS']})"

FOO = Foo.new(foo: 1, bar: 2, baz: 'baz')

Benchmark.ips do |benchmark|
  benchmark.time = 120
  benchmark.warmup = 120

  benchmark.report('FooSerializer') { FooSerializer.new(FOO).to_hash }

  benchmark.compare!
end

Actual Behavior

  • Without invokedynamic: FooSerializer 251.627k (±23.9%) i/s - 28.264M in 120.011879s
  • With invokedynamic: FooSerializer 33.785k (± 6.9%) i/s - 4.032M in 120.030513s

I've also tested with the latest master, and while the numbers go up slightly in either case, the general relationship between the two results stays the same.

enebo added this to the JRuby 9.1.9.0 milestone May 10, 2017
headius (Member) commented May 10, 2017

Yeah something's definitely wrong here.

headius (Member) commented May 10, 2017

It looks like something's getting endlessly invalidated, which is very costly. Most places where we use indy should already be failing over to a less-optimized path rather than constantly invalidating, so I'll poke around a bit.

headius (Member) commented May 10, 2017

Ahh I see. The indy logic in 9k does not take into consideration code that calls method_missing forever, and continues to try to cache it.

headius (Member) commented May 10, 2017

Ok, elaborating on this a bit more.

At any indy call site, we wrap the cached method in a SwitchPoint obtained from the original class. This acts as an "off" switch for the cache, flipped when the original class is modified. We create that SwitchPoint lazily, only when it is first requested.

In the case of method_missing, we request that SwitchPoint and then never use it. This is normally not a big deal; another method_missing call against the same class will get the same SwitchPoint.

However, the oat library is constantly generating new classes. That is in itself a bad practice, since it also hurts other Ruby implementations that cache methods based on the last-seen class. In JRuby it has the additional wrinkle that because it's a new class every time, it's a new SwitchPoint every time. So we go through all the effort of setting up the class and the SwitchPoint only to throw them away and start again on the next iteration.
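
To make that concrete, here is a toy sketch in plain Ruby. It is an assumption about the shape of the problem, not oat's actual implementation: a schema-style block is evaluated against a brand-new context object on every call, giving that object a singleton method forces a fresh singleton class into existence each time, and the DSL word `property` is not a real method, so every use of it is a method_missing dispatch.

def run_schema(&block)
  context = Object.new
  # defining a method directly on `context` materializes its singleton class
  def context.method_missing(name, *args)
    [:property_recorded, name, args]
  end
  puts context.singleton_class    # a different anonymous class on every call
  context.instance_exec(&block)   # `self` inside the block is `context`
end

3.times do
  # each iteration sees a brand-new class plus a method_missing dispatch for
  # `property`, so nothing keyed on "the receiver's class" can ever be reused
  p run_schema { property :foo, 42 }   # => [:property_recorded, :foo, [42]]
end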

This is the main source of the performance hit. It's marginally indy-related, since call sites and SwitchPoints and so on are involved, but it's more a symptom of class churn in oat.

The flaw in the JRuby logic is that if we keep failing to cache a method because we're supposed to dispatch through method_missing, we never give up trying to cache at that call site.

The simple fix will be to "fail" the call site after some number of method_missing calls, as we do after some number of invalidated, previously-cached calls. This will cause it to use the non-caching path and avoid the overhead.

I'm not sure what a more comprehensive fix would be. The classes entering your benchmark's schema block are always new and always using method_missing. There's not really anything to cache, as a result.
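
For illustration only, here is a toy model of that "fail the call site" idea in plain Ruby. The real logic lives in JRuby's Java call-site code, and the threshold value below is made up: after enough method_missing outcomes, the site stops trying to set up caching (and SwitchPoints) and just dispatches the slow, simple way.

class ToyCallSite
  FAIL_THRESHOLD = 8   # hypothetical number, not JRuby's actual constant

  def initialize(method_name)
    @method_name = method_name
    @misses = 0
    @failed = false
  end

  def call(receiver, *args)
    # once failed, skip all caching machinery and use plain dynamic dispatch
    return receiver.__send__(@method_name, *args) if @failed

    if receiver.respond_to?(@method_name)
      # (a real call site would install a guarded, cached fast path here)
      receiver.__send__(@method_name, *args)
    else
      @misses += 1
      @failed = true if @misses >= FAIL_THRESHOLD
      receiver.__send__(:method_missing, @method_name, *args)
    end
  end
end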

headius (Member) commented May 10, 2017

Ok, so a quick fix that immediately fails the method_missing-based call site (which, btw, is the "property" calls inside the "schema" block in your benchmark) returns perf to more like what we'd expect:

[] ~/projects/jruby $ jruby -Xcompile.invokedynamic blah.rb
Running with jruby 9.1.9.0-SNAPSHOT (2.3.3) 2017-05-10 941cfe8 Java HotSpot(TM) 64-Bit Server VM 25.112-b16 on 1.8.0_112-b16 +indy +jit [darwin-x86_64] (JRUBY_OPTS: )
Warming up --------------------------------------
       FooSerializer    12.525k i/100ms
Calculating -------------------------------------
       FooSerializer    200.413k (±18.0%) i/s -     19.890M in 120.242456s

[] ~/projects/jruby $ jruby blah.rb
Running with jruby 9.1.9.0-SNAPSHOT (2.3.3) 2017-05-10 941cfe8 Java HotSpot(TM) 64-Bit Server VM 25.112-b16 on 1.8.0_112-b16 +jit [darwin-x86_64] (JRUBY_OPTS: )
Warming up --------------------------------------
       FooSerializer    10.318k i/100ms
Calculating -------------------------------------
       FooSerializer    152.951k (±16.0%) i/s -     16.230M in 120.297981s

I am not familiar with this library. If this pattern you are benchmarking is typical of hot-path runtime execution, then the library author may want to reconsider creating so many singleton objects and doing method_missing dispatch against them, because that's going to be a system-wide perf hit on MRI and locally rather poor on JRuby even with this fix.
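
As a purely hypothetical illustration of that suggestion (this is not oat's API), a DSL like this could capture the schema once, at class-definition time, into plain data, so that serializing an object only makes real, cacheable method calls on stable classes instead of re-evaluating a block via method_missing per instance:

class StaticSerializer
  def self.property(name)
    (@properties ||= []) << name
  end

  def self.properties
    @properties || []
  end

  def initialize(item)
    @item = item
  end

  def to_hash
    self.class.properties.each_with_object({}) do |name, hash|
      hash[name] = @item.public_send(name)   # a normal, cacheable call
    end
  end
end

class FooStaticSerializer < StaticSerializer
  property :foo
  property :bar
  property :baz
end

# FooStaticSerializer.new(FOO).to_hash  # => {:foo=>1, :bar=>2, :baz=>"baz"}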

I will put a more thoughtful fix into 9.1.9.0 for this.

headius added a commit to headius/jruby that referenced this issue May 10, 2017
Fixes jruby#4596.

JRuby currently does not cache anything for method_missing calls
(i.e. calls to methods that do not exist on the target class's
method table). Instead, after discovering that the method is missing, it
proceeds down a slow lookup path, dispatching to method_missing
with a symbolic method name.

However this logic still attempts to acquire a SwitchPoint from
the target class, since that SP must be acquired before the
initial method lookup to ensure proper invalidation ordering.

Normally, this isn't a problem. If the class is still relatively
static, the SwitchPoint will be created once, and then the cost
of calling method_missing is the same as for non-indy. However,
the code in this issue churns through new types very rapidly.
This led to a large number of unused SwitchPoints being created,
one for each class. And since the
method_missing path never cached anything, it never updated call
site thresholds, resulting in call sites that would continue this
logic forever.

Normal calls, if they fail repeatedly, will eventually fail the
entire site and revert to non-indy method dispatch (currently a
simple monomorphic cache).

This patch lifts the threshold-bumping out of the caching logic
so it can also be used by method_missing dispatches. This allows
repeated method_missing calls against different classes to also
fail the site, reverting it to non-indy behavior. With this patch,
the benchmark in jruby#4596 runs faster with indy than without, as
expected.
headius (Member) commented May 10, 2017

I have pushed #4602 which contains a reasonable fix for this issue.

I will again point out, however, that the pattern being benchmarked will never perform "well" due to the large number of singleton classes being created.

ivoanjo (Contributor, Author) commented May 11, 2017

Wow @headius, thanks for the lightning-fast answer and awesomely detailed analysis!

I started looking at this while benchmarking a test application with invokedynamic enabled, and in that configuration the serialization of responses (this was an HTTP JSON HAL API) was spending a noticeable amount of time in this part of the code.

With invokedynamic off, the impact became rather small compared to other parts of request processing, but as I improve those parts I think I'll get to the point where this becomes a lot more noticeable.

I'll report this to the oat developers and perhaps I'll have some time to pick it up and try to fix it myself.

> I am not familiar with this library. If this pattern you are benchmarking is typical of hot-path runtime execution, then the library author may want to reconsider creating so many singleton objects and doing method_missing dispatch against them, because that's going to be a system-wide perf hit on MRI and locally rather poor on JRuby even with this fix.

Just one question regarding your comment: are the singleton objects (and the constant generation of new classes) a result of the calls to instance_eval? I don't see any other place that may be causing this in this test case.
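
A minimal illustration of why that suspicion is plausible (whether instance_eval is actually the culprit in oat is not confirmed here): instance_eval runs its block with `self` switched to the receiver, and every individual object has its own singleton class, so anything cached "per class" sees a different class for each object the block is evaluated against.

a = Object.new
b = Object.new

a.instance_eval { puts singleton_class }      # => #<Class:#<Object:0x...>>
b.instance_eval { puts singleton_class }      # a different anonymous class

puts a.singleton_class == b.singleton_class   # => false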
