
[Truffle] Unexpectedly very slow? #3355

Closed
PragTob opened this issue Sep 26, 2015 · 28 comments


PragTob commented Sep 26, 2015

Hi there, thanks for all your work :)

I have a little pure-Ruby project in the making (rubykon). It is CPU-bound and performance-critical, so I wanted to see Truffle/Graal shine. However, it is super slow for some reason: ~10 times slower than the slowest other Ruby implementation I tested (1.9.3, that is), and almost 25 times slower than the fastest (JRuby 9k). You can see this file for benchmark results - the most important is the 19x19 full playout.

The benchmark uses 5 seconds of warmup and 20 seconds of execution time. It's all the same code (different input sizes), so by the time it hits the last important benchmark it has already had 55 seconds of warmup. I also tried larger warmup times (20 and 30 seconds, I believe), but it didn't help; it only blew up the heap.

I use the graal nightly from today:

tobi@tobi-desktop ~/github/rubykon $ rvm use jruby-dev-graal
Using /home/tobi/.rvm/gems/jruby-dev-graal
tobi@tobi-desktop ~/github/rubykon $ ruby -v
jruby 9.0.2.0-SNAPSHOT (2.2.2) 2015-09-26 f2612a2 OpenJDK 64-Bit Server VM 25.40-b25-internal-graal-0.7 on 1.8.0-internal-b132 +jit [linux-amd64]
tobi@tobi-desktop ~/github/rubykon $ ruby -X+T -e 'puts Truffle.graal?'
true
tobi@tobi-desktop ~/github/rubykon $ ruby -X+T -J-Xmx1500m -Xtruffle.printRuntime=true benchmark/full_playout.rb 
jruby: warning: unknown property jruby.truffle.printRuntime
Calculating -------------------------------------
9x9 full playout (+ score)
                         1.000  i/100ms
13x13 full playout (+ score)
                         1.000  i/100ms
19x19 full playout (+ score)
                         1.000  i/100ms
Calculating -------------------------------------
9x9 full playout (+ score)
                          6.631  (± 30.2%) i/s -    115.000 
13x13 full playout (+ score)
                          3.931  (± 25.4%) i/s -     73.000 
19x19 full playout (+ score)
                          1.665  (± 0.0%) i/s -     31.000 
tobi@tobi-desktop ~/github/rubykon $ 

Also -Xtruffle.printRuntime=true does not seem to work for me.

Btw, the current implementation is very naive and will hopefully get faster soon as various aspects are optimized (fewer object allocations etc.); however, this seemed very strange and I thought you might be interested. This version is tagged 0.1.

Ah yes, benchmarking is done using the benchmark-ips gem, which I put in an ignored folder so that Graal can load it without gem install :)

Hope this helps! :)
Tobi

@chrisseaton

This looks like it could be a fantastic little benchmark, so I'd be very happy to take a look at it in Truffle and figure out why it's slow. Thanks for letting us know.

@chrisseaton chrisseaton self-assigned this Sep 27, 2015
@chrisseaton chrisseaton added this to the truffle-dev milestone Sep 27, 2015
@chrisseaton

First of all, I want to run your specs in JRuby+Truffle, as there's no use benchmarking something that is broken. You are right that we don't support RubyGems very well yet, but we have a tool to help work around this.

If you get the source code for JRuby, you can do:

$ tool/jt.rb install-tool

This installs a small tool, jruby+truffle, which helps us work with your gem. Inside your gem's repository I then run:

$ jruby+truffle setup

This gets normal JRuby to help us install the gems and patch a couple of things we don't run correctly yet in common gems. You can then run your specs:

$ jruby+truffle run -S rspec spec/rubykon/
$ ../jruby/bin/jruby -X\+T -Xtruffle.core.load_path\=../jruby/truffle/src/main/ruby -r ./.jruby\+truffle_bundle/bundler/setup.rb .jruby\+truffle_bundle/jruby\+truffle/2.2.0/gems/rspec-core-3.3.2/exe/rspec spec/rubykon/
Coverage may be inaccurate; Try setting JRUBY_OPTS="-Xcli.debug=true --debug"
............................................................................................................................................................................................................................

Finished in 5.07 seconds (files took 4.84 seconds to load)
220 examples, 0 failures

That's a very good start. We can run all of your specs, off-the-shelf. Tomorrow I'll start to look at why it isn't very fast.

I don't think we have -Xtruffle.printRuntime=true anymore - I'll update the documentation.


PragTob commented Sep 27, 2015

👍 thanks @chrisseaton that's some very useful information, should maybe also go into the docs? :)

@chrisseaton

I'll keep a running log here of what I'm doing. Maybe it will turn into an interesting case study.

The first thing I've done is improve our random number generator primitives, which your benchmark uses. These are used by the Rubinius code we're re-using, and are implemented using the same backend as JRuby. Before, they weren't quite right for compilation, and weren't producing the same numbers as other implementations for the same seed. I've fixed both of those issues in 88c38ca80c021566bf98e80633951670312478f5.
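To illustrate the seeding expectation in plain Ruby (this shows the observable property, not the Truffle internals): two Random instances created with the same seed must produce identical sequences, and that must hold across implementations too, which is what the fix restores.

```ruby
# Two generators seeded identically must agree; this determinism is
# the cross-implementation property the RNG fix restores.
r1 = Random.new(42)
r2 = Random.new(42)

seq1 = Array.new(5) { r1.rand(100) }
seq2 = Array.new(5) { r2.rand(100) }

seq1 == seq2  # the sequences are identical
```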

@chrisseaton

Then I added an option to our jruby+truffle tool in 07dec75 so that I can pass in VM options like -G:+TraceTruffleCompilation and -G:+TraceTruffleTransferToInterpreter, which give me a high-level view of what is being optimised and deoptimised.

$ jruby+truffle --graal-path ../../graal/graalvm-jdk1.8.0/bin/java run --graal -J-G:+TraceTruffleCompilation benchmark/full_playout.rb 
warning: ignoring extraneous `ruby-' prefix in version `ruby-2.2.3'
         (set by /Users/chrisseaton/Documents/ruby/rubykon/.ruby-version)
$ JAVACMD=../../graal/graalvm-jdk1.8.0/bin/java ../jruby/bin/jruby -X\+T -J-G:\+TraceTruffleCompilation -Xtruffle.core.load_path\=../jruby/truffle/src/main/ruby -r ./.jruby\+truffle_bundle/bundler/setup.rb benchmark/full_playout.rb
Calculating -------------------------------------
9x9 full playout (+ score)
                    [truffle] opt done         on_board?:board.rb:57 <opt>                                 |ASTSize      57/   91 |Time   525( 440+85  )ms |DirectCallNodes I    6/D    0 |GraalNodes   119/  145 |CodeSize          407 |Source    board.rb:57 
[truffle] opt done         []:board.rb:23 <opt>                                        |ASTSize      38/   74 |Time   250( 143+107 )ms |DirectCallNodes I    4/D    0 |GraalNodes   240/  338 |CodeSize          720 |Source    board.rb:23 
[truffle] opt done         block in inject:board.rb:36 <opt> <split-0-U>               |ASTSize      92/  267 |Time   472( 291+181 )ms |DirectCallNodes I   13/D    0 |GraalNodes   714/  951 |CodeSize         2839 |Source    board.rb:36 
[truffle] opt done         nil?:core: Kernel#nil? <opt> <split-4-U>                    |ASTSize       4/    4 |Time    26(  21+6   )ms |DirectCallNodes I    0/D    0 |GraalNodes    17/    3 |CodeSize           66 |Source core: Kernel#nil? 
[truffle] opt done         nil?:core: Kernel#nil? <opt> <split-5-U>                    |ASTSize       4/    4 |Time    17(  15+2   )ms |DirectCallNodes I    0/D    0 |GraalNodes    17/    3 |CodeSize           66 |Source core: Kernel#nil? 
[truffle] opt done         inject:core: Array#inject <split-1-U>         
etc etc etc...

@chrisseaton

I started with the simpler of the benchmarks, benchmark/scoring.rb (note that with the tool no special code is needed to require benchmark/ips). I enabled the option -J-G:+TruffleCompilationExceptionsAreFatal, which makes compiler faults stop the program rather than having the VM just try to make do when they happen, and -J-G:-GraphPE, which on this release of Graal produces better errors (it disables a newer graph-based partial evaluator in favour of the older bytecode-based partial evaluator, which is better tuned for error messages).

$ jruby+truffle --graal-path ../../graal/graalvm-jdk1.8.0/bin/java run --graal -J-G:+TruffleCompilationExceptionsAreFatal -J-G:-GraphPE benchmark/scoring.rb

Using this I spotted a couple of errors.

Our code for reading the time zone, which is used in the benchmark/ips harness, had unbounded recursion which caused a stack-overflow in the partial evaluator expansion phase.

  at parsing org.joda.time.DateTimeZone.forID(DateTimeZone.java:211)
  at parsing org.joda.time.DateTimeZone.getDefault(DateTimeZone.java:151)
  at parsing org.joda.time.DateTimeZone.forID(DateTimeZone.java:211)
  at parsing org.joda.time.DateTimeZone.getDefault(DateTimeZone.java:151)
  at parsing org.joda.time.DateTimeZone.forID(DateTimeZone.java:211)
  at parsing org.joda.time.DateTimeZone.getDefault(DateTimeZone.java:151)
  at parsing org.joda.time.DateTimeZone.forID(DateTimeZone.java:211)
  at parsing org.joda.time.DateTimeZone.getDefault(DateTimeZone.java:151)
  at parsing org.jruby.truffle.nodes.time.ReadTimeZoneNode.execute(ReadTimeZoneNode.java:50)
  at parsing org.jruby.truffle.nodes.rubinius.TimePrimitiveNodes$TimeSNowPrimitiveNode.timeSNow(TimePrimitiveNodes.java:50)

Secondly, some allocations were using a 'shape' for objects that was not constant and so couldn't be compiled efficiently.

Caused by: com.oracle.graal.api.code.SourceStackTrace$1: Partial evaluation did not reduce value to a constant, is a regular compiler node: 1714|CheckCast
  at org.jruby.truffle.runtime.layouts.StringLayoutImpl.createString(StringLayoutImpl.java:76)
  at org.jruby.truffle.nodes.core.StringNodes$AllocateNode.allocate(StringNodes.java:87)

These problems are fixed in db649ae.

We're still slower than MRI though, so the investigation continues.

@chrisseaton

My next step is to break down the code path into some micro-benchmarks and test them individually until we find the problem.

My first suspicion was Board#each, which uses Array#flatten, which we have implemented using this Rubinius code:

# Helper to recurse through flattening since the method
# is not allowed to recurse itself. Detects recursive structures.
def recursively_flatten(array, out, max_levels = -1)
  modified = false
  # Strict equality since < 0 means 'infinite'
  if max_levels == 0
    out.concat(array)
    return false
  end
  max_levels -= 1
  recursion = Thread.detect_recursion(array) do
    m = Rubinius::Mirror::Array.reflect array
    i = m.start
    total = i + m.total
    tuple = m.tuple
    while i < total
      o = tuple.at i
      if ary = Rubinius::Type.check_convert_type(o, Array, :to_ary)
        modified = true
        recursively_flatten(ary, out, max_levels)
      else
        out << o
      end
      i += 1
    end
  end
  raise ArgumentError, "tried to flatten recursive array" if recursion
  modified
end

Sometimes we have problems with this code using patterns that we haven't optimised yet, and so we have to write a Java implementation instead. block_given? is the main problem (#3322), as it currently causes frames to be materialised (turned into heap objects so they can be accessed by the block_given? method), and I thought this method would be using that somehow. However, it turns out this method is fine.
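For reference, the observable behaviour that Rubinius code implements can be checked in any Ruby:

```ruby
# Depth -1 (the default) means flatten completely; an explicit depth
# stops early, leaving deeper nesting untouched.
flat    = [1, [2, [3]]].flatten     # fully flattened
shallow = [1, [2, [3]]].flatten(1)  # one level only

# The Thread.detect_recursion check in the code above is what makes a
# self-referencing array raise instead of recursing forever.
a = []
a << a
begin
  a.flatten
rescue ArgumentError => e
  message = e.message  # "tried to flatten recursive array"
end
```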

board_19 = Rubykon::Board.new 19

benchmark.report '19x19 each' do
  board_19.each do |stone|
  end
end
$ bundle exec ruby benchmark/scoring_micros.rb # 2.2.3
Calculating -------------------------------------
              19x19      2.381k i/100ms
-------------------------------------------------
              19x19      24.637k (± 4.5%) i/s -    738.110k

$ ~/.rbenv/versions/jruby-9.0.1.0/bin/ruby -I lib benchmark/scoring_micros.rb 
Calculating -------------------------------------
          19x19 each     1.863k i/100ms
-------------------------------------------------
          19x19 each     18.952k (± 4.4%) i/s -    568.215k

$ jruby+truffle --graal-path ../../graal/graalvm-jdk1.8.0/bin/java run --graal benchmark/scoring_micros.rb
Calculating -------------------------------------
          19x19 each     7.928k i/100ms
-------------------------------------------------
          19x19 each     92.465k (± 7.6%) i/s -      2.751M

There we are running at 92k, compared to JRuby at 18k, and MRI at 24k. So that is a good sign - we are running parts of this faster than anyone else. Just looking for the problem still...

@chrisseaton

Within that loop the only thing that looks immediately interesting is score_empty_cutting_point, so I benchmarked that, along with methods inside it.

Benchmark.ips do |benchmark|
  benchmark.config time: 30, warmup: 60

  board = Rubykon::Board.new 19
  cutting_point = board[8, 8]
  scorer = Rubykon::GameScorer.new

  benchmark.report 'score_empty_cp' do
    game_score = {Rubykon::Board::BLACK_COLOR => 0, Rubykon::Board::WHITE_COLOR => Rubykon::Game::DEFAULT_KOMI}
    scorer.score_empty_cutting_point(cutting_point, board, game_score)
  end

  benchmark.report 'neighbour_c_of' do
    board.neighbour_colors_of(cutting_point.x, cutting_point.y)
  end

  neighbour_colors = board.neighbour_colors_of(cutting_point.x, cutting_point.y)

  benchmark.report 'find_cand_c_of' do
    scorer.find_candidate_color(neighbour_colors)
  end

  candidate_color = scorer.find_candidate_color(neighbour_colors)

  benchmark.report 'only_one_c_adj?' do
    scorer.only_one_color_adjacent?(neighbour_colors, candidate_color)
  end
end
$ bundle exec ruby benchmark/scoring_micros.rb
Calculating -------------------------------------
      score_empty_cp    14.099k i/100ms
      neighbour_c_of    18.790k i/100ms
      find_cand_c_of    62.174k i/100ms
     only_one_c_adj?    60.057k i/100ms
-------------------------------------------------
      score_empty_cp    183.590k (± 6.7%) i/s -      5.485M
      neighbour_c_of    262.356k (± 5.9%) i/s -      7.854M
      find_cand_c_of      1.350M (± 7.0%) i/s -     40.351M
     only_one_c_adj?      1.303M (± 6.5%) i/s -     38.917M


$ jruby+truffle --graal-path ../../graal/graalvm-jdk1.8.0/bin/java run --graal -J-G:+TruffleCompilationExceptionsAreFatal -J-G:-GraphPE benchmark/scoring_micros.rb
Calculating -------------------------------------
      score_empty_cp     3.675k i/100ms
      neighbour_c_of    50.355k i/100ms
      find_cand_c_of     5.162k i/100ms
     only_one_c_adj?     4.494k i/100ms
-------------------------------------------------
      score_empty_cp     47.997k (± 14.0%) i/s -      1.397M
      neighbour_c_of      5.061M (± 11.7%) i/s -    148.648M
      find_cand_c_of     65.009k (± 6.7%) i/s -      1.941M
     only_one_c_adj?     60.725k (± 17.1%) i/s -      1.726M

It looks like neighbour_colors_of does very well, and find_candidate_color and only_one_color_adjacent? are the problems. Both of these do use block_given? (https://github.com/jruby/jruby/blob/e139d9ce7a975ff94c81d1d4c77e68b337f9facc/truffle/src/main/ruby/core/rubinius/common/enumerable.rb#L362-362) - that is likely to be the key problem which I'll look at next.

Along the way I fixed another bug, in e139d9c.


PragTob commented Sep 29, 2015

Thank you very much for the running journal @chrisseaton - it's a very interesting read for me. Thanks for all the hard work 👍

@chrisseaton

Benchmarking block_given? makes the problem clear:

def foo
  if block_given?
    yield 14, 2
  else
    14 + 2
  end
end

Benchmark.ips do |benchmark|
  benchmark.config time: 30, warmup: 60

  benchmark.report 'b_g? with block' do
    foo do |a, b|
      a + b
    end
  end

  benchmark.report 'b_g? without block' do
    foo
  end

end
$ bundle exec ruby benchmark/block_given.rb
Calculating -------------------------------------
     b_g? with block    91.714k i/100ms
  b_g? without block    96.982k i/100ms
-------------------------------------------------
     b_g? with block      4.214M (± 7.2%) i/s -    125.648M
  b_g? without block      4.745M (± 6.6%) i/s -    141.497M

$ jruby+truffle --graal-path ../../graal/graalvm-jdk1.8.0/bin/java run --graal -J-G:+TruffleCompilationExceptionsAreFatal -J-G:-GraphPE benchmark/block_given.rb
Calculating -------------------------------------
     b_g? with block     4.965k i/100ms
  b_g? without block     5.268k i/100ms
-------------------------------------------------
     b_g? with block     83.177k (± 7.6%) i/s -      2.478M
  b_g? without block     71.469k (± 6.1%) i/s -      2.134M

@chrisseaton

I've optimised block_given? in 439da5d. I'm not entirely sure of the performance implications of it on the rest of the system yet so it's on a branch.

Repeating that same benchmark:

$ jruby+truffle --graal-path ../../graal/graalvm-jdk1.8.0/bin/java run --graal -J-G:+TruffleCompilationExceptionsAreFatal -J-G:-GraphPE benchmark/block_given.rb 
Calculating -------------------------------------
     b_g? with block   132.008k i/100ms
  b_g? without block    61.858k i/100ms
-------------------------------------------------
     b_g? with block    125.413M (± 15.9%) i/s -      3.518B
  b_g? without block     61.834M (± 1.4%) i/s -      1.838B

In the best case, that's 1700x faster than before, and now 26x faster than MRI, which is much better.

It's also orders of magnitude faster than both Rubinius and JRuby in the base cases.

$ ~/.rbenv/versions/rbx-2.5.8/bin/ruby -I lib/rubykon benchmark/block_given.rb 
Calculating -------------------------------------
     b_g? with block   305.758k i/100ms
  b_g? without block   389.074k i/100ms
-------------------------------------------------
     b_g? with block      6.572M (± 4.4%) i/s -    196.602M
  b_g? without block      6.725M (± 4.7%) i/s -    201.151M

$ ~/.rbenv/versions/jruby-9.0.1.0/bin/ruby -I lib/rubykon benchmark/block_given.rb 
Calculating -------------------------------------
     b_g? with block   109.450k i/100ms
  b_g? without block   152.037k i/100ms
-------------------------------------------------
     b_g? with block      6.113M (± 7.9%) i/s -    181.796M
  b_g? without block     10.617M (± 8.8%) i/s -    315.021M

Even better than that, block_given? is now a constant in compiled code when in the best conditions, as tested by 439da5d#diff-6e076f3b02163350062f3a29d86fb51bR17.

Let's talk in some depth about what block_given? does, why it was slow before, and why it's fast now.

block_given? tells you whether or not you received a block when you were called. You can use it to decide whether to yield, or perhaps to return an Enumerator instead. The complication is that block_given? is a method, not a keyword, so it really needs to consider whether the calling method received a block, rather than whether it itself received one. That means it's no longer just looking at its own encapsulated local state; it has to reach outside of its stack frame and look inside the stack frame of its caller.
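To make that concrete, here are the semantics in plain Ruby: block_given? answers for the enclosing method, so Kernel#block_given?, being itself a method, has to look one frame up to find the answer.

```ruby
# block_given? reports whether the method it is written in received a
# block, i.e. it must inspect its caller's frame, not its own.
def maybe_yield
  if block_given?
    yield 2, 3
  else
    :no_block
  end
end

with_block    = maybe_yield { |a, b| a + b }  # => 5
without_block = maybe_yield                   # => :no_block
```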

In MRI and implementations like JRuby's interpreter mode this is easy, as stack frames are a linked list of heap-allocated objects, and the block of the caller frame is easy to access.

In an optimising implementation it becomes much more complicated, because such implementations try to use the native stack to store Ruby stack frames. It can be easy to put things on the native stack, but languages like C, C++ and Java don't provide any easy way to read non-local stack frames, so it's hard to get things back off it unless you're in the same method.

In our previous implementation of block_given? we used an API that Truffle provides that does allow us to read the caller frame, but this is very slow as it's more of a debug operation rather than something designed for core functionality.

Our new implementation of block_given? adds an option for methods which need the caller frame to receive a materialised version of that frame, as if it was an extra argument. This means a version of the frame that has been copied off the native stack onto the heap. That sounds slow, as we're now doing heap allocation, but we couple this with a new strong hint to inline a method. If block_given? is inlined then the materialised frame will be de-materialised by our partial escape analysis like any other object, and the heap allocation will be removed. This will reduce block_given? to just a local variable read.

You may be surprised that we don't already have a hint to inline, but that's because in most cases we can just let Truffle do the right thing for us. Here we know better, so I've added that hint.

I'll continue looking at other parts of your benchmark tomorrow, but I'm hopeful this and maybe a few other little fixes will make the whole thing fast.

@chrisseaton

This solved the problem in the scoring micro-benchmarks.

$ bundle exec ruby benchmark/scoring_micros.rb
Calculating -------------------------------------
      score_empty_cp    14.099k i/100ms
      neighbour_c_of    18.790k i/100ms
      find_cand_c_of    62.174k i/100ms
     only_one_c_adj?    60.057k i/100ms
-------------------------------------------------
      score_empty_cp    183.590k (± 6.7%) i/s -      5.485M
      neighbour_c_of    262.356k (± 5.9%) i/s -      7.854M
      find_cand_c_of      1.350M (± 7.0%) i/s -     40.351M
     only_one_c_adj?      1.303M (± 6.5%) i/s -     38.917M

$ jruby+truffle --graal-path ../../graal/graalvm-jdk1.8.0/bin/java run --graal -J-G:+TruffleCompilationExceptionsAreFatal -J-G:-GraphPE benchmark/scoring_micros.rb
Calculating -------------------------------------
      score_empty_cp    46.344k i/100ms
      neighbour_c_of    52.022k i/100ms
      find_cand_c_of    59.342k i/100ms
     only_one_c_adj?    44.785k i/100ms
-------------------------------------------------
      score_empty_cp      4.200M (± 15.4%) i/s -    119.660M
      neighbour_c_of      5.634M (± 13.0%) i/s -    165.326M
      find_cand_c_of     29.485M (± 24.1%) i/s -    837.138M
     only_one_c_adj?      3.014M (± 14.6%) i/s -     88.271M

And in the scoring benchmarks as a whole.

$ bundle exec ruby benchmark/scoring.rb
Calculating -------------------------------------
         9x9 scoring   970.000  i/100ms
       13x13 scoring   647.000  i/100ms
       19x19 scoring   235.000  i/100ms
-------------------------------------------------
         9x9 scoring     10.286k (± 5.8%) i/s -    307.490k
       13x13 scoring      6.681k (± 6.9%) i/s -    199.923k
       19x19 scoring      2.544k (± 5.1%) i/s -     76.140k

$ jruby+truffle --graal-path ../../graal/graalvm-jdk1.8.0/bin/java run --graal -J-G:+TruffleCompilationExceptionsAreFatal -J-G:-GraphPE benchmark/scoring.rb
Calculating -------------------------------------
         9x9 scoring     3.872k i/100ms
       13x13 scoring     3.268k i/100ms
       19x19 scoring     1.716k i/100ms
-------------------------------------------------
         9x9 scoring     68.595k (± 10.6%) i/s -      2.025M
       13x13 scoring     36.216k (± 3.3%) i/s -      1.085M
       19x19 scoring     19.130k (± 4.4%) i/s -    573.144k

We're now 7.5x faster than MRI at scoring.

I'll look at the calls that go into the playout benchmarks next and carefully check all the parts of it in the same way. It looks like scoring should be a fairly small component of the total runtime for the overall benchmark, so it's important to check the rest just as carefully.

@chrisseaton

@PragTob you couldn't help me by writing some more micro-benchmarks for parts of playout, could you? I have much of it as micro-benchmarks, but Group.join_group_of_friendly_stones, Group.add_liberties and Group.take_liberties_of_enemies I think need some kind of state that I'm not getting right. They also look like they make up a sizeable proportion of the total time.

I've got placeholders for them in my fork - could you pull that and fix? But you may not want to pull this into your master as it has some random other changes.

An important point is we need to reset the state between each iteration - so the benchmark should not create larger and larger groups each iteration for example.

https://github.com/chrisseaton/rubykon/blob/c8fd67c6b383516324ff7b339a4e8023e7a0b1f3/benchmark/playout_micros.rb#L105-L113

@chrisseaton

Actually, a few of these micro-benchmarks have issues. Here is the current graph of speedup relative to MRI - there is still other work to do.

[image: graph of speedup relative to MRI]

@chrisseaton

Hmm... I think we're interacting badly with benchmark-ips here. When I set up a simple case like:

require_relative '../lib/rubykon'

class Rubykon::RandomPlayout
  public :generate_random_move
end

game = Rubykon::Game.new 19
board = game.board
playout = Rubykon::RandomPlayout.new

loop do
  start = Time.now
  10_000.times do
    playout.generate_random_move(game)
  end
  puts Time.now - start
end

We are then 10x faster than MRI, whereas with benchmark-ips we show as 14x slower.


PragTob commented Sep 30, 2015

@chrisseaton thanks for all the work. I'm back home from work, I'll go through emails and stuff and then give the benchmarks a look. :)


PragTob commented Sep 30, 2015

@chrisseaton added the missing methods over in chrisseaton/rubykon#1 - hope it helps. FYI, I have work on a branch that makes the Stone class disappear (and that makes this sort of setup work a bit easier, as the board and stone representation is way more lightweight). It doesn't have a lot of impact on perf in benchmark-ips (as it runs the GC in between test runs, I believe). I'm also moving forward to not-so-silly move generation. But that will land on Friday/Saturday at the earliest (tomorrow is Ruby User Group Berlin time!)

Thanks for your great work on this :)

@brauliobo

@chrisseaton any clue why b_g? without block is about half the speed of b_g? with block?

@thedarkone

@chrisseaton any clue why b_g? without block is about half the speed of b_g? with block?

Yeah, I was curious as well; it is really counterintuitive that a block-less call would be slower. On the face of it, going through a block should surely be more work.

@chrisseaton

Re: block_given? - I'm not sure yet. It's on my todo list to figure out. I would expect the performance to be exactly the same, as we should definitely be able to inline and remove that block, so the non-block case being slower is very weird. I need to look at the IR to see what is going on.

@chrisseaton

Ok, let’s resolve the issue of block_given?, and why it looks faster with a block than without. I have a hunch that benchmark-ips is giving us misleading data. I’ll explain why later on, but we can start off using some other tools.

I have my own benchmark harness, developed as part of my doctoral studies over the last few years, bench9000. It's different in many ways from benchmark-ips, such as running each benchmark in a new VM and not trying to automagically guess the scale of the benchmark. I'll discuss these differences later.

Using my benchmark harness it looks like there is no statistically significant difference between the performance of block_given? with and without a block (the error bars overlap).

[image: bench9000 results for block_given? with and without a block, with overlapping error bars]

Benchmarking is always going to be non-deterministic, as it involves measuring a wall clock. We can get more certainty by looking at what the compiler is outputting, rather than how fast it runs when we measure it.

Graal has a feature where it can show you the graph it uses to compile Truffle programs (and Java programs). I looked at the graph at the latest possible moment, after all the optimisations were applied and all the abstractions of Ruby were hopefully removed, right before machine instructions are emitted.

def foo
  if block_given?
    yield 14, 2
  else
    14 + 2
  end
end

loop do
  foo
end
$ jt run --igv --full test.rb

The graph for block_given? without a block looks like this:

[image: compiler graph for block_given? without a block]

All that is happening here is that the receiver (the main object in this case) is loaded and checked against the version that the method expects. If the receiver wasn't what was expected, the method that foo dispatches to might be different, so we must check this (we call a check like that a 'guard'). That's the If at 1952. If that test fails, we deoptimise. If the test passes (it will in this benchmark), then we return the constant value 16. So that's just a couple of instructions.

Let’s compare this to the case where we give block_given? a block. This involves a Ruby Proc for the block, yielding, arguments and an extra method, so lots of extra abstraction, but as everything is constant that should hopefully all compile away. This is the graph:

[image: compiler graph for block_given? with a block]

I've had to zoom out because it's larger, and it looks much more complicated, but follow the path that is taken if everything is as expected. It's a null check with two ifs based on it, and then the same receiver check, but the return is the same: the constant value 16, without any calculation needed beyond those null checks. So there isn't really much more there when the program is running as expected - maybe an extra test or two, which will be accurately predicted.

We can also look at the machine code that comes out of Graal. That’s probably going a bit far, but I did look at the size of the generated code and it’s 140 bytes without a block, and 227 with.

$ jt run --graal -J-Djvmci.option.TraceTruffleCompilation=true test.rb
[truffle] opt done         block in loop:test.rb:9 <opt> <split-0-U>                   |ASTSize      11/   43 |Time   339( 291+48  )ms |DirectCallNodes I    3/D    0 |GraalNodes    44/   16 |CodeSize          140 |Source      test.rb:9
[truffle] opt done         block in loop:test.rb:9 <opt> <split-0-U>                   |ASTSize      14/   82 |Time   479( 429+51  )ms |DirectCallNodes I    4/D    0 |GraalNodes    86/   52 |CodeSize          227 |Source      test.rb:9

All that together makes me confident that block_given? performs the same with and without a block. So why did my benchmark-ips benchmark give such a misleading result?

As an aside, this shows you how meaningless a reported 'error' can be. benchmark-ips told us it was confident to within 1% that the true speed of block_given? without a block was half that with a block. That's slightly unfair of me, as the report really has the implicit caveats of 'with the code you gave me, as I ran it within my harness', but you get the idea.

There are aspects of the design of benchmark-ips, and the benchmark that I wrote, that interact badly with the optimisations JRuby+Truffle applies.

First of all, benchmark-ips runs each benchmark in the same VM. That means the order of the benchmarks matters: when you test benchmarks A and then B, you are really testing A independently, and then B given that A has already run.

You can see the impact this has on my benchmark in the method which calls block_given?:

def call_bg
  if block_given?
    yield 14, 2
  else
    14 + 2
  end
end

The first benchmark runs with a block. JRuby+Truffle monitors the if statement in this method and sees that only the first branch is ever taken. Therefore it compiles something a bit like this, deoptimising if a block wasn't given, since the code it compiled would then no longer be valid:

def call_bg
  deoptimise! unless block_given?
  yield 14, 2
end

When the second benchmark runs without a block, that deoptimise is taken, and JRuby+Truffle now knows that both branches can be taken, so it compiles the full method, which is slower. This can be fixed by defining two versions of the call_bg method, but if you are running lots of library code that’s really hard to do. This is why I always benchmark with a fresh VM for each benchmark.
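A sketch of the two-copies workaround just mentioned (the names here are illustrative): each benchmark gets its own copy of the method, so each copy's branch profile stays monomorphic and neither benchmark invalidates the other's compiled code.

```ruby
# Each copy only ever observes one branch of the if, so compiling one
# benchmark cannot trigger a deoptimisation in the other.
def call_bg_with_block
  block_given? ? yield(14, 2) : 14 + 2
end

def call_bg_without_block
  block_given? ? yield(14, 2) : 14 + 2
end

a = call_bg_with_block { |x, y| x + y }  # only the block branch profiled
b = call_bg_without_block                # only the no-block branch
```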

There are two other flaws in benchmark-ips which cause problems. The way that benchmark-ips works is in two phases. A warmup phase allows time for a JIT or other optimiser to run, and also works out how fast the code is so that the later timing loop can scale the number of iterations it runs so it doesn’t check the clock too often (we don’t want to spend all our time checking the clock as we’re not interested in how fast that is). Then the timing phase runs, with this detected number of iterations.

The first problem with this is that all the iterations from the warmup phase are used to determine how fast the benchmark probably runs at full speed. That doesn’t make any sense, as it will include the non-optimised cold code iterations. They aren't relevant to how fast the optimised code runs.

A related issue is that for very small benchmarks like block_given?, timing a single iteration is never going to give any useful information. If block_given? takes a couple of instructions, checking the clock is going to take a context switch and thousands of instructions and may trash your cache.
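One standard mitigation (a sketch, not what benchmark-ips actually does) is to time a batch of iterations and divide, so the cost of reading the clock is amortised over many calls:

```ruby
# Time a batch of iterations so the clock read is amortised over many
# calls, instead of dominating the measurement of a tiny operation.
def time_per_iteration(batch_size)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  batch_size.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  elapsed / batch_size
end

t = time_per_iteration(1_000_000) { 14 + 2 }
```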

The second problem with this is that during warmup the benchmark is run with just one iteration each time. When timing starts it uses the number of iterations it worked out during warmup. JRuby+Truffle applies an optimisation to method parameters called value profiling. It keeps track of what argument values it has seen for each parameter, and if it only ever sees one value it turns that value into a constant. This is a very powerful optimisation that allows you to add extra abstraction to your Ruby program, such as configuration parameters, but not pay for them unless they actually change. This interacts badly with benchmark-ips: during warmup the iteration count is always one, so JRuby+Truffle profiles it as a constant. As soon as the timing run starts the number changes for the first time. It’s like the goal posts moving as soon as the game starts, and JRuby+Truffle immediately deoptimises right as timing starts!

We can confirm this by running JRuby+Truffle with options to print information about when a method is compiled and when it is deoptimised. The full output is verbose but it’s in this gist. For example look at line 111 where it starts to time the second benchmark. Immediately you can see a deoptimisation caused by a value profile cache miss.

So what is the solution to this? I think benchmark-ips is just a flawed approach to benchmarking. I think it’s fundamentally not correct to test benchmarks sequentially in the same VM, even though it is certainly convenient. I also think it’s a great idea to try to automatically scale benchmarks, but I tried this many times during my PhD and didn’t manage to ever really work it out. The people who have spent the most time thinking about this still recommend a manual approach. I don’t think benchmark-ips has solved this. In my opinion the only correct way to benchmark is like this:

loop do
  start = Time.now
  my_benchmark # should take around a second on the best implementation
  puts Time.now - start
end

Run that for around 60s, then sample for another 60s and summarise with a mean and standard error (or even better a bootstrap confidence interval).

This is basically what bench9000 does, but with a little stuff to make it easier to use. You still need to manually scale the benchmark, and there are other issues about whether or not you want to stop things optimising away, and if you do how do you manage that. Benchmarking is a bottomless problem.
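Summarising the per-iteration times as suggested above might look like this sketch (a plain mean and standard error, assuming the samples are the measured times in seconds):

```ruby
# Collect one timing sample per run of the benchmark, then summarise
# with the sample mean and the standard error of that mean.
def summarise(samples)
  mean = samples.sum / samples.size.to_f
  variance = samples.sum { |s| (s - mean)**2 } / (samples.size - 1).to_f
  standard_error = Math.sqrt(variance / samples.size)
  [mean, standard_error]
end

mean, err = summarise([1.02, 0.98, 1.01, 0.99, 1.00])
```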

I tried modifying benchmark-ips, to run warmup several times. First it runs the benchmark with one, two and then three iterations to stop JRuby+Truffle value profiling. Then it runs to just let the JIT work, then it runs with one iteration to get an initial guess for scale. Then it runs again with this scale to get a second, more accurate scale. Then it runs for a while, with the final scale, before it starts timing. Obviously that’s really slow, but it does solve the problem:
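As a rough sketch of that idea (not the actual shim, and with made-up phase lengths), the warmup could vary the iteration count before settling on a final scale:

```ruby
# Sketch of the multi-phase warmup described above (phase lengths are
# illustrative). Varying the iteration count early stops a JIT from
# profiling it as a constant; later phases refine the scale estimate.
def warm_up_and_scale(target_time)
  # Phase 1: run with one, two and then three iterations so the
  # iteration count is never seen as a single constant value.
  [1, 2, 3].each { |n| n.times { yield } }
  # Phase 2: plain warmup to give the JIT time to compile.
  10_000.times { yield }
  # Phase 3: first rough guess at how many iterations fill target_time.
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  per_iter = [Process.clock_gettime(Process::CLOCK_MONOTONIC) - start, 1e-9].max
  scale = [(target_time / per_iter).to_i, 1].max
  # Phase 4: rerun at that scale for a more accurate estimate.
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  scale.times { yield }
  per_iter = [(Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) / scale, 1e-9].max
  [(target_time / per_iter).to_i, 1].max
end

scale = warm_up_and_scale(0.01) { 14 + 2 }
```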

Calculating -------------------------------------
     b_g? with block    13.641M i/100ms
  b_g? without block    13.352M i/100ms
-------------------------------------------------
     b_g? with block    146.064M (± 2.0%) i/s -      4.379B
  b_g? without block    152.336M (± 1.8%) i/s -      4.566B

I put this into a shim file which you can require after requiring benchmark/ips.

So that’s why block_given? looked different with and without a block, and what needs to be done to make that clear in benchmarks. I feel a bit pained criticising a third-party tool, as my philosophy is always, as far as possible, that Ruby code is what it is and I want to run it and make it fast however it comes. It doesn’t look good blaming a benchmark harness for making you look slow, but hopefully I’ve explained why in enough depth.

I’m now going to use these modifications to benchmark-ips to look at the new playout micro benchmarks.

@brauliobo

@chrisseaton wow, thanks for this insightful explanation. don't you think you should open a PR to benchmark-ips?

@chrisseaton
Contributor

Maybe when I've finished this performance investigation. I really don't want to have to patch or send PRs to any third party software though - but not sure what other options are.

@headius
Member

headius commented Oct 4, 2015

I consider benchmark-ips to be a good solution when the interaction between the benchmarks is either minimal or interesting. In this case, it is interesting but not minimal, and the order the benchmarks run is a problem.

This does not invalidate benchmark-ips for benchmarking. I usually run benchmark-ips in a loop, not relying on it to settle during the first run. And generally when there are interactions between the benchmarks, I find that interesting because a typical application will not just test e.g. block_given? in a loop.

In this case, the benchmark shows there are important interactions between the two paths. A method with block_given? will generally have two sets of behavior. The presence of block_given? almost always means the same method gets called with AND without a block, and it would be important for an implementation to handle both cases well. If the truffle optimizations of block_given? only work well when the method is ONLY called with a block or ONLY called without it, that would be bad.

So, while I agree benchmark-ips gives misleading results in cases where the benchmarks have interactions, it does demonstrate that those interactions exist. You can't show that with isolated VM benchmarks (unless they're written specifically to do so).

@chrisseaton
Contributor

Those are valid points. Running the whole thing in a loop makes sense. And some of these problems do not apply when you're benchmarking something more substantial than block_given?.

I would just make it clear that our optimisation of block_given? doesn't only work well when only one branch is taken. It's even constant in both cases, as long as the call site block (or the absence of one) is constant. It's just slower when it has to handle both cases (which makes sense). And each call site of block_given? optimises the branches from scratch (via our 'splitting' technique, which copies the method), so having two different call sites also fixes that problem.

@chrisseaton
Contributor

Fixed a few more things:

  • We were missing some 'explode' annotations on the hash literal 43278c6, which means expressions in it were running in a loop, rather than a sequence of instructions
  • We were using an old shortcut for calling .dup.freeze on string keys in a non-identity hash map dffe5aa
  • The same thing for .frozen? when setting the default value in a hash bbfadac

Fixing these means we're faster than MRI on all the micro benchmarks, when running with our shim.

We're also faster than all others on the overall 19x19 board full playout and score! But only when running with our shim and with a 1G heap instead of the default 500MB. You won't see the same results running in unmodified benchmark-ips. This is something we need to fix and we'll be working on over the next year.

We should try with your new branch, as at the moment I think we're limited in how fast we can go by allocation. The Board#each I think is a real killer.

I'll mark this as closed for now, as the only remaining issue (warmup time) is a more general issue we know about.

MRI:

9x9 full playout (+ score)
                        144.900  (±10.4%) i/s -      4.312k
13x13 full playout (+ score)
                         69.123  (±10.1%) i/s -      2.052k
19x19 full playout (+ score)
                         29.631  (±13.5%) i/s -    874.000

Rubinius:

9x9 full playout (+ score)
                        128.682  (± 9.3%) i/s -      3.835k
13x13 full playout (+ score)
                         59.765  (±10.0%) i/s -      1.775k
19x19 full playout (+ score)
                         27.298  (±11.0%) i/s -    810.000 

JRuby:

9x9 full playout (+ score)
                        201.599  (± 8.4%) i/s -      6.020k
13x13 full playout (+ score)
                         93.030  (± 9.7%) i/s -      2.768k
19x19 full playout (+ score)
                         40.283  (±12.4%) i/s -      1.194k

JRuby+Truffle

9x9 full playout (+ score)
                        501.671  (± 13.4%) i/s -     14.600k
13x13 full playout (+ score)
                        226.016  (± 11.9%) i/s -      6.680k
19x19 full playout (+ score)
                        105.044  (± 11.4%) i/s -      3.105k

@PragTob
Author

PragTob commented Oct 7, 2015

@chrisseaton thanks so much, awesome work!

By your shim you mean "jruby+truffle" right?

Can I see your patch to benchmark-ips somewhere? Would like to try that out. Also for the real overall benchmark I was looking to write a little benchmarking tool that just called a block n times (with some for warmup) and then get the average execution time out of it, so maybe there won't be any bad interactions there. Sounds like a good idea?

The new branch/representation is on master along with some other performance fixes (for me CRuby is ~60% faster there). By accident I broke the benchmarks in the latest thing I did though... plan is to fix them up, port the micro benchmarks from your fork over and use the truffle tool for things and then get back to you (could take me until next week, could be earlier). Sounds good to you? :)

Thanks a lot, great work!

@chrisseaton
Contributor

By shim I mean this file, but 'jruby+truffle' is also useful for handling the gems and things. Require it after benchmark-ips. It does make benchmarking take a long time, but it fixes the issues I saw with benchmark-ips.

Yes your plan sounds good.
