[Truffle] Unexpectedly very slow? #3355
This looks like it could be a fantastic little benchmark, so I'd be very happy to take a look at it in Truffle and figure out why it's slow. Thanks for letting us know.
First of all, I want to run your specs in JRuby+Truffle, as there's no use benchmarking something that is broken. You are right that we don't support RubyGems very well yet, but we have a tool to help work around this. If you get the source code for JRuby, you can run a command to install a small tool. This gets normal JRuby to help us install the gems and patch a couple of things we don't run correctly yet in common gems. You can then run your specs.
That's a very good start. We can run all of your specs, off-the-shelf. Tomorrow I'll start to look at why it isn't very fast. I don't think we have…
👍 thanks @chrisseaton, that's some very useful information. Maybe it should also go into the docs? :)
I'll keep a running log here of what I'm doing. Maybe it will turn into an interesting case study. The first thing I've done is improve our random number generator primitives, which your benchmark uses. These are used by the Rubinius code we're re-using, and are implemented using the same backend as JRuby. Before, they weren't quite right for compilation and weren't producing the same numbers as other implementations for the same seed. I've fixed both of those issues in 88c38ca80c021566bf98e80633951670312478f5.
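As a concrete illustration of the contract being restored (my example, not code from the thread): seeding Ruby's random number generator must produce the same sequence every time, and ideally the same sequence across implementations.

```ruby
# Illustration (not from the issue): the property the fix restores is that
# a given seed always produces the same sequence of numbers.
a = Random.new(42)
b = Random.new(42)
first  = Array.new(5) { a.rand(100) }
second = Array.new(5) { b.rand(100) }
raise 'seeded RNG is not deterministic' unless first == second
```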
Then I added an option to our…
I started with the simpler of the benchmarks…
Using this I spotted a couple of errors. First, our code for reading the time zone, which is used in the…
Secondly, some allocations were using a 'shape' for objects that was not constant and so couldn't be compiled efficiently.
These problems are fixed in db649ae. We're still slower than MRI though, so the investigation continues.
My next step is to break down the code path into some micro-benchmarks and test them individually until we find the problem. My first suspicion was jruby/truffle/src/main/ruby/core/rubinius/common/array.rb, lines 1963 to 1998 at 89ee83c. `block_given?` is the main problem (#3322), as this currently causes frames to be materialised (turned into heap objects so they can be accessed by the `block_given?` method), and I thought this method would be using that somehow. However, it turns out this method is fine.
```ruby
board_19 = Rubykon::Board.new 19

benchmark.report '19x19 each' do
  board_19.each do |stone|
  end
end
```
There we are running at 92k, compared to JRuby at 18k and MRI at 24k. So that is a good sign - we are running parts of this faster than anyone else. Just looking for the problem still...
Within that loop the only thing that looks immediately interesting is…

```ruby
Benchmark.ips do |benchmark|
  benchmark.config time: 30, warmup: 60

  board = Rubykon::Board.new 19
  cutting_point = board[8, 8]
  scorer = Rubykon::GameScorer.new

  benchmark.report 'score_empty_cp' do
    game_score = {Rubykon::Board::BLACK_COLOR => 0, Rubykon::Board::WHITE_COLOR => Rubykon::Game::DEFAULT_KOMI}
    scorer.score_empty_cutting_point(cutting_point, board, game_score)
  end

  benchmark.report 'neighbour_c_of' do
    board.neighbour_colors_of(cutting_point.x, cutting_point.y)
  end

  neighbour_colors = board.neighbour_colors_of(cutting_point.x, cutting_point.y)

  benchmark.report 'find_cand_c_of' do
    scorer.find_candidate_color(neighbour_colors)
  end

  candidate_color = scorer.find_candidate_color(neighbour_colors)

  benchmark.report 'only_one_c_adj?' do
    scorer.only_one_color_adjacent?(neighbour_colors, candidate_color)
  end
end
```
It looks like… Along the way I fixed another bug, in e139d9c.
Thank you very much for the running journal @chrisseaton - it's a very interesting read for me. Thanks for all the hard work 👍
Benchmarking:

```ruby
def foo
  if block_given?
    yield 14, 2
  else
    14 + 2
  end
end

Benchmark.ips do |benchmark|
  benchmark.config time: 30, warmup: 60

  benchmark.report 'b_g? with block' do
    foo do |a, b|
      a + b
    end
  end

  benchmark.report 'b_g? without block' do
    foo
  end
end
```
I've optimised `block_given?`. Repeating that same benchmark:
In the best case, that's 1700x faster than before, and now 26x faster than MRI, which is much better. It's also orders of magnitude faster than both Rubinius and JRuby in the base cases.
Even better than that… Let's talk in some depth about what `block_given?` involves.

In MRI, and in implementations like JRuby's interpreter mode, this is easy, as stack frames are a linked list of heap-allocated objects, and the block of the caller frame is easy to access. In an optimising implementation it becomes much more complicated, because such implementations try to use the native stack to store Ruby stack frames. It can be easy to put things on the native stack, but languages like C, C++ and Java don't provide any easy way to read non-local stack frames, so it's hard to get things back off it unless you're in the same method.

In our previous implementation of `block_given?`… Our new implementation of `block_given?`… You may be surprised that we don't have a hint to inline already, but this is because in most cases we can just let Truffle do the right thing for us. Here we do know better, so I've added that hint.

I'll continue looking at other parts of your benchmark tomorrow, but I'm hopeful this and maybe a few other little fixes will make the whole thing fast.
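To make the semantics concrete (my sketch, not code from the thread): `block_given?` is itself a method, so its implementation has to look at the frame of the method that calls it to find out whether that method received a block. Blocks are not visible further down the call stack:

```ruby
# Sketch (not from the thread): block_given? reports on the method that is
# currently executing, which is why its implementation needs access to
# that caller's frame.
def takes_block
  block_given?
end

def forwards
  takes_block # the block passed to forwards is not passed on here
end

puts takes_block { :x } # true: takes_block itself received a block
puts forwards { :x }    # false: takes_block received no block
```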
This solved the problem in the scoring micro-benchmarks.
And in the scoring benchmarks as a whole.
We're now 7.5x faster than MRI at scoring. I'll look at the calls that go into the playout benchmarks next and carefully check all the parts of it in the same way. It looks like scoring should be a fairly small component of the total runtime for the overall benchmark, so it's important to check the rest just as carefully.
@PragTob you couldn't help me by writing some more micro benchmarks for parts of playout, could you? I have much of it as micro benchmarks, but I've got placeholders for them in my fork - could you pull that and fix them? But you may not want to pull this into your master, as it has some random other changes. An important point is that we need to reset the state between each iteration - so the benchmark should not, for example, create larger and larger groups each iteration.
Hmm... I think we're interacting badly with `benchmark-ips`…

```ruby
require_relative '../lib/rubykon'

class Rubykon::RandomPlayout
  public :generate_random_move
end

game = Rubykon::Game.new 19
board = game.board
playout = Rubykon::RandomPlayout.new

loop do
  start = Time.now
  10_000.times do
    playout.generate_random_move(game)
  end
  puts Time.now - start
end
```

We are then 10x faster than MRI, when with `benchmark-ips`…
@chrisseaton thanks for all the work. I'm back home from work, I'll go through emails and stuff and then give the benchmarks a look. :) |
@chrisseaton added the missing methods over in chrisseaton/rubykon#1 - hope it helps. FYI I have work on a branch that makes the Stone class disappear (and that makes that sort of setup work a bit easier, as the board and stone representation is way more lightweight). It doesn't have a lot of impact on perf in benchmark-ips (as it runs the GC in between test runs, I believe). Also moving forward to not-so-silly move generation. But that will land on Friday/Saturday at the earliest (tomorrow is Ruby User Group Berlin time!). Thanks for your great work on this :)
@chrisseaton any clue why that is?
Yeah, I was curious as well, it is really counter-intuitive that a block-less call would be slower. On the face of it, going through a block should surely be more work.
Ok, let's resolve the question of the block-less call appearing slower.

I have my own benchmark harness, developed as part of my doctoral studies over the last few years. Using my benchmark harness, it looks like there is no statistically significant difference between the performance of the two cases.

Benchmarking is always going to be non-deterministic as it involves measuring a wall clock. We can get more certainty by looking at what the compiler is outputting, rather than how fast it runs when we measure it. Graal has a feature where it can show you the graph it uses to compile Truffle programs (and Java programs). I looked at the graph at the latest possible moment, after all the optimisations were applied and all the abstractions of Ruby were hopefully removed, right before machine instructions are emitted.

```ruby
def foo
  if block_given?
    yield 14, 2
  else
    14 + 2
  end
end

loop do
  foo
end
```
The graph for the version without a block is simple. All that is happening here is that the receiver (the main object in this case) is loaded and checked against the version that the method expects. If the receiver wasn't what was expected, the method deoptimises.

Let's compare this to the case where we give `foo` a block. I've had to zoom out because it's larger, and it looks much more complicated, but follow the path that is taken if everything is as expected. It's a null check with two ifs based on it, and then the same receiver check, but then the return is the same, returning the constant value 16 without any calculation needed beyond those null checks. So there isn't really much more there when the program is running as expected - maybe an extra test or two, which will be accurately predicted.

We can also look at the machine code that comes out of Graal. That's probably going a bit far, but I did look at the size of the generated code: it's 140 bytes without a block, and 227 with.
All that together makes me confident that there is no real difference between the two cases.

As an aside - this shows you how meaningless a reported 'error' can be.

There are aspects of the design of `benchmark-ips` that work against us here, though. First of all, the benchmarks run one after the other in the same process. You can see the impact that this has on my benchmark in the method which calls `block_given?`:

```ruby
def call_bg
  if block_given?
    yield 14, 2
  else
    14 + 2
  end
end
```
end The first benchmark runs with a block. JRuby+Truffle monitors the def foo
deoptimise! unless block_given?
yield 14, 2
end When the second benchmark runs without a block, that deoptimise is taken, and JRuby+Truffle now knows that both branches can be taken, so it compiles the full method, which is slower. This can be fixed by defining two versions of the There are two other flaws in The first problem with this is that all the iterations from the warmup phase are used to determine how fast the benchmark probably runs at full speed. That doesn’t make any sense, as it will include the non-optimised cold code iterations. They aren't relevant to how fast the optimised code runs. A related issue is that for very small benchmarks like The second problem with this is that during warmup the benchmark is run with just one iteration each time. When timing starts it uses the number of iterations it worked out during warmup. JRuby+Truffle applies an optimisation to method parameters called value profiling. It keeps track of what argument values it has seen for each parameter, and if it only ever sees one value it turns that value into a constant. This is a very powerful optimisation that allows you to add extra abstraction to your Ruby program such configuration parameters, but not pay of them unless they actually change. This interacts badly with We can confirm this by running JRuby+Truffle with options to print information about when a method is compiled and when it is deoptimised. The full output is verbose but it’s in this gist. For example look at line 111 where it starts to time the second benchmark. Immediately you can see a deoptimisation caused by a value profile cache miss. So what is the solution to this? I think loop do
start = Time.now
my_benchmark # should take around a second on the best implementation
puts Time.now - start
end Run that for around 60s, then sample for another 60s and summarise with a mean and standard error (or even better a bootstrap confidence interval). This is basically what I tried modifying
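A runnable sketch of that kind of harness (the method name, structure and the shortened warmup/sample lengths here are mine; the thread suggests around 60s for each phase):

```ruby
# Sketch of the loop-style harness described above: discard a warmup
# period, then sample iteration times and report mean and standard error.
def measure(warmup:, sample:)
  deadline = Time.now + warmup
  yield while Time.now < deadline # warmup iterations are discarded

  times = []
  deadline = Time.now + sample
  while Time.now < deadline # sampling phase
    start = Time.now
    yield
    times << (Time.now - start)
  end

  mean = times.sum / times.size
  variance = times.sum { |t| (t - mean)**2 } / times.size
  [mean, Math.sqrt(variance / times.size)] # mean and standard error
end

mean, stderr = measure(warmup: 0.1, sample: 0.1) { 1_000.times { |i| i * i } }
puts format('%.6fs +/- %.6fs', mean, stderr)
```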
I put this into a shim file which you can require after requiring `benchmark-ips`. So that's why I'm now going to use these modifications to `benchmark-ips` for the rest of this investigation.
@chrisseaton wow, thanks for this insightful explanation. Don't you think you should open a PR to `benchmark-ips`?
Maybe when I've finished this performance investigation. I really don't want to have to patch or send PRs to any third-party software though - but I'm not sure what other options there are.
I consider benchmark-ips to be a good solution when the interaction between the benchmarks is either minimal or interesting. In this case, it is interesting but not minimal, and the order the benchmarks run in is a problem. This does not invalidate benchmark-ips for benchmarking. I usually run benchmark-ips in a loop, not relying on it to settle during the first run. And generally, when there are interactions between the benchmarks, I find that interesting, because a typical application will not just test e.g. …

In this case, the benchmark shows there are important interactions between the two paths. A method with `block_given?`…

So, while I agree benchmark-ips gives misleading results in cases where the benchmarks have interactions, it does demonstrate that those interactions exist. You can't show that with isolated VM benchmarks (unless they're written specifically to do so).
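For example (my illustration, not code from the thread), in a program where both call patterns are live in the same process, an implementation has to compile the method to handle both branches:

```ruby
# Illustration (not from the thread): both branches of foo are exercised
# in one process, so neither branch can be optimised away entirely.
def foo
  if block_given?
    yield 14, 2
  else
    14 + 2
  end
end

results = [foo, foo { |a, b| a + b }]
p results # both paths compute 16
```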
Those are valid points. Running the whole thing in a loop makes sense. And some of these problems do not apply when you're benchmarking something more substantial than… I would just make it clear that our optimisation of `block_given?`…
Fixed a few more things:
Fixing these means we're faster than MRI on all the micro benchmarks, when running with our shim. We're also faster than all the others on the overall 19x19 board full playout and score! But only when running with our shim and with a 1G heap instead of the default 500MB. You won't see the same results running in unmodified benchmark-ips. This is something we need to fix and will be working on over the next year. We should try with your new branch, as at the moment I think we're limited in how fast we can go by allocation. I'll mark this as closed for now, as the only remaining issue (warmup time) is a more general issue we know about.

MRI:

Rubinius:

JRuby:

JRuby+Truffle:
@chrisseaton thanks so much, awesome work! By your shim you mean "jruby+truffle", right? Can I see your patch to `benchmark-ips`? The new branch/representation is on master along with some other performance fixes (for me, CRuby is ~60% faster there). By accident I broke the benchmarks in the latest thing I did though... The plan is to fix them up, port the micro benchmarks from your fork over, use the truffle tool for things, and then get back to you (could take me until next week, could be earlier). Sounds good to you? :) Thanks a lot, great work!
By shim I mean this file, but 'jruby+truffle' is also useful for handling the gems and things. Require it after `benchmark-ips`. It does make benchmarking take a long time, but it fixes the issues I saw with `benchmark-ips`. Yes, your plan sounds good.
Hi there, thanks for all your work :)
I have a little pure Ruby project in the making (rubykon). It is CPU-bound and performance-critical, so I wanted to see Truffle/Graal shine. However, it is super slow for some reason. It is ~10 times slower than the slowest other Ruby implementation that I tested (1.9.3, that is), and almost 25 times slower than the fastest (JRuby 9k). You can see this file for benchmark results - the most important is the 19x19 full playout.
The benchmark uses 5 seconds of warmup and 20 seconds of execution time. It's all the same code (different input sizes) - so when it hits the last important benchmark it has already had 55 seconds of warmup. I also tried larger warmup times (20 and 30, I believe), but it didn't help, it only blew up the heap.
I use the graal nightly from today:
Also `-Xtruffle.printRuntime=true` does not seem to work for me.

Btw. the current implementation is very naive and will hopefully get faster soon as various aspects are optimized (fewer object allocations etc.), however this seemed very strange and I thought you might be interested. This version is tagged `0.1`.

Ah yes, benchmarking is done using the `benchmark-ips` gem, which I put in an ignored folder so that graal can load it without `gem install` :)

Hope this helps! :)
Tobi