
[Truffle] VM unexpectedly "cold" running the same methods (benchmarking) #3387

Closed

PragTob opened this issue Oct 11, 2015 · 4 comments


PragTob commented Oct 11, 2015

Me again ;)

Over at rubykon (currently still on a branch) I started writing a very simple benchmarking tool to benchmark the final use case, which is expected to take multiple seconds per run. Its code lives in lib/benchmark/avg. It runs the given block repeatedly until a time threshold is exceeded to warm up, and then runs it again until a second time threshold is reached (the "real" benchmark). I currently have debug output that prints the times of the individual samples, which made me notice that after the warmup Truffle is basically as slow as it was at the very beginning of the warmup.
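For readers who don't want to dig into the repo, a rough sketch of the scheme described above (my assumption from the description, not the actual lib/benchmark/avg code) looks something like this:

```ruby
# Assumed sketch: run the block repeatedly until a time budget is used
# up, recording one sample (the duration of one run) per iteration.
def run_phase(seconds, &block)
  samples = []
  deadline = Time.now + seconds
  while Time.now < deadline
    start = Time.now
    block.call
    samples << (Time.now - start)
  end
  samples
end

work = -> { 1_000.times { Math.sqrt(rand) } } # stand-in for the MCTS run
run_phase(0.05, &work)                        # warmup phase, samples discarded
samples = run_phase(0.05, &work)              # "real" benchmark samples
```

The same `run_phase` is used for both phases; only the time budget (and the banner printed between them) differs.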

This report is as of this commit

The behavior I'm seeing is the following:

tobi@tobi-desktop ~/github/rubykon $ jruby+truffle run --graal -J-Xmx1500m benchmark/mcts_avg.rb 
$ JAVACMD=../graalvm-jdk1.8.0/bin/java ../jruby/bin/jruby -X\+T -J-Xmx1500m -Xtruffle.core.load_path\=../jruby/truffle/src/main/ruby -r ./.jruby\+truffle_bundle/bundler/setup.rb benchmark/mcts_avg.rb
Warming up...
--------------------------------------------------------------------------------
[96.148, 16.649]
19x19 1_000 iterations        1.06 i/min  56.4 s (avg) (± 70.48%)

Running the benchmark...
--------------------------------------------------------------------------------
[95.89, 20.925]
19x19 1_000 iterations        1.03 i/min  58.41 s (avg) (± 64.17%)

If you look at the arrays printed there for debug, those are the sample times in seconds. It starts at 96 seconds in the warmup, then 16s. When it goes to the "real" benchmark it starts at 95s again, as if the VM were cold. Overall it even performs worse in the "real" benchmark than the warmup, in this particular instance.

The code calls the same methods and the same block in both phases, which is what baffles me. That is, benchmarking always eventually ends up at:

      def measure_block(samples, to_benchmark)
        start = Time.now
        to_benchmark.call
        finish = Time.now
        samples << (finish - start)
      end

The block called here runs a full Monte Carlo Tree Search with 1_000 playouts, so the workload itself is already really repetitive.

Even the entry point for warmup and benchmark is the same (the type passed in is :warmup or :time):

      def run(type)
        puts BENCHMARK_DESCRIPTION[type]
        divider
        time = @options[type]
        run_benchmarks(time)
        puts
      end

The only difference is the banner printed to STDOUT and the total time it gets to perform computations (I also tried using the same time for both phases; same result).

I'm aware that this behavior might be triggered by the same thing that @chrisseaton patched in the benchmark-ips gem, but I'm unsure what to do about it as it really baffles me. My understanding is probably not good enough.

One thing that seems to make it better is running the 9x9 benchmark along with it (and not just the 19x19 benchmark):

--------------------------------------------------------------------------------
[43.902, 15.013, 6.463, 9.926, 8.11, 12.158, 2.926, 2.545, 2.488, 2.747, 1.48, 1.361, 1.188]
9x9 1_000 iterations          7.07 i/min  8.49 s (avg) (± 130.95%)
[142.772]
19x19 1_000 iterations        0.42 i/min  142.77 s (avg) (± 0.0%)

Running the benchmark...
--------------------------------------------------------------------------------
[1.82, 1.554, 1.387, 1.813, 1.626, 1.343, 1.454, 1.985, 4.712, 5.154, 1.499, 1.054, 1.282, 1.305, 1.521, 1.034, 1.276, 1.172, 1.4489999999999998, 1.053, 0.914, 0.891, 0.848, 0.8470000000000001, 0.863, 0.884, 0.875, 2.248, 5.655, 5.729, 5.742, 5.735, 4.191, 4.296, 4.525, 5.443, 5.186, 4.338, 5.028, 4.104]
9x9 1_000 iterations          23.11 i/min  2.6 s (avg) (± 69.98%)
[31.766, 12.041, 7.499, 6.619, 5.739, 6.918, 18.809, 14.977]
19x19 1_000 iterations        4.6 i/min  13.05 s (avg) (± 63.64%)

What also makes it better is just running the benchmarking/time phase twice:

tobi@tobi-desktop ~/github/rubykon $ jruby+truffle run --graal -J-Xmx1500m benchmark/mcts_avg.rb 
$ JAVACMD=../graalvm-jdk1.8.0/bin/java ../jruby/bin/jruby -X\+T -J-Xmx1500m -Xtruffle.core.load_path\=../jruby/truffle/src/main/ruby -r ./.jruby\+truffle_bundle/bundler/setup.rb benchmark/mcts_avg.rb
Warming up...
--------------------------------------------------------------------------------
[101.404, 13.719]
19x19 1_000 iterations        1.04 i/min  57.56 s (avg) (± 76.17%)

Running the benchmark...
--------------------------------------------------------------------------------
[139.456]
19x19 1_000 iterations        0.43 i/min  139.46 s (avg) (± 0.0%)

Running the benchmark...
--------------------------------------------------------------------------------
[27.285, 20.432, 10.405, 9.472, 10.443, 10.77, 10.318, 6.666]
19x19 1_000 iterations        4.54 i/min  13.22 s (avg) (± 49.07%)

What to do here? I'm not sure. A hint at what else I could do to make it work with a warmup would be nice, or, well, making the VM not go cold here, although that's probably way more complicated than I can imagine :)

Thanks for all the work!
Tobi

@chrisseaton chrisseaton added this to the truffle-dev milestone Oct 11, 2015
@chrisseaton chrisseaton self-assigned this Oct 11, 2015
@chrisseaton (Contributor) commented

In general terms, the problem here is the same as before. You're running the interpreter in one state while warming up, and then in another state while measuring. The key method you warm up and then measure is the same, but between warming up and measuring you do some other work, like printing to the screen, and that work changes the state of the interpreter.

Running with -J-G:+TraceTruffleCompilation -J-G:+TraceTruffleTransferToInterpreter -J-G:+TraceTruffleAssumptions (https://gist.github.com/chrisseaton/e312438e4dc3c92fbdf0), you can see that something causes a lot of deoptimisation right after the warmup and right before measurement (https://gist.github.com/chrisseaton/e312438e4dc3c92fbdf0#file-gistfile1-txt-L416-L457). It all seems to revolve around Numeric#eql? from Rubinius (https://github.com/jruby/jruby/blob/master/truffle/src/main/ruby/core/rubinius/common/numeric.rb#L48). This is rewritten, which causes an avalanche of deoptimisations as it is inlined into so many things.

I didn't go far enough to figure out exactly what is causing Numeric#eql? to rewrite, but I can see that the type of local variables is changing, so it looks like something you run between warmup and measurement (probably within a method, within a method, within something in the core library etc) happens to call Numeric#eql? with a new type, maybe Bignum or something, or a long Fixnum instead of an int one, and that's where it all starts to go wrong.

Normally for key routines like that we always 'split', or generate a new copy of the method for each call site, but we don't do that for Rubinius code at the moment.

You are seeing one downside of aggressive optimisation, in that when the program changes the way it runs everything can unravel for a while. The key problem is that you are causing everything to unravel right as you start measuring.

How about running a single loop, measuring the time of each iteration for, say, 2 minutes, so you get an array of samples, and then removing the first 60s of samples? Then warmup and measurement are one unbroken action with no upheaval in between.
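A hypothetical implementation of that suggestion (the method name and parameters are mine, not from rubykon): one unbroken measurement loop, with the warmup samples discarded afterwards based on when each sample started.

```ruby
# One unbroken loop: record [offset_from_start, duration] for every run,
# then keep only the durations of runs that started after the warmup
# window. Nothing happens between warmup and measurement, so the
# interpreter state is never disturbed.
def measure_unbroken(total: 120, warmup: 60, &block)
  samples = []
  start = Time.now
  while (Time.now - start) < total
    before = Time.now
    block.call
    samples << [before - start, Time.now - before]
  end
  samples.select { |offset, _| offset >= warmup }
         .map { |_, duration| duration }
end

# Tiny budgets just for illustration; real runs would use minutes.
durations = measure_unbroken(total: 0.1, warmup: 0.05) do
  1_000.times { Math.sqrt(rand) }
end
```

The printing of results and any other bookkeeping then happens once, after all samples are collected.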


PragTob commented Oct 12, 2015

Thanks for the explanation @chrisseaton !

It seems like (I'd bet it is) the printing of results: various number comparisons are invoked there (just like over in benchmark-ips), and values are rounded and then printed out. It should just be additions, divisions etc., so I have no idea why #eql? is called, but it probably happens somewhere in that code.

Thanks!

Should I leave this open for further investigation on the Truffle side, or close it because I got the answer I needed and can now work around it?

@chrisseaton (Contributor) commented

I can't see exactly what is calling #eql?, but in the end it doesn't really matter. If we fixed that there might be something else that gets rewritten by the code between warmup and measurement.

I'll close this, with the same advice as before: any code between warmup and measurement is bad. I think most experienced VM researchers would agree with that. I don't think we want to 'fix' anything in JRuby+Truffle here, unfortunately.


@eregon (Member) commented Oct 28, 2015

Just for info: extract_times creates a Hash and adds multiple pairs to it, which calls eql? on the keys.
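A small demonstration of why Hash insertion ends up calling #eql? (this is standard Ruby Hash semantics, not rubykon code; the tracing class is made up for illustration): when two keys share the same #hash value, the Hash must fall back to #eql? to decide whether they are the same key.

```ruby
# Keys that count how often eql? is invoked. Returning a constant from
# #hash forces every key into the same bucket, so the Hash has to call
# #eql? to distinguish keys on insertion.
class TracingKey
  attr_reader :n, :calls

  def initialize(n)
    @n = n
    @calls = 0
  end

  def hash
    0 # deliberate collision: all keys hash equal
  end

  def eql?(other)
    @calls += 1
    n == other.n
  end
end

a = TracingKey.new(1)
b = TracingKey.new(2)
h = { a => :first }
h[b] = :second # same #hash, so eql? runs to tell a and b apart
```

With numeric keys the same mechanism dispatches to Numeric#eql?, which in JRuby+Truffle at the time was the Rubinius implementation that deoptimised above.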

@enebo enebo added this to the Non-Release milestone Dec 7, 2017