JRuby 9.0.1.0 has slower $global_var usage #3350

ghost · 2015-09-25T15:06:45Z

Using a (very simple) benchmark to test how JRuby handles differing scopes for variables, I found that unlike MRI 2.2.2, JRuby 9.0.1.0 was performing far slower than other variable types / scopes with globals.

The test code is as follows:

require 'benchmark'

LOOPS = 30000000

$global = 0
@outer_inst = 0
outer_scope = 0

Benchmark.bmbm {|test|
    @inner_inst = 0
    inner_scope = 0

    test.report('$global'){ LOOPS.times { $global += 1 } }
    test.report('@outer_inst'){ LOOPS.times { @outer_inst += 1 } }
    test.report('outer_scope'){ LOOPS.times { outer_scope += 1 } }
    test.report('@inner_inst'){ LOOPS.times { @inner_inst += 1 } }
    test.report('inner_scope'){ LOOPS.times { inner_scope += 1 } }
    test.report('local_scope'){ local_scope = 0; LOOPS.times { local_scope += 1 } }
}

The results for JRuby:
jruby 9.0.1.0 (2.2.2) 2015-09-02 583f336 Java HotSpot(TM) 64-Bit Server VM 25.51-b03 on 1.8.0_51-b16 +jit [Windows 7-amd64]

Rehearsal -----------------------------------------------
$global       6.167000   0.000000   6.167000 (  6.167375)
@outer_inst   2.830000   0.000000   2.830000 (  2.829050)
outer_scope   2.777000   0.000000   2.777000 (  2.776937)
@inner_inst   3.118000   0.000000   3.118000 (  3.117344)
inner_scope   2.631000   0.000000   2.631000 (  2.630262)
local_scope   2.297000   0.000000   2.297000 (  2.297344)
------------------------------------- total: 19.820000sec

                  user     system      total        real
$global       6.164000   0.000000   6.164000 (  6.162756)
@outer_inst   3.078000   0.000000   3.078000 (  3.077359)
outer_scope   2.360000   0.000000   2.360000 (  2.359673)
@inner_inst   2.614000   0.000000   2.614000 (  2.614211)
inner_scope   2.384000   0.000000   2.384000 (  2.384571)
local_scope   2.462000   0.000000   2.462000 (  2.461775)

And MRI:
ruby 2.2.2p95 (2015-04-13 revision 50295) [x64-mingw32]

Rehearsal -----------------------------------------------
$global       5.086000   0.000000   5.086000 (  5.099396)
@outer_inst   5.272000   0.000000   5.272000 (  5.281465)
outer_scope   4.821000   0.016000   4.837000 (  4.832652)
@inner_inst   4.305000   0.000000   4.305000 (  4.335950)
inner_scope   4.727000   0.000000   4.727000 (  4.745273)
local_scope   5.101000   0.000000   5.101000 (  5.102266)
------------------------------------- total: 29.328000sec

                  user     system      total        real
$global       4.759000   0.000000   4.759000 (  4.764358)
@outer_inst   5.335000   0.000000   5.335000 (  5.348511)
outer_scope   5.179000   0.000000   5.179000 (  5.180154)
@inner_inst   5.429000   0.000000   5.429000 (  5.503256)
inner_scope   4.929000   0.000000   4.929000 (  4.937495)
local_scope   4.556000   0.015000   4.571000 (  4.567554)

It seems that while MRI is fairly normalized across the board, JRuby only shows relatively equal times for local, inner, and instance variable scopes. In particular, JRuby's time for globals is ~226% of the average of all other tested scopes.

The good news, however, is that JRuby's jitting is definitely resulting in faster overall performance.

The text was updated successfully, but these errors were encountered:

headius · 2015-09-25T19:19:52Z

The perf impact here seems to be mostly the writes; I split your benchmark into reads and writes and global var reads are about the same perf as all the others (and probably lost in the noise of the loop):

9k:

                        user     system      total        real
$global read        1.170000   0.010000   1.180000 (  1.198987)
@outer_inst read    1.020000   0.010000   1.030000 (  0.957400)
outer_scope read    1.150000   0.010000   1.160000 (  1.186650)
@inner_inst read    0.970000   0.000000   0.970000 (  0.968351)
inner_scope read    1.080000   0.010000   1.090000 (  1.055287)
local_scope read    0.980000   0.000000   0.980000 (  0.977480)
$global write       2.450000   0.030000   2.480000 (  2.404311)
@outer_inst write   1.170000   0.010000   1.180000 (  1.178349)
outer_scope write   1.180000   0.000000   1.180000 (  1.188390)
@inner_inst write   1.160000   0.000000   1.160000 (  1.163344)
inner_scope write   1.040000   0.000000   1.040000 (  1.044345)
local_scope write   1.050000   0.000000   1.050000 (  1.066801)

1.7.21:

                        user     system      total        real
$global read        1.190000   0.000000   1.190000 (  1.197000)
@outer_inst read    1.040000   0.000000   1.040000 (  1.043000)
outer_scope read    0.960000   0.000000   0.960000 (  0.967000)
@inner_inst read    1.050000   0.010000   1.060000 (  1.059000)
inner_scope read    0.990000   0.010000   1.000000 (  1.003000)
local_scope read    0.960000   0.000000   0.960000 (  0.968000)
$global write       2.380000   0.030000   2.410000 (  2.348000)
@outer_inst write   1.260000   0.010000   1.270000 (  1.259000)
outer_scope write   0.980000   0.000000   0.980000 (  0.985000)
@inner_inst write   1.300000   0.010000   1.310000 (  1.303000)
inner_scope write   1.000000   0.000000   1.000000 (  1.001000)
local_scope write   0.950000   0.000000   0.950000 (  0.954000)

Note, however, that 1.7 also has some invokedynamic optimizations for reads to make them faster. I suspect that's lost in the loop noise too, but here's those numbers:

1.7.21 with indy, first iteration

$global read        0.900000   0.010000   0.910000 (  0.654000)
@outer_inst read    1.150000   0.010000   1.160000 (  1.022000)
outer_scope read    0.950000   0.010000   0.960000 (  0.943000)
@inner_inst read    1.060000   0.010000   1.070000 (  1.029000)
inner_scope read    0.950000   0.000000   0.950000 (  0.942000)
local_scope read    0.860000   0.010000   0.870000 (  0.853000)

However, globals that are written repeatedly destroy this optimization (so we don't constantly try to keep rebinding it) and the second iteration goes back to the same perf:

1.7.21 with indy, second iteration

$global read        1.070000   0.010000   1.080000 (  1.017000)
@outer_inst read    1.000000   0.000000   1.000000 (  1.006000)
outer_scope read    0.970000   0.000000   0.970000 (  0.972000)
@inner_inst read    0.970000   0.000000   0.970000 (  0.965000)
inner_scope read    0.930000   0.010000   0.940000 (  0.933000)
local_scope read    0.900000   0.000000   0.900000 (  0.900000)

Ultimately 9k does not perform much differently than 1.7 (other than the missing indy optimization) but I think it's worth taking a few minutes to improve this.

ghost · 2015-09-25T19:22:21Z

This is refreshing to hear. I had been worried when first I ran this test;
glad to see it's only the writes that are slow.

Not sure why I didn't think to check that. Thank you!
On Sep 25, 2015 3:20 PM, "Charles Oliver Nutter" notifications@github.com
wrote:

The perf impact here seems to be mostly the writes; I split your benchmark
into reads and writes and global var reads are about the same perf as all
the others (and probably lost in the noise of the loop):

9k:
                    user     system      total        real
$global read 1.170000 0.010000 1.180000 ( 1.198987)
@outer_inst read 1.020000 0.010000 1.030000 ( 0.957400)
outer_scope read 1.150000 0.010000 1.160000 ( 1.186650)
@inner_inst read 0.970000 0.000000 0.970000 ( 0.968351)
inner_scope read 1.080000 0.010000 1.090000 ( 1.055287)
local_scope read 0.980000 0.000000 0.980000 ( 0.977480)
$global write 2.450000 0.030000 2.480000 ( 2.404311)
@outer_inst write 1.170000 0.010000 1.180000 ( 1.178349)
outer_scope write 1.180000 0.000000 1.180000 ( 1.188390)
@inner_inst write 1.160000 0.000000 1.160000 ( 1.163344)
inner_scope write 1.040000 0.000000 1.040000 ( 1.044345)
local_scope write 1.050000 0.000000 1.050000 ( 1.066801)

1.7.21:
                    user     system      total        real
$global read 1.190000 0.000000 1.190000 ( 1.197000)
@outer_inst read 1.040000 0.000000 1.040000 ( 1.043000)
outer_scope read 0.960000 0.000000 0.960000 ( 0.967000)
@inner_inst read 1.050000 0.010000 1.060000 ( 1.059000)
inner_scope read 0.990000 0.010000 1.000000 ( 1.003000)
local_scope read 0.960000 0.000000 0.960000 ( 0.968000)
$global write 2.380000 0.030000 2.410000 ( 2.348000)
@outer_inst write 1.260000 0.010000 1.270000 ( 1.259000)
outer_scope write 0.980000 0.000000 0.980000 ( 0.985000)
@inner_inst write 1.300000 0.010000 1.310000 ( 1.303000)
inner_scope write 1.000000 0.000000 1.000000 ( 1.001000)
local_scope write 0.950000 0.000000 0.950000 ( 0.954000)

Note, however, that 1.7 also has some invokedynamic optimizations for
reads to make them faster. I suspect that's lost in the loop noise too, but
here's those numbers:

1.7.21 with indy, first iteration

$global read 0.900000 0.010000 0.910000 ( 0.654000)
@outer_inst read 1.150000 0.010000 1.160000 ( 1.022000)
outer_scope read 0.950000 0.010000 0.960000 ( 0.943000)
@inner_inst read 1.060000 0.010000 1.070000 ( 1.029000)
inner_scope read 0.950000 0.000000 0.950000 ( 0.942000)
local_scope read 0.860000 0.010000 0.870000 ( 0.853000)

However, globals that are written repeatedly destroy this optimization (so
we don't constantly try to keep rebinding it) and the second iteration goes
back to the same perf:

1.7.21 with indy, second iteration

$global read 1.070000 0.010000 1.080000 ( 1.017000)
@outer_inst read 1.000000 0.000000 1.000000 ( 1.006000)
outer_scope read 0.970000 0.000000 0.970000 ( 0.972000)
@inner_inst read 0.970000 0.000000 0.970000 ( 0.965000)
inner_scope read 0.930000 0.010000 0.940000 ( 0.933000)
local_scope read 0.900000 0.000000 0.900000 ( 0.900000)

Ultimately 9k does not perform much differently than 1.7 (other than the
missing indy optimization) but I think it's worth taking a few minutes to
improve this.

—
Reply to this email directly or view it on GitHub
#3350 (comment).

This improves global read perf for #3350 but global variable writes still seem to be slower than MRI. I'm not exactly sure why but it probably has to do with doing a full lookup for our too- abstract wrapper around these values.

headius · 2015-10-02T16:56:31Z

With f2612a2, 9k should have the same perf for globals as 1.7.x. Why our globals are slower to write than MRI's...that's still an open question.

headius · 2017-03-03T07:05:14Z

Note to self: include modified benchmarks in the original report.

I suspect the cost of writing is that even after we've given up on caching the global value, we continue to invalidate a SwitchPoint for the indy logic in the global variable. This is certainly not free, because invalidating it means we also construct a new one. I think the right way to fix this would be to also mark the variable as "no longer cacheable" which in turn means "don't bother invalidating".

I'll try to fix this tomorrow.

headius · 2017-03-03T21:01:04Z

Split up read/write benchmark. Note that the reads of some of these types of variables will be optimized completely away by either our IR or by the JVM, so they're really just measuring the block dispatch overhead.

require 'benchmark'

LOOPS = 30000000

$global = 0
@outer_inst = 0
outer_scope = 0

loop { Benchmark.bmbm {|test|
    @inner_inst = 0
    inner_scope = 0

    test.report('$global w'){ LOOPS.times { $global = 1 } }
    test.report('@outer_inst w'){ LOOPS.times { @outer_inst = 1 } }
    test.report('outer_scope w'){ LOOPS.times { outer_scope = 1 } }
    test.report('@inner_inst w'){ LOOPS.times { @inner_inst = 1 } }
    test.report('inner_scope w'){ LOOPS.times { inner_scope = 1 } }
    test.report('local_scope w'){ local_scope = 0; LOOPS.times { local_scope = 1 } }
    test.report('$global r'){ LOOPS.times { $global } }
    test.report('@outer_inst r'){ LOOPS.times { @outer_inst } }
    test.report('outer_scope r'){ LOOPS.times { outer_scope } }
    test.report('@inner_inst r'){ LOOPS.times { @inner_inst } }
    test.report('inner_scope r'){ LOOPS.times { inner_scope } }
    test.report('local_scope r'){ local_scope = 0; LOOPS.times { local_scope } }
} }

headius · 2017-03-03T21:06:18Z

I was wrong; the overhead is not from constantly re-invalidating -- we have a failover point at which we just leave it permanently invalid -- but from doing a slow hash lookup of the variable every time. Since this affects all globals, including some that are expected to be updated if you're using them at all, I'll make appropriate modifications to avoid this lookup.

headius · 2017-03-03T21:21:30Z

Calling this one done. Global variable reads cache their value unless invalidated too many times (set by -Xinvokedynamic.global.maxfail=100) and cache the wrapper object after that to avoid hash lookup overhead. Global variable sets cache the wrapper object. That's about as cheap as I can make these right now.

ghost · 2017-03-03T21:48:18Z

Thanks for this! Just saw your progress and I'm glad to hear there's a good resolution.

rtyler added core JRuby 9000 labels Sep 25, 2015

kares added the performance label Jan 18, 2017

kares added this to the JRuby 9.1.8.0 milestone Jan 18, 2017

headius added a commit that referenced this issue Mar 3, 2017

Cache the GlobalVariable object for sets as well. Part of #3350.

b77ee98

headius closed this as completed Mar 3, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

JRuby 9.0.1.0 has slower $global_var usage #3350

JRuby 9.0.1.0 has slower $global_var usage #3350

ghost commented Sep 25, 2015

headius commented Sep 25, 2015

ghost commented Sep 25, 2015

headius commented Oct 2, 2015

headius commented Mar 3, 2017

headius commented Mar 3, 2017

headius commented Mar 3, 2017

headius commented Mar 3, 2017

ghost commented Mar 3, 2017

JRuby 9.0.1.0 has slower $global_var usage #3350

JRuby 9.0.1.0 has slower $global_var usage #3350

Comments

ghost commented Sep 25, 2015

headius commented Sep 25, 2015

ghost commented Sep 25, 2015

headius commented Oct 2, 2015

headius commented Mar 3, 2017

headius commented Mar 3, 2017

headius commented Mar 3, 2017

headius commented Mar 3, 2017

ghost commented Mar 3, 2017