review String -> RubyString UTF (8) encoding #5239

kares · 2018-07-05T17:35:34Z

First, UTF-16 encoding is broken on Java 10+ since it assumes String's internal value is a char[] (accessed using Unsafe).

While doing that, I spotted a few opportunities to copy byte[]/char[] less and started micro-benchmarking.

Separating the String and CharSequence paths down the encoding pipeline seems to bring gains (likely because HotSpot's JIT can treat the CharSequence method as monomorphic, as it only ever gets a StringBuilder).
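A rough sketch of what separate String and CharSequence paths could look like, using only JDK APIs. The class and method names here are hypothetical and are not taken from the JRuby code base:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only, not the actual JRuby implementation.
final class SplitPathsSketch {

    // String path: delegate to the JDK's own String encoder.
    static byte[] encodeUTF8(String str) {
        return str.getBytes(StandardCharsets.UTF_8);
    }

    // CharSequence path: kept as a separate method so callers passing
    // e.g. a StringBuilder do not share a code path with String callers.
    static byte[] encodeUTF8(CharSequence str) {
        if (str instanceof String) return encodeUTF8((String) str);
        final ByteBuffer buffer = StandardCharsets.UTF_8.encode(CharBuffer.wrap(str));
        final byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes); // copy out exactly the encoded bytes
        return bytes;
    }
}
```

Java picks the overload at compile time from the static argument type, so `encodeUTF8("abc")` takes the String path and `encodeUTF8(new StringBuilder("abc"))` takes the CharSequence path. Whether the JIT then treats each call site as monomorphic depends on what actually flows through it at runtime.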

But since the micro-benchmark wasn't reassuring (non-StringBuilder CharSequence paths do get slower in JMH), I did some AR-JDBC "raw" AR benchmarking ...
Surprisingly, this seems to make sense, as there is a noticeable speed gain of almost 5%.

kares · 2018-07-05T17:42:48Z

AR-JDBC 5.1 numbers

JRuby 9.1

                                                                               user     system      total        real
BenchRecord.select('a_binary').where(id: i).first [9000x]                  2.620000   0.170000   2.790000 (  3.542872)
BenchRecord.select('a_boolean').where(id: i).first [9000x]                 2.600000   0.190000   2.790000 (  3.502637)
BenchRecord.select('a_date').where(id: i).first [9000x]                    2.750000   0.150000   2.900000 (  3.681156)
BenchRecord.select('a_datetime').where(id: i).first [9000x]                2.670000   0.170000   2.840000 (  3.528049)
BenchRecord.select('a_decimal').where(id: i).first [9000x]                 2.630000   0.170000   2.800000 (  3.550048)
BenchRecord.select('a_float').where(id: i).first [9000x]                   2.540000   0.210000   2.750000 (  3.488340)
BenchRecord.select('a_integer').where(id: i).first [9000x]                 2.670000   0.150000   2.820000 (  3.583536)
BenchRecord.select('a_string').where(id: i).first [9000x]                  2.650000   0.160000   2.810000 (  3.579523)
BenchRecord.select('a_text').where(id: i).first [9000x]                    2.670000   0.120000   2.790000 (  3.569486)
BenchRecord.select('a_time').where(id: i).first [9000x]                    2.630000   0.170000   2.800000 (  3.565073)
BenchRecord.select('a_timestamp').where(id: i).first [9000x]               2.620000   0.140000   2.760000 (  3.519952)
BenchRecord.select('*').where(id: i).first [9000x]                         3.030000   0.190000   3.220000 (  4.241088)
BenchRecord.select('a_binary').where(['id = ?', an_id]).first [9000x]      2.270000   0.140000   2.410000 (  3.143300)
BenchRecord.select('a_boolean').where(['id = ?', an_id]).first [9000x]     2.330000   0.140000   2.470000 (  3.130664)
BenchRecord.select('a_date').where(['id = ?', an_id]).first [9000x]        2.420000   0.150000   2.570000 (  3.304973)
BenchRecord.select('a_datetime').where(['id = ?', an_id]).first [9000x]    2.340000   0.160000   2.500000 (  3.216238)
BenchRecord.select('a_decimal').where(['id = ?', an_id]).first [9000x]     2.260000   0.180000   2.440000 (  3.160046)
BenchRecord.select('a_float').where(['id = ?', an_id]).first [9000x]       2.330000   0.130000   2.460000 (  3.176588)
BenchRecord.select('a_integer').where(['id = ?', an_id]).first [9000x]     2.300000   0.160000   2.460000 (  3.151543)
BenchRecord.select('a_string').where(['id = ?', an_id]).first [9000x]      2.280000   0.140000   2.420000 (  3.135031)
BenchRecord.select('a_text').where(['id = ?', an_id]).first [9000x]        2.310000   0.140000   2.450000 (  3.156599)
BenchRecord.select('a_time').where(['id = ?', an_id]).first [9000x]        2.320000   0.150000   2.470000 (  3.138669)
BenchRecord.select('a_timestamp').where(['id = ?', an_id]).first [9000x]   2.350000   0.110000   2.460000 (  3.181043)
BenchRecord.select('*').where(['id = ?', an_id]).first [9000x]             2.470000   0.170000   2.640000 (  3.456422)

JRuby 9.1 + with newDefaultInternalString replacement

                                                                               user     system      total        real
BenchRecord.select('a_binary').where(id: i).first [9000x]                  2.340000   0.130000   2.470000 (  3.030340)
BenchRecord.select('a_boolean').where(id: i).first [9000x]                 2.200000   0.150000   2.350000 (  3.004720)
BenchRecord.select('a_date').where(id: i).first [9000x]                    2.450000   0.090000   2.540000 (  3.145064)
BenchRecord.select('a_datetime').where(id: i).first [9000x]                2.170000   0.160000   2.330000 (  2.965873)
BenchRecord.select('a_decimal').where(id: i).first [9000x]                 2.260000   0.100000   2.360000 (  3.000331)
BenchRecord.select('a_float').where(id: i).first [9000x]                   2.320000   0.120000   2.440000 (  3.105647)
BenchRecord.select('a_integer').where(id: i).first [9000x]                 2.300000   0.110000   2.410000 (  3.062773)
BenchRecord.select('a_string').where(id: i).first [9000x]                  2.280000   0.150000   2.430000 (  3.073316)
BenchRecord.select('a_text').where(id: i).first [9000x]                    2.330000   0.150000   2.480000 (  3.125176)
BenchRecord.select('a_time').where(id: i).first [9000x]                    2.250000   0.140000   2.390000 (  3.032169)
BenchRecord.select('a_timestamp').where(id: i).first [9000x]               2.220000   0.130000   2.350000 (  3.024004)
BenchRecord.select('*').where(id: i).first [9000x]                         2.600000   0.170000   2.770000 (  3.508368)
BenchRecord.select('a_binary').where(['id = ?', an_id]).first [9000x]      2.000000   0.110000   2.110000 (  2.720760)
BenchRecord.select('a_boolean').where(['id = ?', an_id]).first [9000x]     1.940000   0.150000   2.090000 (  2.710401)
BenchRecord.select('a_date').where(['id = ?', an_id]).first [9000x]        2.040000   0.180000   2.220000 (  2.851454)
BenchRecord.select('a_datetime').where(['id = ?', an_id]).first [9000x]    2.000000   0.140000   2.140000 (  2.766859)
BenchRecord.select('a_decimal').where(['id = ?', an_id]).first [9000x]     1.930000   0.200000   2.130000 (  2.745211)
BenchRecord.select('a_float').where(['id = ?', an_id]).first [9000x]       2.020000   0.120000   2.140000 (  2.760144)
BenchRecord.select('a_integer').where(['id = ?', an_id]).first [9000x]     1.960000   0.180000   2.140000 (  2.787334)
BenchRecord.select('a_string').where(['id = ?', an_id]).first [9000x]      1.970000   0.180000   2.150000 (  2.807925)
BenchRecord.select('a_text').where(['id = ?', an_id]).first [9000x]        1.960000   0.160000   2.120000 (  2.766127)
BenchRecord.select('a_time').where(['id = ?', an_id]).first [9000x]        2.010000   0.110000   2.120000 (  2.747843)
BenchRecord.select('a_timestamp').where(['id = ?', an_id]).first [9000x]   1.960000   0.170000   2.130000 (  2.756853)
BenchRecord.select('*').where(['id = ?', an_id]).first [9000x]             2.240000   0.180000   2.420000 (  3.164800)

JRuby 9.2.1 SNAPSHOT

                                                                               user     system      total        real
BenchRecord.select('a_binary').where(id: i).first [9000x]                  2.090000   0.140000   2.230000 (  2.890846)
BenchRecord.select('a_boolean').where(id: i).first [9000x]                 2.080000   0.140000   2.220000 (  2.872279)
BenchRecord.select('a_date').where(id: i).first [9000x]                    2.150000   0.170000   2.320000 (  2.968458)
BenchRecord.select('a_datetime').where(id: i).first [9000x]                2.160000   0.160000   2.320000 (  2.938405)
BenchRecord.select('a_decimal').where(id: i).first [9000x]                 2.120000   0.160000   2.280000 (  2.936794)
BenchRecord.select('a_float').where(id: i).first [9000x]                   2.130000   0.130000   2.260000 (  2.908788)
BenchRecord.select('a_integer').where(id: i).first [9000x]                 2.170000   0.150000   2.320000 (  2.979247)
BenchRecord.select('a_string').where(id: i).first [9000x]                  2.120000   0.130000   2.250000 (  2.911211)
BenchRecord.select('a_text').where(id: i).first [9000x]                    2.140000   0.160000   2.300000 (  2.957397)
BenchRecord.select('a_time').where(id: i).first [9000x]                    2.110000   0.180000   2.290000 (  2.963789)
BenchRecord.select('a_timestamp').where(id: i).first [9000x]               2.100000   0.150000   2.250000 (  2.908456)
BenchRecord.select('*').where(id: i).first [9000x]                         2.470000   0.140000   2.610000 (  3.411368)
BenchRecord.select('a_binary').where(['id = ?', an_id]).first [9000x]      1.900000   0.130000   2.030000 (  2.667544)
BenchRecord.select('a_boolean').where(['id = ?', an_id]).first [9000x]     1.930000   0.120000   2.050000 (  2.642570)
BenchRecord.select('a_date').where(['id = ?', an_id]).first [9000x]        1.920000   0.160000   2.080000 (  2.707721)
BenchRecord.select('a_datetime').where(['id = ?', an_id]).first [9000x]    1.870000   0.150000   2.020000 (  2.689717)
BenchRecord.select('a_decimal').where(['id = ?', an_id]).first [9000x]     1.940000   0.160000   2.100000 (  2.756819)
BenchRecord.select('a_float').where(['id = ?', an_id]).first [9000x]       1.870000   0.190000   2.060000 (  2.706127)
BenchRecord.select('a_integer').where(['id = ?', an_id]).first [9000x]     1.840000   0.160000   2.000000 (  2.661135)
BenchRecord.select('a_string').where(['id = ?', an_id]).first [9000x]      1.910000   0.150000   2.060000 (  2.699303)
BenchRecord.select('a_text').where(['id = ?', an_id]).first [9000x]        1.930000   0.140000   2.070000 (  2.722707)
BenchRecord.select('a_time').where(['id = ?', an_id]).first [9000x]        1.960000   0.120000   2.080000 (  2.742339)
BenchRecord.select('a_timestamp').where(['id = ?', an_id]).first [9000x]   1.850000   0.130000   1.980000 (  2.631308)
BenchRecord.select('*').where(['id = ?', an_id]).first [9000x]             2.210000   0.120000   2.330000 (  3.091712)

headius · 2018-07-05T22:30:45Z

Numbers look good...I'll review the change.

... this likely isn't used that much, since it might have failed in cases where String's char[] is shared (int offset > 0). It would also need special care on Java 10, where the internal value is a byte[].
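As a minimal sketch of encoding UTF-16 without touching String internals, assuming only the public `charAt` accessor is used (the class and method names are hypothetical, not JRuby's actual code):

```java
// Hypothetical illustration only, not the actual JRuby implementation.
final class Utf16EncodeSketch {

    // Encodes the UTF-16 code units of any CharSequence as big-endian bytes
    // (no BOM) through the public charAt() accessor, so the result does not
    // depend on whether the JVM stores String contents as char[] or byte[].
    // Unpaired surrogates are written through as-is rather than replaced.
    static byte[] encodeUTF16BE(CharSequence str) {
        final int len = str.length();
        final byte[] bytes = new byte[len * 2];
        for (int i = 0; i < len; i++) {
            final char c = str.charAt(i);
            bytes[2 * i]     = (byte) (c >>> 8); // high byte first
            bytes[2 * i + 1] = (byte) c;         // then low byte
        }
        return bytes;
    }
}
```

For well-formed input this should produce the same bytes as `str.getBytes(StandardCharsets.UTF_16BE)`, which is another way to stay on public API.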

... for non-direct ByteBuffer we can extract bytes directly
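One way this could look, as a sketch under the assumption that the encoder hands back a heap (non-direct) `ByteBuffer`. The helper names are hypothetical; only standard `java.nio` calls are used:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical illustration only, not the actual JRuby implementation.
final class ByteBufferBytesSketch {

    // Returns the buffer's remaining bytes, reusing the backing array when
    // the buffer is array-backed and the content covers the whole array.
    static byte[] toBytes(ByteBuffer buffer) {
        if (buffer.hasArray()) { // false for direct and read-only buffers
            final byte[] array = buffer.array();
            final int start = buffer.arrayOffset() + buffer.position();
            final int length = buffer.remaining();
            if (start == 0 && length == array.length) {
                return array; // exact fit: no copy needed
            }
            return Arrays.copyOfRange(array, start, start + length);
        }
        final byte[] bytes = new byte[buffer.remaining()];
        buffer.duplicate().get(bytes); // duplicate() keeps the caller's position untouched
        return bytes;
    }

    static byte[] encode(Charset charset, CharSequence str) {
        return toBytes(charset.encode(CharBuffer.wrap(str)));
    }
}
```

A caller that can carry an offset and length alongside the array could skip the `copyOfRange` branch as well; this sketch returns a plain `byte[]` to stay self-contained.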

Doing toString() does not make a difference in micro-benchmarks, so we might as well skip the char[] copy, especially since one doesn't know what kind of CharSequence objects might come along ...

and no longer fill in a null encoding - we always pass it down. Interestingly, in micro-benchmarks this seems to run better: passing a StringBuilder down gets a very noticeable speed improvement, while String cases stay at around the same performance.

BEFORE:

```
Benchmark                                                   Mode  Cnt      Score      Error   Units
EncodingBenchmark.benchLongRubyStringNew                   thrpt    5   7104.064 ±  252.231  ops/ms
EncodingBenchmark.benchLongRubyStringNewCharSequence       thrpt    5   6882.044 ±  133.946  ops/ms
EncodingBenchmark.benchLongRubyStringNewCharSequence2      thrpt    5   7059.163 ±  208.203  ops/ms
EncodingBenchmark.benchLongRubyStringNewCharSequence3      thrpt    5   7177.851 ±  188.033  ops/ms
EncodingBenchmark.benchShortRubyStringNew                  thrpt    5  15108.288 ±  282.496  ops/ms
EncodingBenchmark.benchShortRubyStringNewCharSequence      thrpt    5  14342.470 ±  101.090  ops/ms
EncodingBenchmark.benchVeryLongRubyStringNew               thrpt    5   1173.092 ±    8.716  ops/ms
EncodingBenchmark.benchVeryLongRubyStringNewCharSequence   thrpt    5   1017.636 ±   58.843  ops/ms
EncodingBenchmark.benchVeryLongRubyStringNewCharSequence2  thrpt    5   1065.907 ±   26.763  ops/ms
```

AFTER:

```
Benchmark                                                   Mode  Cnt      Score      Error   Units
EncodingBenchmark.benchLongRubyStringNew                   thrpt    5   7205.086 ±  474.930  ops/ms
EncodingBenchmark.benchLongRubyStringNewCharSequence       thrpt    5   9239.360 ±  338.284  ops/ms
EncodingBenchmark.benchLongRubyStringNewCharSequence2      thrpt    5   4425.827 ±  246.294  ops/ms
EncodingBenchmark.benchLongRubyStringNewCharSequence3      thrpt    5   7661.631 ±  418.873  ops/ms
EncodingBenchmark.benchShortRubyStringNew                  thrpt    5  15875.130 ±  926.360  ops/ms
EncodingBenchmark.benchShortRubyStringNewCharSequence      thrpt    5  16137.382 ± 1024.177  ops/ms
EncodingBenchmark.benchVeryLongRubyStringNew               thrpt    5   1149.699 ±   27.375  ops/ms
EncodingBenchmark.benchVeryLongRubyStringNewCharSequence   thrpt    5   1982.773 ±  133.350  ops/ms
EncodingBenchmark.benchVeryLongRubyStringNewCharSequence2  thrpt    5    634.528 ±  224.842  ops/ms
```

... makes no difference for micro-benchmarks, but we at least won't copy char[] buffers around - it's clearly the user's intent to encode the buffer
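A small sketch of the "do not wrap it again" idea, assuming the encode entry point accepts any `CharSequence`. The class name is hypothetical, and using `duplicate()` to protect the caller's position is a choice made for this illustration rather than something the commit states:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Hypothetical illustration only, not the actual JRuby implementation.
final class CharBufferEncodeSketch {

    static ByteBuffer encode(Charset charset, CharSequence str) {
        // A CharBuffer is already in the form the encoder consumes, so it is
        // passed along instead of being wrapped in yet another CharBuffer.
        final CharBuffer chars = (str instanceof CharBuffer)
                ? ((CharBuffer) str).duplicate() // shares content, independent position
                : CharBuffer.wrap(str);
        return charset.encode(chars);
    }
}
```

`Charset.encode` advances the position of the buffer it is given, which is why the sketch encodes a duplicate: the caller's `CharBuffer` keeps its position and no characters are copied.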

kares · 2018-08-21T14:01:36Z

let's ship before it gets behind ... CI didn't spot any regressions.

kares force-pushed the string-encode branch from 14ce818 to 4f8d0ac July 5, 2018 20:30

kares force-pushed the string-encode branch from 4f8d0ac to 7270745 July 6, 2018 11:15

kares added 5 commits July 7, 2018 14:33

[fix] encode UTF-16 without unwrapping Java String internals

1a09fe3

[refactor] avoid a byte[] copy on encode when possible

335dc2e

handle String/CharSequence decoding slightly differently

c253b45

[refactor] only one ASCII enc + use enc.charset when not null

c766b25

kares force-pushed the string-encode branch 2 times, most recently from cb57a76 to 5105163 July 16, 2018 14:35

kares added 5 commits July 16, 2018 16:36

do not asString Symbols + assume reasonable thread-safety

e9ac48e

[refactor] no need for ByteList wrapping - use bare byte[]

5717f94

[refactor] cleanup - remove "test" main method

c63ef49

when CharBuffer is to be encoded do not wrap it again

5105163

kares merged commit b1f5b01 into jruby:master Aug 21, 2018

kares added this to the JRuby 9.2.1.0 milestone Aug 21, 2018
