
Unexpected performance with GraphQL gem: JRuby indy < JRuby < MRI #4623

Closed

ivoanjo opened this issue May 24, 2017 · 10 comments

ivoanjo (Contributor) commented May 24, 2017

Environment

Rubies:

  • jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 [linux-x86_64]
  • ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]

Gems:

  • graphql 1.15.13

Linux:

  • Linux maruchan 4.11.0-041100-generic #201705041534 SMP Thu May 4 19:36:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Ubuntu 17.04

Expected Behavior

Performance when running the GraphQL gem's query lexer on JRuby is 1/5th of MRI's performance, and 1/10th when using JRuby with invokedynamic. It's possible that the code is suboptimal and there's just nothing that can be improved without changing it, but I found it especially weird that it's even slower with invokedynamic, so I decided to open this issue.

This has a big impact on our system, since every GraphQL web request goes through this app (I identified this issue because profiling showed that about half of request time with JRuby was spent lexing the GraphQL query).

I'm using an example GraphQL query from GitHub's documentation here, but the performance behavior is the same as with the queries I was actually using.

Benchmark:

```ruby
require 'graphql'
require 'benchmark/ips'

# https://githubengineering.com/the-github-graphql-api/

input_string = "{
  viewer {
    login
    starredRepositories {
      totalCount
    }
    repositories(first: 3) {
      edges {
        node {
          name
          stargazers {
            totalCount
          }
          forks {
            totalCount
          }
          watchers {
            totalCount
          }
          issues(states:[OPEN]) {
            totalCount
          }
        }
      }
    }
  }
}
"

puts "Running with #{RUBY_DESCRIPTION}"

Benchmark.ips do |benchmark|
  benchmark.time = 30
  benchmark.warmup = 30
  benchmark.report('Lexer') { GraphQL::Language::Lexer.tokenize(input_string) }
  benchmark.compare!
end
```
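For reference, the three configurations below can be run like so (assuming the script is saved as bench.rb; the invokedynamic flag is the same one shown in the +indy output below):

```
ruby bench.rb                                  # MRI
jruby bench.rb                                 # JRuby (default settings)
jruby -Xcompile.invokedynamic=true bench.rb    # JRuby with invokedynamic
```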

Actual Behavior

  • Performance on MRI:

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   111.000  i/100ms
Calculating -------------------------------------
               Lexer      1.124k (± 3.9%) i/s -     33.744k in  30.069500s
```

  • Performance on JRuby:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    24.000  i/100ms
Calculating -------------------------------------
               Lexer    250.435  (± 4.8%) i/s -      7.512k in  30.080584s
```

  • Performance on JRuby with invokedynamic:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +indy +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    13.000  i/100ms
Calculating -------------------------------------
               Lexer    141.906  (± 7.0%) i/s -      4.238k in  30.067898s
```
enebo (Member) commented May 24, 2017

At a quick glance, this suffers from the Ruby lexer generating a nearly 500-line Ruby method, which our JIT will not actually compile down to native code. This is a known issue with JRuby, but it is a difficult problem to solve, hence no solution... yet.

With that said, this only accounts for about half of the profile, but since everything is dispatched from this method, I am unsure what other issues arise from calling through a non-native method.
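(For anyone wanting to confirm this locally: JRuby can log what the JIT does. I believe running with something like the following will show the big generated method failing to compile, though the exact property names may vary across versions.)

```
jruby -Xjit.logging=true -Xjit.logging.verbose=true bench.rb
```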

enebo (Member) commented May 24, 2017

For fun, here is the un-JITed (-X-C) perf:

```
jruby -X-C ~/work/snippets/graph1.rb
Running with jruby 9.1.10.0-SNAPSHOT (2.3.3) 2017-05-23 154275c Java HotSpot(TM) 64-Bit Server VM 25.121-b13 on 1.8.0_121-b13 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    24.000  i/100ms
Calculating -------------------------------------
               Lexer    245.724  (± 5.3%) i/s -      7.368k in  30.076030s
```

This is vs about 255 i/s with JIT enabled on my machine.

enebo (Member) commented May 24, 2017

Just looking at how this generates, I had forgotten that there is a lambda capturing a couple of vars:

```ruby
        emit_token = ->(name) {
          emit(name, ts, te, meta)
        }
```

This is not free either.
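Roughly, avoiding it means replacing the closure dispatch with a direct call (a sketch, not the actual generated code):

```ruby
# Instead of dispatching through the captured closure:
emit_token.call(:INT)

# ...call the method directly, passing the previously-captured locals as
# arguments:
emit(:INT, ts, te, meta)
```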

enebo (Member) commented May 24, 2017

I manually removed begin/ends (we might be able to reduce the cost of these internally -- I will examine this) and also consolidated the line `p = ((te))-1`, and that upped the non-indy JIT to 803 i/s. So, a big improvement once we can get the main lexing method small enough code-wise.

I replaced emit_token with emit, which put the method back over the size limit, but then I replaced one case/when with just returning the name, so I had something like `name = case ... when 13 then :INT ...` (see the sketch below). This got it to JIT again, and removing the lambda brings me up to 825 i/s. This cost, I think, we can reduce more than we do today if we can implement hoisting, but that is not short-term work. Another possibility (also not super close) is inlining, but that would end up making this one big method again, so hoisting would work better.
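For clarity, the case/when change looked roughly like this (a reconstruction; `act` is a hypothetical condition variable, not the real generated name):

```ruby
# Before: each branch dispatches through the lambda...
case act
when 13 then emit_token.call(:INT)
when 14 then emit_token.call(:FLOAT)
end

# After: the case only computes the token name, and emission happens once:
name = case act
       when 13 then :INT
       when 14 then :FLOAT
       end
emit(name, ts, te, meta)
```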

enebo (Member) commented May 24, 2017

Crazy: I just mentioned removing excess begin/ends, but I see they already only cost us extra line-number instrs. I guess this method will have hundreds of those, though. Perhaps I should try to prune adjacent ones?

ivoanjo (Contributor, Author) commented May 25, 2017

Thanks @enebo for the great analysis!

I also spent the day digging a bit into this code. The lexer itself is written in the lexer.rl file, which is then compiled with Ragel into the lexer.rb file.

So most of the code is Ragel's output.

One interesting detail is that the compiled version of the lexer in the project looks like it was generated with Ragel 6, the "stable" Ragel version, whereas Ragel 7 has a few variations in the way it can output Ruby code.

Unfortunately, Ragel 7 has also decided to drop support for Ruby (amongst other languages), so to change it we'll need to fork version 7.0.0.9, the last version that supported Ruby.

The good news is that one of the code generation options (-F1, which they describe as "Flat table with expanded actions") generates code that is faster both on MRI:

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   209.000  i/100ms
Calculating -------------------------------------
               Lexer      2.102k (± 3.0%) i/s -     63.118k in  30.052143s
```

and on JRuby

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    88.000  i/100ms
Calculating -------------------------------------
               Lexer    910.242  (± 3.0%) i/s -     27.280k in  29.999337s
```

than the current version with the gem.

(For completeness, it's also the highest-performing version with invokedynamic, but it still doesn't compare to having indy off.)

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +indy +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    31.000  i/100ms
Calculating -------------------------------------
               Lexer    323.366  (± 4.3%) i/s -      9.703k in  30.070710s
```

Quick gist with this code: https://gist.github.com/ivoanjo/041cfdb4fb07457c40a12e77bc738b00
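(In case someone wants to reproduce: regenerating the lexer should look something like the command below. I'm going from memory on the exact flags, and Ragel 7's CLI may differ from Ragel 6's.)

```
ragel -R -F1 lexer.rl -o lexer.rb
```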

My plan is to validate this version a bit and then submit it as a PR to the GraphQL gem. Funnily enough, it actually has an even larger #run_lexer, weighing in at 1102 lines!

Other than that, you mentioned a few optimizations that could be done manually in the generated code. Do you think it's worth going in and changing this in Ragel (or even just doing it manually on its output), or do you believe it would be easier to just change JRuby instead?

enebo (Member) commented May 25, 2017

@ivoanjo I think the only one I see which is not great for us is the lambda. I think we can eventually improve that perf on our side, but indirecting through a closure is not a win for us (beyond the fact that it is less bytecode overall).

I did just commit b6254e843c3d5a8be8e256601f58c88e7f3bdd66, which did not have quite as much impact in this case but removed a number of unneeded line-number instrs. I can also see an extra copy instr per case, but I am not sure both changes are enough to reduce the size. Obviously, this is not a general solution, just shaving stuff down here and there.

I was wondering about Ragel output which indirected bodies through methods and set instance variables instead of local variables. That would likely be slower on MRI, but on JRuby it could be faster, since it would pretty much guarantee everything became native, and the most common paths would likely inline. I think we do lose some optimizations, like lvars becoming tempvars, but compiling to native is nice :) Just some grist for the mill. In truth, what we really need is a generic solution in our JIT generation itself (the most popular idea is to make each BasicBlock of code its own synthetic method, even though it means a level of extra indirection for temp vars).
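To make that idea concrete, a tiny hypothetical sketch (not real Ragel output):

```ruby
# State moves from locals into instance variables, so each action body can
# live in its own small method that the JIT can compile down to native code:
def action_int
  emit(:INT, @ts, @te, @meta)  # @ts/@te/@meta replace the ts/te/meta locals
end
```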

rmosolgo commented
I think I can refactor out that lambda. I had a branch for that once, but I didn't see a perf improvement on MRI, so I gave up. Let me give it a try over on that graphql-ruby PR!

ivoanjo (Contributor, Author) commented May 26, 2017

Thanks @enebo, it looks like you were right on the money 🎯.

By using Ragel 7 AND getting rid of the lambda, performance is now as expected: JRuby + indy > JRuby > MRI. I've opened a PR with those fixes: rmosolgo/graphql-ruby#737.

ivoanjo (Contributor, Author) commented May 29, 2017

The fix has been merged and released, so I believe the issue can be marked as closed.

@ivoanjo ivoanjo closed this as completed May 29, 2017
@enebo enebo added this to the Invalid or Duplicate milestone Jun 13, 2017
riffraff pushed a commit to riffraff/graphql-ruby that referenced this issue Jun 21, 2017

While investigating why query lexing was measurably slower in JRuby
than with MRI (see jruby/jruby#4623 ) I
noticed that GraphQL's lexer seemed to be generated with Ragel 6.

I then noticed that this version was rather old and that a newer
Ragel 7 seemed to have a lot of improvements on the Ruby output
backend, even adding four different output options for generated lexers:

* `-T0` Binary search (default)
* `-T1` Binary search with expanded actions
* `-F0` Flat table
* `-F1` Flat table with expanded actions

Testing all four of these combinations and comparing with the current
GraphQL lexer on both MRI and JRuby showed that for a lexing-only
benchmark (included below), Ragel 7's `-F1` improved MRI performance
by 87% and JRuby performance by 263%:

* MRI before:

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   111.000  i/100ms
Calculating -------------------------------------
               Lexer      1.124k (± 3.9%) i/s -     33.744k in  30.069500s
```

* MRI after:

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   209.000  i/100ms
Calculating -------------------------------------
               Lexer      2.102k (± 3.0%) i/s -     63.118k in  30.052143s
```

* JRuby before:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    24.000  i/100ms
Calculating -------------------------------------
               Lexer    250.435  (± 4.8%) i/s -      7.512k in  30.080584s
```

* JRuby after:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    88.000  i/100ms
Calculating -------------------------------------
               Lexer    910.242  (± 3.0%) i/s -     27.280k in  29.999337s
```

Unfortunately, this does not yet bring JRuby performance to MRI's
level, and I plan to continue looking at how to improve this.

An important note is that the latest Ragel 7 release drops support for
a lot of languages, unfortunately including Ruby (see
https://www.colm.net/news/2016/07/31/ragel-target-langs.html for the
announcement) so the last currently-available Ragel 7 version that
can be used is 7.0.0.9, which was the version I used.

Lexing benchmark used to obtain the numbers above:

```ruby
require 'graphql'
require 'benchmark/ips'

# https://githubengineering.com/the-github-graphql-api/

input_string = "{
  viewer {
    login
    starredRepositories {
      totalCount
    }
    repositories(first: 3) {
      edges {
        node {
          name
          stargazers {
            totalCount
          }
          forks {
            totalCount
          }
          watchers {
            totalCount
          }
          issues(states:[OPEN]) {
            totalCount
          }
        }
      }
    }
  }
}
"

puts "Running with #{RUBY_DESCRIPTION}"

Benchmark.ips do |benchmark|
  benchmark.time = 30
  benchmark.warmup = 30
  benchmark.report('Lexer') { GraphQL::Language::Lexer.tokenize(input_string) }
  benchmark.compare!
end
```
riffraff pushed a commit to riffraff/graphql-ruby that referenced this issue Jun 21, 2017
As @enebo suggested in jruby/jruby#4623 there
is a non-trivial performance impact of using the emit_token lambda
instead of directly calling the emit method during lexing.

By changing this to a normal method call, performance on MRI is mostly
unaffected, but it finally enables JRuby to optimize the lexing code,
putting it ahead of (and well beyond) MRI.

* MRI before (e.g. after previous commit):

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   209.000  i/100ms
Calculating -------------------------------------
               Lexer      2.102k (± 3.0%) i/s -     63.118k in  30.052143s
```

* MRI after (negligible impact):

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   211.000  i/100ms
Calculating -------------------------------------
               Lexer      2.121k (± 3.3%) i/s -     63.722k in  30.080212s
```

* JRuby before (e.g. after previous commit):

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    88.000  i/100ms
Calculating -------------------------------------
               Lexer    910.242  (± 3.0%) i/s -     27.280k in  29.999337s
```

* JRuby after:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer   372.000  i/100ms
Calculating -------------------------------------
               Lexer      4.078k (± 5.0%) i/s -    122.016k in  30.008956s
```

* JRuby after (with -Xcompile.invokedynamic=true):

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +indy +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer   800.000  i/100ms
Calculating -------------------------------------
               Lexer      9.649k (± 7.9%) i/s -    287.200k in  30.009197s
```

Finally, note that on the master branch, JRuby with
-Xcompile.invokedynamic=true was the worst performing option (clocking in
at about 141 i/s), so it went from being the slowest to being the
fastest by far, with a speedup of about 68x.