
Unexpected performance with GraphQL gem: JRuby indy < JRuby < MRI #4623

Closed

ivoanjo opened this issue May 24, 2017 · 10 comments

ivoanjo (Contributor) commented May 24, 2017

Environment

Rubies:

  • jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 [linux-x86_64]
  • ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]

Gems:

  • graphql 1.15.13

Linux:

  • Linux maruchan 4.11.0-041100-generic #201705041534 SMP Thu May 4 19:36:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Ubuntu 17.04

Expected Behavior

Performance when running the GraphQL gem's query lexer on JRuby is 1/5th of MRI's performance, and 1/10th when using JRuby with invokedynamic. It's possible that the code is suboptimal and there's just nothing that can be improved without changing it, but I found it especially weird that it's even slower with invokedynamic, so I decided to open this issue.

This has a big impact on our system, since every GraphQL web request goes through this app (I identified this issue because profiling showed that about half of request time with JRuby was spent lexing the GraphQL query).

I'm using an example GraphQL query from GitHub's documentation here, but the performance behavior is the same as with the queries I was actually using.

Benchmark:

```ruby
require 'graphql'
require 'benchmark/ips'

# https://githubengineering.com/the-github-graphql-api/

input_string = "{
  viewer {
    login
    starredRepositories {
      totalCount
    }
    repositories(first: 3) {
      edges {
        node {
          name
          stargazers {
            totalCount
          }
          forks {
            totalCount
          }
          watchers {
            totalCount
          }
          issues(states:[OPEN]) {
            totalCount
          }
        }
      }
    }
  }
}
"

puts "Running with #{RUBY_DESCRIPTION}"

Benchmark.ips do |benchmark|
  benchmark.time = 30
  benchmark.warmup = 30
  benchmark.report('Lexer') { GraphQL::Language::Lexer.tokenize(input_string) }
  benchmark.compare!
end
```
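For reference, the three configurations below can be run like so (assuming the script is saved as bench.rb; the invokedynamic flag is the same one shown in the +indy output below):

```
ruby bench.rb                                  # MRI
jruby bench.rb                                 # JRuby (default settings)
jruby -Xcompile.invokedynamic=true bench.rb    # JRuby with invokedynamic
```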

Actual Behavior

  • Performance on MRI:

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   111.000  i/100ms
Calculating -------------------------------------
               Lexer      1.124k (± 3.9%) i/s -     33.744k in  30.069500s
```

  • Performance on JRuby:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    24.000  i/100ms
Calculating -------------------------------------
               Lexer    250.435  (± 4.8%) i/s -      7.512k in  30.080584s
```

  • Performance on JRuby with invokedynamic:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +indy +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    13.000  i/100ms
Calculating -------------------------------------
               Lexer    141.906  (± 7.0%) i/s -      4.238k in  30.067898s
```
enebo (Member) commented May 24, 2017

At a quick glance, this suffers from the Ruby lexer generating a nearly 500-line Ruby method, which our JIT will not actually compile down to native code. This is a known issue with JRuby, but it is a difficult problem to solve, hence no solution... yet.

With that said, this only accounts for about half of the profile, but since everything is dispatched from this method, I am unsure what other issues arise from calling through a non-native method.
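(For anyone wanting to confirm this locally: JRuby can log what the JIT does. I believe running with something like the following will show the big generated method failing to compile, though the exact property names may vary across versions.)

```
jruby -Xjit.logging=true -Xjit.logging.verbose=true bench.rb
```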

enebo (Member) commented May 24, 2017

For fun, here is the un-JITed (-X-C) perf:

```
jruby -X-C ~/work/snippets/graph1.rb
Running with jruby 9.1.10.0-SNAPSHOT (2.3.3) 2017-05-23 154275c Java HotSpot(TM) 64-Bit Server VM 25.121-b13 on 1.8.0_121-b13 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    24.000  i/100ms
Calculating -------------------------------------
               Lexer    245.724  (± 5.3%) i/s -      7.368k in  30.076030s
```

This is vs about 255 i/s with JIT enabled on my machine.

enebo (Member) commented May 24, 2017

Just looking at how this generates, I had forgotten that there is a lambda capturing a couple of vars:

```ruby
        emit_token = ->(name) {
          emit(name, ts, te, meta)
        }
```

This is not free either.
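Roughly, avoiding it means replacing the closure dispatch with a direct call (a sketch, not the actual generated code):

```ruby
# Instead of dispatching through the captured closure:
emit_token.call(:INT)

# ...call the method directly, passing the previously-captured locals as
# arguments:
emit(:INT, ts, te, meta)
```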

enebo (Member) commented May 24, 2017

I manually removed begin/ends (we might be able to reduce the cost of these internally -- I will examine this) and also consolidated the line `p = ((te))-1`, and that upped the non-indy JIT to 803 i/s. So, a big improvement once we can get the main lexing method small enough code-wise.

I replaced emit_token with emit, which put the method back over the size limit, but then I replaced one case/when with just returning the name, so I had something like `name = case ... when 13 then :INT ...` (see the sketch below). This got it to JIT again, and removing the lambda brings me up to 825 i/s. This cost, I think, we can reduce more than we do today if we can implement hoisting, but that is not short-term work. Another possibility (also not super close) is inlining, but that would end up making this one big method again, so hoisting would work better.
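For clarity, the case/when change looked roughly like this (a reconstruction; `act` is a hypothetical condition variable, not the real generated name):

```ruby
# Before: each branch dispatches through the lambda...
case act
when 13 then emit_token.call(:INT)
when 14 then emit_token.call(:FLOAT)
end

# After: the case only computes the token name, and emission happens once:
name = case act
       when 13 then :INT
       when 14 then :FLOAT
       end
emit(name, ts, te, meta)
```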

enebo (Member) commented May 24, 2017

Crazy: I just mentioned removing excess begin/ends, but I see they already only cost us extra line-number instrs. I guess this method will have hundreds of those, though. Perhaps I should try to prune adjacent ones?

ivoanjo (Contributor, Author) commented May 25, 2017

Thanks @enebo for the great analysis!

I also spent the day digging a bit into this code. The lexer itself is written in the lexer.rl file, which is then compiled with Ragel into the lexer.rb file.

So most of the code is Ragel's output.

One interesting detail is that the compiled version of the lexer in the project looks like it was generated with Ragel 6, the "stable" Ragel version, whereas Ragel 7 has a few variations in the way it can output Ruby code.

Unfortunately, Ragel 7 has also decided to drop support for Ruby (amongst other languages), so to change it we'll need to fork version 7.0.0.9, the last version that supported Ruby.

The good news is that one of the code generation options (-F1, which they describe as "Flat table with expanded actions") generates code that is faster both on MRI:

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   209.000  i/100ms
Calculating -------------------------------------
               Lexer      2.102k (± 3.0%) i/s -     63.118k in  30.052143s
```

and on JRuby

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    88.000  i/100ms
Calculating -------------------------------------
               Lexer    910.242  (± 3.0%) i/s -     27.280k in  29.999337s
```

than the current version with the gem.

(For completeness, it's also the highest-performing version with invokedynamic, but it still doesn't compare to having indy off.)

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +indy +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    31.000  i/100ms
Calculating -------------------------------------
               Lexer    323.366  (± 4.3%) i/s -      9.703k in  30.070710s
```

Quick gist with this code: https://gist.github.com/ivoanjo/041cfdb4fb07457c40a12e77bc738b00
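(In case someone wants to reproduce: regenerating the lexer should look something like the command below. I'm going from memory on the exact flags, and Ragel 7's CLI may differ from Ragel 6's.)

```
ragel -R -F1 lexer.rl -o lexer.rb
```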

My plan is to validate this version a bit and then submit it as a PR to the GraphQL gem. Funnily enough, it actually has an even larger #run_lexer, weighing in at 1102 lines!

Other than that, you mentioned a few optimizations that could be done manually in the generated code. Do you think it's worth going in and changing this in Ragel (or even just doing it manually on its output), or do you believe it would be easier to just change JRuby instead?

enebo (Member) commented May 25, 2017

@ivoanjo I think the only one I see which is not great for us is the lambda. I think we can eventually improve that perf on our side, but indirecting through a closure is not a win for us (beyond the fact that it is less bytecode overall).

I did just commit b6254e843c3d5a8be8e256601f58c88e7f3bdd66, which did not have quite as much impact in this case but removed a number of unneeded line-number instrs. I can also see an extra copy instr per case, but I am not sure both changes are enough to reduce the size. Obviously, this is not a general solution, just shaving stuff down here and there.

I was wondering about Ragel output which indirected bodies through methods and set instance variables instead of local variables. That would likely be slower on MRI, but on JRuby it could be faster, since it would pretty much guarantee everything became native, and the most common paths would likely inline. I think we do lose some optimizations, like lvars becoming tempvars, but compiling to native is nice :) Just some grist for the mill. In truth, what we really need is a generic solution in our JIT generation itself (the most popular idea is to make each BasicBlock of code its own synthetic method, even though it means a level of extra indirection for temp vars).
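To make that idea concrete, a tiny hypothetical sketch (not real Ragel output):

```ruby
# State moves from locals into instance variables, so each action body can
# live in its own small method that the JIT can compile down to native code:
def action_int
  emit(:INT, @ts, @te, @meta)  # @ts/@te/@meta replace the ts/te/meta locals
end
```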

rmosolgo commented
I think I can refactor out that lambda. I had a branch for that once, but I didn't see a perf improvement on MRI, so I gave up. Let me give it a try over on that graphql-ruby PR!

ivoanjo (Contributor, Author) commented May 26, 2017

Thanks @enebo, it looks like you were right on the money 🎯.

By using Ragel 7 AND getting rid of the lambda, performance is now as expected: JRuby + indy > JRuby > MRI. I've opened a PR with those fixes: rmosolgo/graphql-ruby#737.

ivoanjo (Contributor, Author) commented May 29, 2017

The fix has been merged and released, so I believe the issue can be marked as closed.

@ivoanjo ivoanjo closed this as completed May 29, 2017
@enebo enebo added this to the Invalid or Duplicate milestone Jun 13, 2017
riffraff pushed a commit to riffraff/graphql-ruby that referenced this issue Jun 21, 2017

While investigating why query lexing was measurably slower in JRuby
than with MRI (see jruby/jruby#4623 ) I
noticed that GraphQL's lexer seemed to be generated with Ragel 6.

I then noticed that this version was rather old and that a newer
Ragel 7 seemed to have a lot of improvements on the Ruby output
backend, even adding four different output options for generated lexers:

* `-T0` Binary search (default)
* `-T1` Binary search with expanded actions
* `-F0` Flat table
* `-F1` Flat table with expanded actions

Testing all four of these combinations and comparing with the current
GraphQL lexer on both MRI and JRuby showed that for a lexing-only
benchmark (included below), Ragel 7's `-F1` improved MRI performance
by 87% and JRuby performance by 263%:

* MRI before:

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   111.000  i/100ms
Calculating -------------------------------------
               Lexer      1.124k (± 3.9%) i/s -     33.744k in  30.069500s
```

* MRI after:

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   209.000  i/100ms
Calculating -------------------------------------
               Lexer      2.102k (± 3.0%) i/s -     63.118k in  30.052143s
```

* JRuby before:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    24.000  i/100ms
Calculating -------------------------------------
               Lexer    250.435  (± 4.8%) i/s -      7.512k in  30.080584s
```

* JRuby after:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    88.000  i/100ms
Calculating -------------------------------------
               Lexer    910.242  (± 3.0%) i/s -     27.280k in  29.999337s
```

Unfortunately, this does not yet bring JRuby performance to MRI's
level, and I plan to continue looking at how to improve this.

An important note is that the latest Ragel 7 release drops support for
a lot of languages, unfortunately including Ruby (see
https://www.colm.net/news/2016/07/31/ragel-target-langs.html for the
announcement) so the last currently-available Ragel 7 version that
can be used is 7.0.0.9, which was the version I used.

Lexing benchmark used to obtain the numbers above:

```ruby
require 'graphql'
require 'benchmark/ips'

# https://githubengineering.com/the-github-graphql-api/

input_string = "{
  viewer {
    login
    starredRepositories {
      totalCount
    }
    repositories(first: 3) {
      edges {
        node {
          name
          stargazers {
            totalCount
          }
          forks {
            totalCount
          }
          watchers {
            totalCount
          }
          issues(states:[OPEN]) {
            totalCount
          }
        }
      }
    }
  }
}
"

puts "Running with #{RUBY_DESCRIPTION}"

Benchmark.ips do |benchmark|
  benchmark.time = 30
  benchmark.warmup = 30
  benchmark.report('Lexer') { GraphQL::Language::Lexer.tokenize(input_string) }
  benchmark.compare!
end
```
riffraff pushed a commit to riffraff/graphql-ruby that referenced this issue Jun 21, 2017
As @enebo suggested in jruby/jruby#4623 there
is a non-trivial performance impact of using the emit_token lambda
instead of directly calling the emit method during lexing.

By changing this to a normal method call, performance on MRI is mostly
unaffected, but it finally enables JRuby to optimize the lexing code,
putting it ahead of (and well beyond) MRI.

* MRI before (e.g. after previous commit):

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   209.000  i/100ms
Calculating -------------------------------------
               Lexer      2.102k (± 3.0%) i/s -     63.118k in  30.052143s
```

* MRI after (negligible impact):

```
Running with ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
Warming up --------------------------------------
               Lexer   211.000  i/100ms
Calculating -------------------------------------
               Lexer      2.121k (± 3.3%) i/s -     63.722k in  30.080212s
```

* JRuby before (e.g. after previous commit):

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer    88.000  i/100ms
Calculating -------------------------------------
               Lexer    910.242  (± 3.0%) i/s -     27.280k in  29.999337s
```

* JRuby after:

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer   372.000  i/100ms
Calculating -------------------------------------
               Lexer      4.078k (± 5.0%) i/s -    122.016k in  30.008956s
```

* JRuby after (with -Xcompile.invokedynamic=true):

```
Running with jruby 9.1.9.0 (2.3.3) 2017-05-15 28aa830 Java HotSpot(TM)
64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +indy +jit [linux-x86_64]
Warming up --------------------------------------
               Lexer   800.000  i/100ms
Calculating -------------------------------------
               Lexer      9.649k (± 7.9%) i/s -    287.200k in  30.009197s
```

Finally, note that on the master branch, JRuby with
-Xcompile.invokedynamic=true was the worst performing option (clocking in
at about 141 i/s), so it went from being the slowest to being the
fastest by far, with a speedup of about 68x.