Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prime? very slow #3649

Closed
giuan opened this issue Feb 6, 2016 · 11 comments
Closed

prime? very slow #3649

giuan opened this issue Feb 6, 2016 · 11 comments

Comments

@giuan
Copy link

giuan commented Feb 6, 2016

The method prime? is 20x slower then MRI.
But a method implemented by myself give the same time as MRI prime? in MRI and jruby.
I think the problem is in the library prime.
Try

require 'prime'
require 'benchmark'
n = 1090109110921093
Benchmark.bm do |r|
  4.times do
    r.report {puts n.prime?}
  end
end

Tiziano Merzi

P.S.
jruby 9.0.5.0 on MAC

@headius
Copy link
Member

headius commented Feb 6, 2016

Wow. Ok, I'll look into this. Definitely something wrong.

@headius
Copy link
Member

headius commented Feb 6, 2016

It looks like this is never getting out of the (simple) interpreter:

     Compiled + native   Method                        
 28.9%  1649  +     8    org.jruby.ir.interpreter.StartupInterpreterEngine.interpret
 25.5%  1460  +     0    org.jruby.ir.interpreter.StartupInterpreterEngine.processOtherOp
 10.8%   620  +     0    org.jruby.ir.IRScope.getFileName
  6.8%   390  +     0    org.jruby.runtime.BlockBody.yield
  6.1%   351  +     0    org.jruby.runtime.callsite.CachingCallSite.call
  6.1%   348  +     0    org.jruby.ir.interpreter.InterpreterEngine.processCall
  4.4%   251  +     0    org.jruby.ir.instructions.BFalseInstr.interpretAndGetNewIPC
  3.9%   224  +     0    java.lang.invoke.LambdaForm$DMH.invokeStatic_L6_L
  1.4%    77  +     1    org.jruby.ir.operands.LocalVariable.retrieve
  1.2%    67  +     1    org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath
  0.6%    36  +     0    org.jruby.RubyKernel.loop

@headius
Copy link
Member

headius commented Feb 6, 2016

It appears that at least part of this bad benchmark result is due to the fact that most of the hot call path is not compiling before the benchmark completes. If I force JIT to happen sooner, the numbers improve (but still don't beat MRI).

It's not clear to me yet which methods are getting called.

@headius
Copy link
Member

headius commented Feb 6, 2016

Ok, it looks like Ruby 2.3 improved the performance of prime.rb. JRuby 9.0.5.0 roughly matches MRI 2.2 performance, and JRuby with 2.3 stdlib roughly matches MRI 2.3 performance:

Ruby 2.2 stdlib:

[] ~/projects/jruby $ jruby -Xjit.threshold=0 -Xcompile.invokedynamic bench_prime.rb 
       user     system      total        real
 true
  3.170000   0.110000   3.280000 (  2.220189)
 true
  2.520000   0.030000   2.550000 (  2.033816)
 true
  2.100000   0.010000   2.110000 (  1.926965)
 true
  1.930000   0.010000   1.940000 (  1.879152)

[] ~/projects/jruby $ ruby22 bench_prime.rb 
       user     system      total        real
 true
  3.530000   0.010000   3.540000 (  3.533079)
 true
  3.530000   0.000000   3.530000 (  3.530615)
 true
  3.590000   0.010000   3.600000 (  3.596020)
 true
  3.430000   0.000000   3.430000 (  3.436481)

Ruby 2.3 stdlib

[] ~/projects/jruby $ jruby -I ../ruby/lib -Xjit.threshold=0 -Xcompile.invokedynamic bench_prime.rb 
       user     system      total        real
 true
  1.910000   0.100000   2.010000 (  0.934381)
 true
  0.700000   0.010000   0.710000 (  0.631105)
 true
  1.030000   0.010000   1.040000 (  0.748262)
 true
  0.630000   0.010000   0.640000 (  0.610905)

[] ~/projects/jruby $ ruby23 bench_prime.rb 
       user     system      total        real
 true
  0.640000   0.000000   0.640000 (  0.637603)
 true
  0.630000   0.000000   0.630000 (  0.629570)
 true
  0.640000   0.000000   0.640000 (  0.637175)
 true
  0.630000   0.000000   0.630000 (  0.636287)

@giuan
Copy link
Author

giuan commented Feb 6, 2016

This is the link
https://gist.github.com/giuan/f2176226e253d8620e0a
In jruby prime? is 20x slower then my method

@headius
Copy link
Member

headius commented Feb 6, 2016

Ok, so other than the 2.2/2.3 stdlib issue, the remaining problem here is that a single long-running method does not trigger JIT.

We don't have a good way to do on-stack replacement, so the best we could do would be to trigger JIT of a method that ran longer than some expectation.

Below is a patch for this, but it's not really accurate because it doesn't exclude child calls.

diff --git a/core/src/main/java/org/jruby/internal/runtime/methods/MixedModeIRMethod.java b/core/src/main/java/org/jruby/internal/runtime/methods/MixedModeIRMethod.java
index 2cdeaf0..7aeee01 100644
--- a/core/src/main/java/org/jruby/internal/runtime/methods/MixedModeIRMethod.java
+++ b/core/src/main/java/org/jruby/internal/runtime/methods/MixedModeIRMethod.java
@@ -97,7 +97,14 @@ public class MixedModeIRMethod extends AbstractIRMethod implements Compilable<Dy
         if (jittedMethod != null) {
             return jittedMethod.call(context, self, clazz, name, args, block);
         } else {
-            return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, args, block);
+            long start = System.currentTimeMillis();
+            try {
+                return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, args, block);
+            } finally {
+                if ((System.currentTimeMillis() - start) > 100) {
+                    context.runtime.getJITCompiler().buildThresholdReached(context, this);
+                }
+            }
         }
     }

@@ -132,7 +139,14 @@ public class MixedModeIRMethod extends AbstractIRMethod implements Compilable<Dy
         if (jittedMethod != null) {
             return jittedMethod.call(context, self, clazz, name, block);
         } else {
-            return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, block);
+            long start = System.currentTimeMillis();
+            try {
+                return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, block);
+            } finally {
+                if ((System.currentTimeMillis() - start) > 100) {
+                    context.runtime.getJITCompiler().buildThresholdReached(context, this);
+                }
+            }
         }
     }

@@ -167,7 +181,14 @@ public class MixedModeIRMethod extends AbstractIRMethod implements Compilable<Dy
         if (jittedMethod != null) {
             return jittedMethod.call(context, self, clazz, name, arg0, block);
         } else {
-            return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, arg0, block);
+            long start = System.currentTimeMillis();
+            try {
+                return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, arg0, block);
+            } finally {
+                if ((System.currentTimeMillis() - start) > 100) {
+                    context.runtime.getJITCompiler().buildThresholdReached(context, this);
+                }
+            }
         }
     }

@@ -202,7 +223,14 @@ public class MixedModeIRMethod extends AbstractIRMethod implements Compilable<Dy
         if (jittedMethod != null) {
             return jittedMethod.call(context, self, clazz, name, arg0, arg1, block);
         } else {
-            return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, arg0, arg1, block);
+            long start = System.currentTimeMillis();
+            try {
+                return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, arg0, arg1, block);
+            } finally {
+                if ((System.currentTimeMillis() - start) > 100) {
+                    context.runtime.getJITCompiler().buildThresholdReached(context, this);
+                }
+            }
         }
     }

@@ -237,7 +265,14 @@ public class MixedModeIRMethod extends AbstractIRMethod implements Compilable<Dy
         if (jittedMethod != null) {
             return jittedMethod.call(context, self, clazz, name, arg0, arg1, arg2, block);
         } else {
-            return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, arg0, arg1, arg2, block);
+            long start = System.currentTimeMillis();
+            try {
+                return INTERPRET_METHOD(context, ensureInstrsReady(), getImplementationClass().getMethodLocation(), self, name, arg0, arg1, arg2, block);
+            } finally {
+                if ((System.currentTimeMillis() - start) > 100) {
+                    context.runtime.getJITCompiler().buildThresholdReached(context, this);
+                }
+            }
         }
     }

I'm not sure what priority to give this.

@giuan
Copy link
Author

giuan commented Feb 6, 2016

But if try https://gist.github.com/giuan/f2176226e253d8620e0a
my method is compiled. I have 0.9 sec. for MRI 2.3 and 1.1 for jruby

@headius
Copy link
Member

headius commented Feb 7, 2016

0.9s versus 1.1s is not bad. And for me, the difference is greater. JRuby wins for both your prime? implementation and for the stdlib version (I set JIT threshold to 0 here to force it to compile right away).

[] ~/projects/jruby $ jruby -Xcompile.invokedynamic -Xjit.threshold=0 -I ../ruby/lib bench_prime2.rb 
       user     system      total        real
 true
  1.560000   0.090000   1.650000 (  0.861703)
 true
  1.110000   0.020000   1.130000 (  0.754386)
 true
  0.700000   0.010000   0.710000 (  0.660860)
 true
  0.960000   0.010000   0.970000 (  0.738980)
 true
  0.780000   0.010000   0.790000 (  0.650650)
 true
  0.530000   0.010000   0.540000 (  0.426186)
 true
  0.650000   0.000000   0.650000 (  0.618836)
 true
  0.420000   0.010000   0.430000 (  0.395621)
 true
  0.640000   0.010000   0.650000 (  0.608281)
 true
  0.420000   0.000000   0.420000 (  0.392981)
 true
  0.650000   0.010000   0.660000 (  0.617567)
 true
  0.430000   0.010000   0.440000 (  0.401770)
 true
  0.660000   0.000000   0.660000 (  0.640085)
 true
  0.430000   0.010000   0.440000 (  0.397853)
 true
  0.750000   0.010000   0.760000 (  0.607031)
 true
  0.450000   0.010000   0.460000 (  0.432919)
 true
  0.670000   0.000000   0.670000 (  0.637849)
 true
  0.410000   0.010000   0.420000 (  0.381010)
 true
  0.620000   0.000000   0.620000 (  0.596011)
 true
  0.500000   0.010000   0.510000 (  0.407787)

[] ~/projects/jruby $ ruby23 -I ../ruby/lib bench_prime2.rb 
       user     system      total        real
 true
  0.730000   0.000000   0.730000 (  0.733017)
 true
  0.710000   0.000000   0.710000 (  0.706568)
 true
  0.690000   0.000000   0.690000 (  0.691615)
 true
  0.680000   0.010000   0.690000 (  0.682329)
 true
  0.620000   0.000000   0.620000 (  0.619963)
 true
  0.670000   0.000000   0.670000 (  0.666560)
 true
  0.650000   0.000000   0.650000 (  0.652076)
 true
  0.680000   0.000000   0.680000 (  0.682871)
 true
  0.640000   0.000000   0.640000 (  0.640788)
 true
  0.700000   0.000000   0.700000 (  0.702633)
 true
  0.740000   0.000000   0.740000 (  0.741795)
 true
  0.700000   0.000000   0.700000 (  0.700159)
 true
  0.680000   0.000000   0.680000 (  0.684462)
 true
  0.730000   0.010000   0.740000 (  0.727324)
 true
  0.670000   0.000000   0.670000 (  0.676201)
 true
  0.710000   0.000000   0.710000 (  0.709593)
 true
  0.680000   0.000000   0.680000 (  0.681053)
 true
  0.720000   0.000000   0.720000 (  0.719188)
 true
  0.670000   0.000000   0.670000 (  0.677244)
 true
  0.720000   0.000000   0.720000 (  0.719593)

@headius
Copy link
Member

headius commented Feb 8, 2016

@enebo @subbuss Any thoughts on triggering JIT sooner than 50 calls when we have a hot loop? I know this falls into the "let's add OSR because benchmarks look better" world, but I wonder if we can at least accelerate the JIT promotion in these cases.

@enebo
Copy link
Member

enebo commented Feb 8, 2016

Well nothing in inlining branch changes when/how we JIT methods (read till end :) ), but we do keep track of checking for hot callsites via thread_polling stats. This could also be used for methods which trigger the most thread_polls. Will that be faster than 50 calls? I don't know, but it seems reasonable to me. fwiw, when inlining+profiling is enabled if we detect a site in that loopy method that needs to be inlined we actually compile the surrounding method. In that sense, how profiling works now might end up compiling earlier...assuming there is a hot ruby-only method in the loop.

@headius
Copy link
Member

headius commented Mar 8, 2016

Ok, I did find a regression for #3626 that prevented blocks from jitting. This was the source of the performance drop in #3626 going from 9.0.4.0 to 9.0.5.0 and it is now fixed. I believe the benchmarks here are back where they should be.

@headius headius closed this as completed Mar 8, 2016
@headius headius added this to the JRuby 9.1.0.0 milestone Mar 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants