JRuby thread stuck on IntHashMap.rehash #3394

banker · 2015-10-13T16:25:39Z

I have a multi-threaded application running on JRuby 9.0.1.0 (Java 1.8.0_45-b14). The application runs with 8 Ruby threads. One of those threads has been stuck on the IntHashMap.rehash method for the last 13 hours. The remaining threads are all functioning normally. Here is (most of) the stack trace for the thread in question as reported by jstack:

"Ruby-0-Thread-7: /data/apps/my_app/current/vendor/bundle/jruby/2.2.0/gems/my-gem-java/lib/java_executor_simple.rb:22" #21 daemon prio=5 os_prio=0 tid=0x00007f1f585bf800 nid=0x505e runnable [0x00007f1f0db99000]
   java.lang.Thread.State: RUNNABLE
        at org.jruby.util.collections.IntHashMap.rehash(IntHashMap.java:139)
        at org.jruby.util.collections.IntHashMap.put(IntHashMap.java:170)
        at org.jruby.java.dispatch.CallableSelector.matchingCallableArityOne(CallableSelector.java:118)
        at org.jruby.java.invokers.RubyToJavaInvoker.findCallableArityOne(RubyToJavaInvoker.java:282)
        at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:98)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:139)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:145)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.invokeOther31:time(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.RUBY$method$md?$0(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb:149)
        at java.lang.invoke.LambdaForm$DMH/885284298.invokeStatic_L7_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/2130192211.invokeExact_MT(LambdaForm$MH)
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:131)
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:181)
        at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:197)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:161)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.invokeOther3:md?(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.RUBY$block$process$2(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb:182)
        at java.lang.invoke.LambdaForm$DMH/885284298.invokeStatic_L7_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/2130192211.invokeExact_MT(LambdaForm$MH)
        at org.jruby.runtime.CompiledIRBlockBody.commonYieldPath(CompiledIRBlockBody.java:70)
        at org.jruby.runtime.IRBlockBody.yieldSpecific(IRBlockBody.java:67)
        at org.jruby.runtime.Block.yieldSpecific(Block.java:116)
        at org.jruby.ir.runtime.IRRuntimeHelpers.yieldSpecific(IRRuntimeHelpers.java:457)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.my_method.RUBY$method$cache$0(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/bdcc.rb:19)
        at java.lang.invoke.LambdaForm$DMH/664223387.invokeStatic_L8_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/684822005.invokeExact_MT(LambdaForm$MH)
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:146)
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:216)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:203)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:209)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.invokeOther8:cache(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.RUBY$block$process$1(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb:176)
        at java.lang.invoke.LambdaForm$DMH/885284298.invokeStatic_L7_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/2130192211.invokeExact_MT(LambdaForm$MH)
        at org.jruby.runtime.CompiledIRBlockBody.commonYieldPath(CompiledIRBlockBody.java:70)
        at org.jruby.runtime.IRBlockBody.doYield(IRBlockBody.java:140)
        at org.jruby.runtime.BlockBody.yield(BlockBody.java:77)

I cannot yet reliably reproduce this, but it does reliably occur once the app has been running for a few hours.

The text was updated successfully, but these errors were encountered:

headius · 2015-10-14T01:36:25Z

That's very strange. Nothing should block in that map, so it sounds like something has put it into an unstable state that loops forever. I'm not surprised you can't reproduce it.

Can you build JRuby? Perhaps you can stop by IRC tomorrow during the day US time and we'll help you look at it? It might be easiest to attach a debugger or console to see what that thread is actually doing.

headius · 2015-10-14T01:37:40Z

@kares I was just looking quickly, and I don't see any synchronization around our overload resolution logic in JI. Can you confirm?

kares · 2015-10-14T05:08:56Z

@headius no sync indeed ... adding the IntHashMap did some improvement - e.g. eliminating some Integer allocations. there was a FIXME around the previous ConcurrentHashMap used as a signature cache which now seems wasn't valid ... some logic to handle concurrent cache updates needs to be done 805ccb8#diff-1284db960bb926e3532f0aa650c24916L39 ... as the map is currently used keys are never replaced in the map, so maybe tweaking or synchronizing the rehash logic will mostly do ?

kares · 2015-10-14T05:14:33Z

simplest thing would be synchronized (cache) on these puts

headius · 2015-10-14T11:08:53Z

Without an atomic solution like ConcurrentHashMap, we will need to synchronize both puts and gets, since gets could see an unstable state in the data structure if a put is active.

headius · 2015-10-14T11:45:05Z

This needs to be resolved before 9.0.2.0.

The move to IntHashMap in RubyToJavaInvoker improved throughput for multiple-arity calls, but since IntHashMap is not synchronized this could cause corrupted data under concurrent load as seen in CallableSelector to encapsulate the cache and synchronize access to it. I believe this is a minimum effort to make the cache access safe and fix #3394. I have not measured the impact of this synchronization in a high concurrency setting. It may indeed turn out to be more overhead and a worse bottleneck than using ConcurrentHashMap with Integer objects. Unfortunately there's no middle ground here unless we build or adopt some concurrency-friendly int-based map. Another heavy option would be to maintain two references to IntHashMap in RubyToJavaInvoker, copying from a synchronized writable copy to a read-only copy after every write. This would double the size of the cache structures in memory but would provide a fast, unsynchronized cache for read-mostly signature lookups. This pattern is also easier to support with the refactoring in this commit.

headius · 2015-10-14T12:05:31Z

@banker If you have the ability to do so, it would be a great help if you could test the code in PR #3395 and possibly also the additional patch I provided in a comment there.

See BUILDING.md for building your own JRuby or poke us on IRC today.

Part of fixes for #3394.

headius · 2015-10-14T14:52:44Z

@banker The fixed IntHashMap usage has been pushed to master and will be in 9.0.2.

banker · 2015-10-14T16:22:11Z

Thanks for looking into this, @headius. I'll build from master and re-run in my environment. Since the bug has taken up to 12 hours to appear, it will take some time for me to verify that the issue is resolved. I'll report back by Friday.

banker · 2015-10-15T16:07:24Z

Bad news, @headius. I ran one instance of my app overnight using the new code, and it's stuck again, but differently.

All 8 of my threads are BLOCKED on a call into a Java metrics library. All of the thread stacks look like this:

"Ruby-0-Thread-4: /data/apps/my_app/current/vendor/bundle/jruby/2.2.0/gems/mygem/lib/mylib/java_executor_simple.rb:22" #1
8 daemon prio=5 os_prio=0 tid=0x00007f3fa5333000 nid=0x3fa9 waiting for monitor entry [0x00007f3f7bdf9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.jruby.java.invokers.RubyToJavaInvoker.putSignature(RubyToJavaInvoker.java:203)
        - waiting to lock <0x00000005c1ae89d0> (a org.jruby.util.collections.IntHashMap)
        at org.jruby.java.dispatch.CallableSelector.matchingCallableArityOne(CallableSelector.java:66)
        at org.jruby.java.invokers.RubyToJavaInvoker.findCallableArityOne(RubyToJavaInvoker.java:348)
        at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:98)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:139)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:145)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my.app.application.invokeOther31:time(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/mylib/application.rb)

Let me know if I can provide any further information, and thanks again for such a quick response for far!

headius · 2015-10-15T19:36:40Z

@banker I think this is the older code that had a lock involved, because the newest version does not have a lock (object monitor). Can you show me the -v for that JRuby?

headius · 2015-10-15T19:41:57Z

Ahh actually I'm sorry...the new code does have a lock for writes, as in putSignature. But I'm very confused now why it would be hanging, since it's only a single lock, no other locks are involved, and reads should have no locking of their own.

banker · 2015-10-15T20:21:08Z

Right. The only thing I can add is that, anecdotally, this new code is much more likely to trigger this deadlock.

headius · 2015-10-15T20:44:00Z

@banker Harumph. Very confusing. It sounds like there may simply be a bug in our IntHashMap at this point.

headius · 2015-10-15T20:48:07Z

@banker Can you gist a full stack dump of your VM in this bad state? I would expect to see another thread inside IntHashMap that has acquired the lock. My theory at the moment is that IntHashMap just has a bug that causes it to endlessly loop, and the object monitor trace is just a side effect.

thedarkone · 2015-10-15T21:15:51Z

Yeah, I also don't see how all threads could be stuck waiting for that lock.

banker · 2015-10-15T21:24:24Z

@headius Okay, it turns out that one thread was runnable, which I've included below. Can I email you the gist? Not completely comfortable making it public.

"Ruby-0-Thread-9: /data/apps/my_app/current/vendor/bundle/jruby/2.2.0/gems/my-gem/lib/mylibjava_executor_simple.rb:22" #23 daemon prio=5 os_prio=0 tid=0x00007
f3fa533c000 nid=0x3fae runnable [0x00007f3f7b3f4000]
   java.lang.Thread.State: RUNNABLE
        at org.jruby.util.collections.IntHashMap.put(IntHashMap.java:160)
        at org.jruby.util.collections.IntHashMap.clone(IntHashMap.java:406)
        at org.jruby.java.invokers.RubyToJavaInvoker.putSignature(RubyToJavaInvoker.java:205)
        - locked <0x00000005c1ae89d0> (a org.jruby.util.collections.IntHashMap)
        at org.jruby.java.dispatch.CallableSelector.matchingCallableArityOne(CallableSelector.java:66)
        at org.jruby.java.invokers.RubyToJavaInvoker.findCallableArityOne(RubyToJavaInvoker.java:348)
        at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:98)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:139)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:145)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.foo.myapp.application.invokeOther31:time(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/mylibmyapp/application.rb)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.foo.myapp.application.RUBY$method$duplicate_content?$0(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/mylibmyapp/application.rb:149)

headius · 2015-10-16T02:27:06Z

@banker Yeah go ahead, headius@headius.com. I'm also pushing a change to master to use a high-throughput, concurrency-friendly map implementation rather than using our own. It should be in nightly builds tomorrow morning.

headius · 2015-10-16T02:31:46Z

After a brief search (and some lazy Twittering) I was pointed at https://github.com/boundary/high-scale-lib, a maintained, maven-released fork of Cliff Click's high-throughput map implementations. They are fast, memory-efficient, concurrency-friendly, and non-blocking. The best part is that there's an implementation that takes a long key, so we can basically drop it in wherever we use IntHashMap.

Here's perf results.

Before (COW IntHashMap):

                                                    user     system      total        real
Control                                         0.590000   0.000000   0.590000 (  0.341363)
Math.abs with byte-ranged Fixnum                1.630000   0.010000   1.640000 (  1.379069)
Math.abs with short-ranged Fixnum               1.730000   0.080000   1.810000 (  1.533326)
Math.abs with int-ranged Fixnum                 1.700000   0.020000   1.720000 (  1.479799)
Math.abs with long-ranged Fixnum                1.660000   0.000000   1.660000 (  1.507284)
Math.abs with float-ranged Float                1.900000   0.020000   1.920000 (  1.671365)
Math.abs with double-ranged Float               1.650000   0.010000   1.660000 (  1.609894)
Contended Math.abs with double-ranged Float     3.950000   0.030000   3.980000 (  0.539143)
Control                                         0.400000   0.000000   0.400000 (  0.389647)
Math.abs with byte-ranged Fixnum                1.570000   0.000000   1.570000 (  1.560834)
Math.abs with short-ranged Fixnum               1.650000   0.010000   1.660000 (  1.617118)
Math.abs with int-ranged Fixnum                 1.660000   0.000000   1.660000 (  1.644870)
Math.abs with long-ranged Fixnum                1.680000   0.010000   1.690000 (  1.659800)
Math.abs with float-ranged Float                1.700000   0.010000   1.710000 (  1.675863)
Math.abs with double-ranged Float               1.690000   0.000000   1.690000 (  1.662213)
Contended Math.abs with double-ranged Float     3.120000   0.020000   3.140000 (  0.444798)

After (NonBlockingHashMapLong

                                                    user     system      total        real
Control                                         0.930000   0.020000   0.950000 (  0.416645)
Math.abs with byte-ranged Fixnum                1.570000   0.010000   1.580000 (  1.350059)
Math.abs with short-ranged Fixnum               1.710000   0.080000   1.790000 (  1.483133)
Math.abs with int-ranged Fixnum                 1.640000   0.020000   1.660000 (  1.422112)
Math.abs with long-ranged Fixnum                1.620000   0.010000   1.630000 (  1.445246)
Math.abs with float-ranged Float                1.950000   0.020000   1.970000 (  1.646687)
Math.abs with double-ranged Float               1.680000   0.010000   1.690000 (  1.591855)
Contended Math.abs with double-ranged Float     3.730000   0.030000   3.760000 (  0.511322)
Control                                         0.400000   0.000000   0.400000 (  0.402811)
Math.abs with byte-ranged Fixnum                1.530000   0.000000   1.530000 (  1.531421)
Math.abs with short-ranged Fixnum               1.620000   0.010000   1.630000 (  1.608175)
Math.abs with int-ranged Fixnum                 1.590000   0.010000   1.600000 (  1.574305)
Math.abs with long-ranged Fixnum                1.540000   0.000000   1.540000 (  1.518596)
Math.abs with float-ranged Float                1.640000   0.010000   1.650000 (  1.617539)
Math.abs with double-ranged Float               1.630000   0.010000   1.640000 (  1.614268)
Contended Math.abs with double-ranged Float     2.920000   0.010000   2.930000 (  0.416824)

Performance is essentially the same. I'm going to push the change, since the library only adds about 100kB to JRuby.

headius · 2015-10-16T02:33:53Z

Reopening one last time. If the new collection works, then it would indicate there's a bug in IntHashMap. In that case, we should also reimpl IntHashMap in terms of NonBlockingHashMapLong in both JRuby 1.7.x and 9k.

thedarkone · 2015-10-16T03:42:41Z

@headius just a note (in case you are unaware of it and end up using Cliff's hash map elsewhere in JRuby), it leaks deleted keys (only keys not values). Obviously, this might not matter depending on use case (no deletions) or, as is the case with NonBlockingHashMapLong, the keys are primitive longs/ints.

headius · 2015-10-16T06:53:16Z

@thedarkone That's good information, thank you. I think we'll probably still be using CHM for our Object:Object maps but this will be a good replacement for IntHashMap, which obviously seems to have some issues.

headius · 2015-10-20T13:46:23Z

@banker Hopefully this has fixed your issues. This change will go out in the release today. I'm going to optimistically close this.

banker · 2015-10-20T19:09:59Z

@headius I'll build again, run, and get back to you. I hope it's fixed, too! Currently just restarting the app on a cron to preemptively avoid :-\

kares · 2015-10-30T10:54:08Z

@banker or anyone ... could you confirm its behaving properly - considering back-porting to jruby-1_7 if so

This should hopefully fix jruby#3394. Conflicts: core/pom.rb core/src/main/java/org/jruby/java/invokers/RubyToJavaInvoker.java

…avaInvoker-s ... introducing an internal CallableCache interface to loose coupling in CallableSelector aligned with main fix for jruby#3394

…avaInvoker-s ... introducing an internal CallableCache interface to loose coupling in CallableSelector aligned with main fix for #3394

This should hopefully fix jruby#3394. Conflicts: core/pom.rb core/src/main/java/org/jruby/java/invokers/RubyToJavaInvoker.java

…avaInvoker-s ... introducing an internal CallableCache interface to loose coupling in CallableSelector aligned with main fix for jruby#3394 this commit also includes pieces of da1b849 and so Conflicts: core/src/main/java/org/jruby/java/dispatch/CallableSelector.java core/src/main/java/org/jruby/java/invokers/RubyToJavaInvoker.java core/src/main/java/org/jruby/javasupport/Java.java

headius · 2015-11-24T17:11:51Z

Also released in 1.7.23.

kares added java integration JRuby 1.7.x JRuby 9000 labels Oct 14, 2015

headius added this to the JRuby 9.0.2.0 milestone Oct 14, 2015

headius mentioned this issue Oct 14, 2015

Refactor Java invoker cache logic to allow encapsulation, sync. #3395

Merged

headius added a commit that referenced this issue Oct 14, 2015

Use copy-on-write IntHashMaps to avoid sync cost on lookup.

f28fd8b

Part of fixes for #3394.

headius closed this as completed in #3395 Oct 14, 2015

enebo reopened this Oct 15, 2015

headius closed this as completed in 22be3c6 Oct 16, 2015

headius reopened this Oct 16, 2015

headius closed this as completed Oct 20, 2015

kares pushed a commit to kares/jruby that referenced this issue Oct 30, 2015

Use Cliff Click's high-scale collections instead of IntHashMap.

cf08f6f

This should hopefully fix jruby#3394. Conflicts: core/pom.rb core/src/main/java/org/jruby/java/invokers/RubyToJavaInvoker.java

kares added a commit that referenced this issue Nov 2, 2015

refactor (internal) Java.__jcreate! impl to use same cache as RubyToJ…

da1b849

…avaInvoker-s ... introducing an internal CallableCache interface to loose coupling in CallableSelector aligned with main fix for #3394

kares pushed a commit to kares/jruby that referenced this issue Nov 11, 2015

Use Cliff Click's high-scale collections instead of IntHashMap.

317a4f3

This should hopefully fix jruby#3394. Conflicts: core/pom.rb core/src/main/java/org/jruby/java/invokers/RubyToJavaInvoker.java

kares mentioned this issue Dec 22, 2015

port NonBlockingHashMapLong from high-scale-lib #3560

Merged

banker mentioned this issue Jan 22, 2016

Threads stuck on #3612

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

JRuby thread stuck on IntHashMap.rehash #3394

JRuby thread stuck on IntHashMap.rehash #3394

banker commented Oct 13, 2015

headius commented Oct 14, 2015

headius commented Oct 14, 2015

kares commented Oct 14, 2015

kares commented Oct 14, 2015

headius commented Oct 14, 2015

headius commented Oct 14, 2015

headius commented Oct 14, 2015

headius commented Oct 14, 2015

banker commented Oct 14, 2015

banker commented Oct 15, 2015

headius commented Oct 15, 2015

headius commented Oct 15, 2015

banker commented Oct 15, 2015

headius commented Oct 15, 2015

headius commented Oct 15, 2015

thedarkone commented Oct 15, 2015

banker commented Oct 15, 2015

headius commented Oct 16, 2015

headius commented Oct 16, 2015

headius commented Oct 16, 2015

thedarkone commented Oct 16, 2015

headius commented Oct 16, 2015

headius commented Oct 20, 2015

banker commented Oct 20, 2015

kares commented Oct 30, 2015

headius commented Nov 24, 2015

JRuby thread stuck on IntHashMap.rehash #3394

JRuby thread stuck on IntHashMap.rehash #3394

Comments

banker commented Oct 13, 2015

headius commented Oct 14, 2015

headius commented Oct 14, 2015

kares commented Oct 14, 2015

kares commented Oct 14, 2015

headius commented Oct 14, 2015

headius commented Oct 14, 2015

headius commented Oct 14, 2015

headius commented Oct 14, 2015

banker commented Oct 14, 2015

banker commented Oct 15, 2015

headius commented Oct 15, 2015

headius commented Oct 15, 2015

banker commented Oct 15, 2015

headius commented Oct 15, 2015

headius commented Oct 15, 2015

thedarkone commented Oct 15, 2015

banker commented Oct 15, 2015

headius commented Oct 16, 2015

headius commented Oct 16, 2015

headius commented Oct 16, 2015

thedarkone commented Oct 16, 2015

headius commented Oct 16, 2015

headius commented Oct 20, 2015

banker commented Oct 20, 2015

kares commented Oct 30, 2015

headius commented Nov 24, 2015