
Threads stuck on #3612

Closed · banker opened this issue Jan 22, 2016 · 18 comments

@banker commented Jan 22, 2016

I have a multi-threaded Ruby app running on jruby 9.0.4.0 (2.2.2) 2015-11-12 b9fb7aa Java HotSpot(TM) 64-Bit Server VM 25.45-b02 on 1.8.0_45-b14 +jit [linux-amd64].

This is the same app referenced in #3394.

We are still seeing "stuck" threads, although much less frequently than before (the issue occurs perhaps once a week, sometimes less often). Although the threads are RUNNABLE, they are not making progress. Here is the relevant portion of the jstack trace:

```
"Ruby-0-Thread-10: #24 daemon prio=5 os_prio=0 tid=0x00007fe6b0578800 nid=0x3be0 runnable [0x00007fe68bdf8000]
   java.lang.Thread.State: RUNNABLE
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.help_copy_impl(NonBlockingHashMapLong.java:843)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.access$300(NonBlockingHashMapLong.java:402)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong.help_copy(NonBlockingHashMapLong.java:396)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong.access$1100(NonBlockingHashMapLong.java:79)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.copy_slot_and_check(NonBlockingHashMapLong.java:868)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.get_impl(NonBlockingHashMapLong.java:534)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.access$200(NonBlockingHashMapLong.java:402)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong.get(NonBlockingHashMapLong.java:360)
  at org.jruby.java.invokers.RubyToJavaInvoker.getSignature(RubyToJavaInvoker.java:205)
  at org.jruby.java.dispatch.CallableSelector.matchingCallableArityOne(CallableSelector.java:103)
  at org.jruby.java.invokers.RubyToJavaInvoker.findCallableArityOne(RubyToJavaInvoker.java:356)
  at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:98)
  at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:139)
  at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:145)
```

When this occurs, several threads are stuck on the help_copy_impl method.

@kares (Member) commented Jan 22, 2016

heh - seems like high_scale lib is not 100% non-blocking in the end ... 😞

@thedarkone (Contributor) commented:

> heh - seems like high_scale lib is not 100% non-blocking in the end ... 😞

Haha 😈

A bug in Cliff Click's code?

@kares (Member) commented Jan 24, 2016

seems as if concurrent get calls (get_impl runs a while-true loop) on CHM can get into a state where 2 threads cannot finish their work.
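For context, the "help copy" the stack trace keeps landing in is a cooperative-resize pattern: every thread that touches the table during a resize copies some slots from the old array to the new one, so no single thread has to finish the copy alone. Below is a toy sketch of that idea only; it is NOT the actual high-scale-lib code, and the class name, the EMPTY sentinel, and the same-index slot placement are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Toy illustration of cooperative slot copying (not high-scale-lib itself).
public class HelpCopySketch {
    static final long EMPTY = 0;            // assumption: 0 marks an empty/claimed slot
    final AtomicLongArray oldTab;
    final AtomicLongArray newTab;

    HelpCopySketch(long[] contents) {
        oldTab = new AtomicLongArray(contents);
        newTab = new AtomicLongArray(contents.length * 2);
    }

    // Copy one slot: CAS the old slot to EMPTY so exactly one thread wins,
    // then publish the value into the new table.
    void copySlot(int i) {
        long v = oldTab.get(i);
        if (v != EMPTY && oldTab.compareAndSet(i, v, EMPTY)) {
            newTab.set(i, v);               // toy placement: same index
        }
    }

    // Every caller "helps" by sweeping the old table; duplicated work is
    // harmless because the per-slot CAS lets only one copier succeed.
    void helpCopy() {
        for (int i = 0; i < oldTab.length(); i++) copySlot(i);
    }

    public static void main(String[] args) throws InterruptedException {
        HelpCopySketch t = new HelpCopySketch(new long[]{1, 2, 3, 4});
        Thread a = new Thread(t::helpCopy);
        Thread b = new Thread(t::helpCopy);
        a.start(); b.start(); a.join(); b.join();
        for (int i = 0; i < 4; i++)
            if (t.newTab.get(i) != i + 1) throw new AssertionError("slot " + i);
        System.out.println("copy complete");
    }
}
```

The real algorithm is considerably more subtle (tombstones, prime-boxing, re-probing), which is exactly where a livelock between two helpers could hide.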

line numbers from the above stack do not match 1.0.6 (at least the version on GH) e.g. https://github.com/boundary/high-scale-lib/blob/high-scale-lib-1.0.6/src/main/java/org/cliffc/high_scale_lib/NonBlockingHashMapLong.java#L843 or https://github.com/boundary/high-scale-lib/blob/high-scale-lib-1.0.6/src/main/java/org/cliffc/high_scale_lib/NonBlockingHashMapLong.java#L534 .. confusing :(

@banker the next JRuby version (9.0.5) should have NonBlockingHashMapLong (unless it's ditched) coming from org.jruby, so the line numbers should be better.

> Although the threads are RUNNABLE, they are not making progress.

so do they never stop executing, or do they stop after a longer period of time?

@thedarkone (Contributor) commented:

@banker are you sure those threads are really stuck? With org.cliffc.high_scale_lib.NonBlockingHashMapLong, "stuck" would mean endlessly spinning, chewing 100% CPU; they wouldn't be stuck waiting on a lock or anything like that.
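One way to confirm "spinning, not blocked" without eyeballing jstack output is to sample per-thread CPU time: a RUNNABLE thread whose CPU time keeps climbing between two samples is busy-spinning, while a thread parked on a lock accrues essentially none. A minimal sketch using the standard ThreadMXBean API (the spinner thread here is just a stand-in for the stuck JRuby thread):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch: distinguish a busy-spinning thread from a blocked one by
// comparing its CPU time across a short sampling window.
public class SpinDetector {
    static boolean isSpinning(Thread t, long windowMs) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled())
            return false;                      // can't tell on this JVM
        long before = mx.getThreadCpuTime(t.getId());
        Thread.sleep(windowMs);                // sampling window
        long after = mx.getThreadCpuTime(t.getId());
        return t.getState() == Thread.State.RUNNABLE && after > before;
    }

    public static void main(String[] args) throws InterruptedException {
        // A deliberately spinning thread, standing in for the stuck one.
        Thread spinner = new Thread(() -> { while (true) { } }, "spinner");
        spinner.setDaemon(true);
        spinner.start();
        System.out.println("spinning=" + isSpinning(spinner, 200));
    }
}
```

The same check can be pointed at the JRuby worker threads by matching on thread name or id from ThreadMXBean.getThreadInfo.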

@kares those line nums could be valid:

  • NonBlockingHashMapLong.java#L534 could just be how the line number of a multi-line ternary statement gets reported,
  • NonBlockingHashMapLong.java#L843 is probably where the JVM's safepoint (at the backbranch of the non-counted loop) is located. When @banker requests backtraces, all threads are brought to safepoints; this is the line number the JVM thinks the spinning help_copy_impl thread is at (as this is where I'd expect a safepoint for that JIT-ed code to be).

Other than that, I had a look at that hashmap's slot relocation code, I think it is correct and shouldn't be able to get stuck as @banker reports.

@banker (Author) commented Jan 25, 2016

I haven't seen this issue since I reported it, but I'll try to investigate further the next time it happens. What I can tell you is this:

  1. The app has several long-running threads whose job is to read from a Kafka queue, do some work, and then write to another Kafka queue.
  2. When these threads get "stuck," they stop consuming from the Kafka queue, but maintain their connection to Kafka (and thus create a backlog).
  3. In this state, every call to jstack shows multiple threads executing the help_copy_impl method.
  4. The app never recovers from this state. It must be restarted.

@headius (Member) commented Feb 14, 2016

@banker Ok, I'm going to close this since none of us can reproduce it right now. If it happens again please provide a complete stack dump (all threads) so we can investigate whether there's perhaps something else getting stuck.

@headius headius closed this as completed Feb 14, 2016
@headius headius added this to the Invalid or Duplicate milestone Feb 14, 2016
@fatum commented Mar 2, 2017

We're facing the same trouble (jruby-9.1.2.0).

(screenshots attached)

More details in ratpack's issue:

ratpack/ratpack#1168

@thedarkone (Contributor) commented:

@fatum would you be willing to run your app with some debug output? Namely uncommenting these lines in NonBlockingHashMapLong.java:

```diff
      // Do not shrink, ever
      if( newsz < oldlen ) newsz = oldlen;
-      //System.out.println("old="+oldlen+" new="+newsz+" size()="+sz+" est_slots()="+q+" millis="+(tm-_nbhml._last_resize_milli));
+      System.out.println("old="+oldlen+" new="+newsz+" size()="+sz+" est_slots()="+q+" millis="+(tm-_nbhml._last_resize_milli));
        _nbhml._last_resize_milli = System.currentTimeMillis(); // Record resize time for next check
-        //long nano = System.nanoTime();
-        //System.out.println(" "+nano+" Promote table "+oldlen+" to "+_newchm._keys.length);
+        long nano = System.nanoTime();
+        System.out.println(" "+nano+" Promote table "+oldlen+" to "+_newchm._keys.length);
      }
      if( CAS_newchm( newchm ) ) { // NOW a resize-is-in-progress!
        //notifyAll();            // Wake up any sleepers
-        //long nano = System.nanoTime();
-        //System.out.println(" "+nano+" Resize from "+oldlen+" to "+(1<<log2)+" and had "+(_resizers-1)+" extras" );
+        long nano = System.nanoTime();
+        System.out.println(" "+nano+" Resize from "+oldlen+" to "+(1<<log2)+" and had "+(_resizers-1)+" extras" );
        //System.out.print("["+log2);
      } else                    // CAS failed?
        newchm = _newchm;       // Reread new table
      return newchm;
```

Feel free to write to a file (instead of System.out) or whatever works for your environment. What I am interested in is how large these hashtables get. You can also try it in your dev/staging, just to check whether the tables grow to 100k+ (or 1m+) entries. Inspecting a heap dump might also do the trick (here you'd want to know the size of the NonBlockingHashMapLong$CHM._keys long[] array).
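For the heap-dump route, a dump can also be triggered from inside the process via the HotSpot-only com.sun.management.HotSpotDiagnosticMXBean, then opened in a tool such as Eclipse MAT to find the largest NonBlockingHashMapLong$CHM._keys array. A sketch under the assumption of a HotSpot/OpenJDK JVM (the file name is an arbitrary choice):

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

// Sketch: programmatic heap dump on HotSpot, for offline inspection
// of the internal long[] key arrays.
public class HeapDumper {
    public static void dump(String path) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        new File(path).delete();     // dumpHeap refuses to overwrite
        bean.dumpHeap(path, true);   // true = only live objects
    }

    public static void main(String[] args) throws Exception {
        String path = "jruby-heap.hprof";
        dump(path);
        System.out.println("dumped: " + new File(path).length() + " bytes");
    }
}
```

The equivalent from outside the process is `jmap -dump:live,format=b,file=...` against the PID.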

@fatum commented Mar 14, 2017

Hi @thedarkone, nice to e-meet you. Sorry for the delay.

> What I am interested in is how large do these hashtables get?

As you can see, they are huge (33M) and continuously growing. This leads to high CPU consumption.

Seems like we have a leak or something in JRuby? Thanks for your help.

(screenshots attached: heap dump, heap usage, native memory, CPU samples)

The heap is stable, but native memory is leaking.

@thedarkone (Contributor) commented:

@fatum hi, nice to meet you as well.

Wow, 33m, yeah that is a problem for the current impl (its reprobing heuristics break down for huge tables). I have an idea how to improve on that; I'll try to have a PR ready by the end of the week.


One more thing: can you please confirm that the 33m hashmap is the one attached to one of the RubyToJavaInvoker subclasses (ConstructorInvoker, StaticMethodInvoker, InstanceMethodInvoker, SingletonMethodInvoker)? Either via heap dump or a manual thread dump. Samples don't help; the hashmap is used in multiple places by JRuby, and we need the full stacktrace to see which one (which class) grows that big. E.g. for @banker it seems to be InstanceMethodInvoker (but in your case it might be some other part of JRuby that is acting out).

@fatum commented Mar 15, 2017

Thanks for your help @thedarkone

> One more thing, can you please confirm that the 33m hashmap is the one attached to one of the RubyToJavaInvoker's subclasses (ConstructorInvoker, StaticMethodInvoker, InstanceMethodInvoker, SingletonMethodInvoker)?

Yes, this is InstanceMethodInvoker.

Should we use 1.7.x branch to avoid this issue?

Thanks,
Max

@kares (Member) commented Mar 15, 2017

For the record (might be related): there was a "leak" if anyone used the Java integration's proc-to-interface feature "heavily" (resolved in 9.1.8, see #4494). You can rule out whether it's related by trying 9.1.8.0.

@headius (Member) commented Mar 15, 2017

There were also a couple of other leaks (or simply excessive memory use) fixed in 9.1.8.0. I'd really like to see if there are still problems with that release.

@headius (Member) commented Mar 15, 2017

So yes, the cache associated with InstanceMethodInvoker (and others) looks like it can grow without bounds. For a normal application this shouldn't be a lot of data; there are only so many combinations of argument types you might use to call from Ruby to Java. However if something is causing many new classes to be created (e.g. the proc/interface leak @kares mentioned, or our own code generation within JRuby) it's possible you might see this grow endlessly.

@kares What would you think about putting a configurable limit on the size of these caches? If we reached some suitable upper bound, we could either stop caching (just search every time) or empty the cache and start over. Most methods that would see lots of types will also have few parameters. Also, any cache will become slower than a linear search at some point... I just don't know what that point is for Java methods and the non-blocking hash.
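The "empty the cache and start over" option can be sketched in a few lines. The names and policy below are hypothetical, not JRuby's actual cache API: once the map crosses a configurable bound, it is simply cleared, so workloads that keep generating new argument-type combinations can never grow it without limit.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a size-bounded signature cache with a
// flush-and-restart policy (not JRuby's real implementation).
public class BoundedSignatureCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final int maxSize;               // the configurable bound

    public BoundedSignatureCache(int maxSize) { this.maxSize = maxSize; }

    public V get(K key, Function<K, V> compute) {
        V v = map.get(key);
        if (v != null) return v;             // cache hit
        if (map.size() >= maxSize) map.clear(); // over the bound: start over
        v = compute.apply(key);
        map.put(key, v);
        return v;
    }

    public int size() { return map.size(); }

    public static void main(String[] args) {
        BoundedSignatureCache<Integer, String> cache = new BoundedSignatureCache<>(100);
        // 1000 distinct "signatures" against a bound of 100.
        for (int i = 0; i < 1000; i++) cache.get(i, k -> "sig-" + k);
        System.out.println("bounded=" + (cache.size() <= 100));
    }
}
```

The clear is racy (two threads may both flush), but for a cache that only trades away some hits, that is acceptable; a stop-caching-above-the-bound policy would be an equally simple variant.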

@kares (Member) commented Mar 15, 2017

> We're faced on the same trouble (jruby-9.1.2.0)

@headius makes sense, but given the above I would wait to confirm it's still an issue with 9.1.8.0.

A bound would be needed, in the proc-to-iface case, only if callers kept passing in blocks with constantly changing (increasing/decreasing) arities - which is quite an extreme case - not worth doing.
Then there's also the case of always-changing types being passed in, e.g. generated classes, which might be worth the bound if that's what is happening for @fatum or others.

@fatum commented Mar 16, 2017

Deployed 9.1.8 in one region, and after ~20 hours of production load (~10-15K RPS) everything is fine:

  • CPU load is much more stable (no peaks)
  • Performance is improved
  • long[] doesn't accumulate now

Thank you for your help!

@kares (Member) commented Mar 16, 2017

@fatum where do I send the invoice then? Just kidding - glad to hear that, hopefully this is done for good. Thanks for testing this out.

@fatum commented Mar 16, 2017

@kares 👍

PS: Before that we tested 9.1.7 and there was a leak - we got an OOM after a couple of hours.
