
Threads stuck on #3612

Closed · banker opened this issue Jan 22, 2016 · 18 comments

@banker commented Jan 22, 2016

I have a multi-threaded Ruby app running on jruby 9.0.4.0 (2.2.2) 2015-11-12 b9fb7aa Java HotSpot(TM) 64-Bit Server VM 25.45-b02 on 1.8.0_45-b14 +jit [linux-amd64].

This is the same app referenced in #3394.

We are still seeing "stuck" threads, although much less frequently than before (the issue occurs perhaps once a week, sometimes less often). Although the threads are RUNNABLE, they are not making progress. Here is the relevant portion of the jstack trace:

```
"Ruby-0-Thread-10: #24 daemon prio=5 os_prio=0 tid=0x00007fe6b0578800 nid=0x3be0 runnable [0x00007fe68bdf8000]
   java.lang.Thread.State: RUNNABLE
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.help_copy_impl(NonBlockingHashMapLong.java:843)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.access$300(NonBlockingHashMapLong.java:402)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong.help_copy(NonBlockingHashMapLong.java:396)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong.access$1100(NonBlockingHashMapLong.java:79)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.copy_slot_and_check(NonBlockingHashMapLong.java:868)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.get_impl(NonBlockingHashMapLong.java:534)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong$CHM.access$200(NonBlockingHashMapLong.java:402)
  at org.cliffc.high_scale_lib.NonBlockingHashMapLong.get(NonBlockingHashMapLong.java:360)
  at org.jruby.java.invokers.RubyToJavaInvoker.getSignature(RubyToJavaInvoker.java:205)
  at org.jruby.java.dispatch.CallableSelector.matchingCallableArityOne(CallableSelector.java:103)
  at org.jruby.java.invokers.RubyToJavaInvoker.findCallableArityOne(RubyToJavaInvoker.java:356)
  at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:98)
  at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:139)
  at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:145)
```

When this occurs, several threads are stuck on the help_copy_impl method.

@kares (Member) commented Jan 22, 2016

heh - seems like high_scale lib is not 100% non-blocking in the end ... 😞

@thedarkone (Contributor) commented:

> heh - seems like high_scale lib is not 100% non-blocking in the end ... 😞

Haha 😈

A bug in Cliff Click's code?

@kares (Member) commented Jan 24, 2016

seems as if concurrent get calls (get_impl runs a while-true loop) on CHM can get into a state where 2 threads cannot finish their work.
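For context, the "help copy" the stack trace keeps landing in is a cooperative-resize pattern: every thread that touches the table during a resize copies some slots from the old array to the new one, so no single thread has to finish the copy alone. Below is a toy sketch of that idea only; it is NOT the actual high-scale-lib code, and the class name, the EMPTY sentinel, and the same-index slot placement are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Toy illustration of cooperative slot copying (not high-scale-lib itself).
public class HelpCopySketch {
    static final long EMPTY = 0;            // assumption: 0 marks an empty/claimed slot
    final AtomicLongArray oldTab;
    final AtomicLongArray newTab;

    HelpCopySketch(long[] contents) {
        oldTab = new AtomicLongArray(contents);
        newTab = new AtomicLongArray(contents.length * 2);
    }

    // Copy one slot: CAS the old slot to EMPTY so exactly one thread wins,
    // then publish the value into the new table.
    void copySlot(int i) {
        long v = oldTab.get(i);
        if (v != EMPTY && oldTab.compareAndSet(i, v, EMPTY)) {
            newTab.set(i, v);               // toy placement: same index
        }
    }

    // Every caller "helps" by sweeping the old table; duplicated work is
    // harmless because the per-slot CAS lets only one copier succeed.
    void helpCopy() {
        for (int i = 0; i < oldTab.length(); i++) copySlot(i);
    }

    public static void main(String[] args) throws InterruptedException {
        HelpCopySketch t = new HelpCopySketch(new long[]{1, 2, 3, 4});
        Thread a = new Thread(t::helpCopy);
        Thread b = new Thread(t::helpCopy);
        a.start(); b.start(); a.join(); b.join();
        for (int i = 0; i < 4; i++)
            if (t.newTab.get(i) != i + 1) throw new AssertionError("slot " + i);
        System.out.println("copy complete");
    }
}
```

The real algorithm is considerably more subtle (tombstones, prime-boxing, re-probing), which is exactly where a livelock between two helpers could hide.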

line numbers from the above stack do not match 1.0.6 (at least the version on GH) e.g. https://github.com/boundary/high-scale-lib/blob/high-scale-lib-1.0.6/src/main/java/org/cliffc/high_scale_lib/NonBlockingHashMapLong.java#L843 or https://github.com/boundary/high-scale-lib/blob/high-scale-lib-1.0.6/src/main/java/org/cliffc/high_scale_lib/NonBlockingHashMapLong.java#L534 .. confusing :(

@banker the next JRuby version (9.0.5) should have NonBlockingHashMapLong (unless it's ditched) coming from org.jruby, so the line numbers should be better.

> Although the threads are RUNNABLE, they are not making progress.

so do they never stop executing, or do they stop after a longer period of time?

@thedarkone (Contributor) commented:

@banker are you sure those threads are really stuck? With org.cliffc.high_scale_lib.NonBlockingHashMapLong, "stuck" would mean endlessly spinning, chewing 100% CPU; they wouldn't be stuck waiting on a lock or anything like that.
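One way to confirm "spinning, not blocked" without eyeballing jstack output is to sample per-thread CPU time: a RUNNABLE thread whose CPU time keeps climbing between two samples is busy-spinning, while a thread parked on a lock accrues essentially none. A minimal sketch using the standard ThreadMXBean API (the spinner thread here is just a stand-in for the stuck JRuby thread):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch: distinguish a busy-spinning thread from a blocked one by
// comparing its CPU time across a short sampling window.
public class SpinDetector {
    static boolean isSpinning(Thread t, long windowMs) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled())
            return false;                      // can't tell on this JVM
        long before = mx.getThreadCpuTime(t.getId());
        Thread.sleep(windowMs);                // sampling window
        long after = mx.getThreadCpuTime(t.getId());
        return t.getState() == Thread.State.RUNNABLE && after > before;
    }

    public static void main(String[] args) throws InterruptedException {
        // A deliberately spinning thread, standing in for the stuck one.
        Thread spinner = new Thread(() -> { while (true) { } }, "spinner");
        spinner.setDaemon(true);
        spinner.start();
        System.out.println("spinning=" + isSpinning(spinner, 200));
    }
}
```

The same check can be pointed at the JRuby worker threads by matching on thread name or id from ThreadMXBean.getThreadInfo.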

@kares those line nums could be valid:

  • NonBlockingHashMapLong.java#L534 could just be how the line number of a multi-line ternary statement gets reported,
  • NonBlockingHashMapLong.java#L843 is probably where the JVM's safepoint (at the backbranch of the non-counted loop) is located. When @banker requests backtraces, all threads are brought to safepoints; this is the line number the JVM thinks the spinning help_copy_impl thread is at (as this is where I'd expect a safepoint for that JIT-ed code to be).

Other than that, I had a look at that hashmap's slot relocation code, I think it is correct and shouldn't be able to get stuck as @banker reports.

@banker (Author) commented Jan 25, 2016

I haven't seen this issue since I reported it, but I'll try to investigate further the next time it happens. What I can tell you is this:

  1. The app has several long-running threads whose job is to read from a Kafka queue, do some work, and then write to another Kafka queue.
  2. When these threads get "stuck," they stop consuming from the Kafka queue, but maintain their connection to Kafka (and thus create a backlog).
  3. In this state, every call to jstack shows multiple threads executing the help_copy_impl method.
  4. The app never recovers from this state. It must be restarted.

@headius (Member) commented Feb 14, 2016

@banker Ok, I'm going to close this since none of us can reproduce it right now. If it happens again please provide a complete stack dump (all threads) so we can investigate whether there's perhaps something else getting stuck.

@headius headius closed this as completed Feb 14, 2016
@headius headius added this to the Invalid or Duplicate milestone Feb 14, 2016
@fatum commented Mar 2, 2017

We're facing the same trouble (jruby-9.1.2.0).

(screenshots attached)

More details in ratpack's issue:

ratpack/ratpack#1168

@thedarkone (Contributor) commented:

@fatum would you be willing to run your app with some debug output? Namely uncommenting these lines in NonBlockingHashMapLong.java:

```diff
      // Do not shrink, ever
      if( newsz < oldlen ) newsz = oldlen;
-      //System.out.println("old="+oldlen+" new="+newsz+" size()="+sz+" est_slots()="+q+" millis="+(tm-_nbhml._last_resize_milli));
+      System.out.println("old="+oldlen+" new="+newsz+" size()="+sz+" est_slots()="+q+" millis="+(tm-_nbhml._last_resize_milli));
        _nbhml._last_resize_milli = System.currentTimeMillis(); // Record resize time for next check
-        //long nano = System.nanoTime();
-        //System.out.println(" "+nano+" Promote table "+oldlen+" to "+_newchm._keys.length);
+        long nano = System.nanoTime();
+        System.out.println(" "+nano+" Promote table "+oldlen+" to "+_newchm._keys.length);
      }
      if( CAS_newchm( newchm ) ) { // NOW a resize-is-in-progress!
        //notifyAll();            // Wake up any sleepers
-        //long nano = System.nanoTime();
-        //System.out.println(" "+nano+" Resize from "+oldlen+" to "+(1<<log2)+" and had "+(_resizers-1)+" extras" );
+        long nano = System.nanoTime();
+        System.out.println(" "+nano+" Resize from "+oldlen+" to "+(1<<log2)+" and had "+(_resizers-1)+" extras" );
        //System.out.print("["+log2);
      } else                    // CAS failed?
        newchm = _newchm;       // Reread new table
      return newchm;
```

Feel free to write to a file (instead of System.out) or whatever works for your environment. What I am interested in is how large these hashtables get. You can also try it in your dev/staging, just to check whether the tables grow to 100k+ (or 1m+) entries. Inspecting a heap dump might also do the trick (here you'd want to know the size of the NonBlockingHashMapLong$CHM._keys long[] array).
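For the heap-dump route, a dump can also be triggered from inside the process via the HotSpot-only com.sun.management.HotSpotDiagnosticMXBean, then opened in a tool such as Eclipse MAT to find the largest NonBlockingHashMapLong$CHM._keys array. A sketch under the assumption of a HotSpot/OpenJDK JVM (the file name is an arbitrary choice):

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

// Sketch: programmatic heap dump on HotSpot, for offline inspection
// of the internal long[] key arrays.
public class HeapDumper {
    public static void dump(String path) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        new File(path).delete();     // dumpHeap refuses to overwrite
        bean.dumpHeap(path, true);   // true = only live objects
    }

    public static void main(String[] args) throws Exception {
        String path = "jruby-heap.hprof";
        dump(path);
        System.out.println("dumped: " + new File(path).length() + " bytes");
    }
}
```

The equivalent from outside the process is `jmap -dump:live,format=b,file=...` against the PID.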

@fatum commented Mar 14, 2017

Hi @thedarkone, nice to e-meet you. Sorry for the delay.

> What I am interested in is how large do these hashtables get?

As you can see, they are huge (33M) and continuously growing. This leads to high CPU consumption.

Seems like we have a leak or something in JRuby? Thanks for your help.

(screenshots attached: heap dump, heap usage, native memory, CPU samples)

The heap is stable, but native memory is leaking.

@thedarkone (Contributor) commented:

@fatum hi, nice to meet you as well.

Wow, 33m, yeah that is a problem for the current impl (its reprobing heuristics break down for huge tables). I have an idea how to improve on that; I'll try to have a PR ready by the end of the week.


One more thing: can you please confirm that the 33m hashmap is the one attached to one of the RubyToJavaInvoker subclasses (ConstructorInvoker, StaticMethodInvoker, InstanceMethodInvoker, SingletonMethodInvoker)? Either via heap dump or a manual thread dump. Samples don't help; the hashmap is used in multiple places by JRuby, and we need the full stacktrace to see which one (which class) grows that big. E.g. for @banker it seems to be InstanceMethodInvoker (but in your case it might be some other part of JRuby that is acting out).

@fatum commented Mar 15, 2017

Thanks for your help @thedarkone

> One more thing, can you please confirm that the 33m hashmap is the one attached to one of the RubyToJavaInvoker's subclasses (ConstructorInvoker, StaticMethodInvoker, InstanceMethodInvoker, SingletonMethodInvoker)?

Yes, this is InstanceMethodInvoker.

Should we use 1.7.x branch to avoid this issue?

Thanks,
Max

@kares (Member) commented Mar 15, 2017

For the record (might be related): there was a "leak" if anyone used the Java integration's proc-to-interface feature "heavily" (resolved in 9.1.8, see #4494). You can rule out whether it's related by trying 9.1.8.0.

@headius (Member) commented Mar 15, 2017

There were also a couple of other leaks (or simply excessive memory use) fixed in 9.1.8.0. I'd really like to see if there are still problems with that release.

@headius (Member) commented Mar 15, 2017

So yes, the cache associated with InstanceMethodInvoker (and others) looks like it can grow without bounds. For a normal application this shouldn't be a lot of data; there are only so many combinations of argument types you might use to call from Ruby to Java. However if something is causing many new classes to be created (e.g. the proc/interface leak @kares mentioned, or our own code generation within JRuby) it's possible you might see this grow endlessly.

@kares What would you think about putting a configurable limit on the size of these caches? If we reached some suitable upper bound, we could either stop caching (just search every time) or empty the cache and start over. Most methods that would see lots of types will also have few parameters. Also, any cache will become slower than a linear search at some point... I just don't know what that point is for Java methods and the non-blocking hash.
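The "empty the cache and start over" option can be sketched in a few lines. The names and policy below are hypothetical, not JRuby's actual cache API: once the map crosses a configurable bound, it is simply cleared, so workloads that keep generating new argument-type combinations can never grow it without limit.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a size-bounded signature cache with a
// flush-and-restart policy (not JRuby's real implementation).
public class BoundedSignatureCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final int maxSize;               // the configurable bound

    public BoundedSignatureCache(int maxSize) { this.maxSize = maxSize; }

    public V get(K key, Function<K, V> compute) {
        V v = map.get(key);
        if (v != null) return v;             // cache hit
        if (map.size() >= maxSize) map.clear(); // over the bound: start over
        v = compute.apply(key);
        map.put(key, v);
        return v;
    }

    public int size() { return map.size(); }

    public static void main(String[] args) {
        BoundedSignatureCache<Integer, String> cache = new BoundedSignatureCache<>(100);
        // 1000 distinct "signatures" against a bound of 100.
        for (int i = 0; i < 1000; i++) cache.get(i, k -> "sig-" + k);
        System.out.println("bounded=" + (cache.size() <= 100));
    }
}
```

The clear is racy (two threads may both flush), but for a cache that only trades away some hits, that is acceptable; a stop-caching-above-the-bound policy would be an equally simple variant.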

@kares (Member) commented Mar 15, 2017

> We're faced on the same trouble (jruby-9.1.2.0)

@headius makes sense, but given the above I would wait to confirm it's still an issue with 9.1.8.0.

A bound would be needed, in the proc-to-iface case, only if callers kept passing in blocks with constantly changing (increasing/decreasing) arities - which is quite an extreme case - not worth doing.
Then there's also the case of always-changing types being passed in, e.g. generated classes, which might be worth the bound if that's what is happening for @fatum or others.

@fatum commented Mar 16, 2017

Deployed 9.1.8 in one region, and after ~20 hours of production load (~10-15K RPS) everything is fine:

  • CPU load is much more stable (no peaks)
  • Performance is improved
  • long[] doesn't accumulate now

Thank you for your help!

@kares (Member) commented Mar 16, 2017

@fatum where do I send the invoice then? Just kidding - glad to hear that, hopefully this is done for good. Thanks for testing this out.

@fatum commented Mar 16, 2017

@kares 👍

PS: Before that we tested 9.1.7 and there was a leak - we got an OOM after a couple of hours.
