Threads stuck on #3612
heh - seems like high_scale lib is not 100% non-blocking in the end ... 😞 |
Haha 😈 A bug in Cliff Click's code? |
Seems as if the line numbers from the above stack do not match 1.0.6 (at least the version on GH), e.g. https://github.com/boundary/high-scale-lib/blob/high-scale-lib-1.0.6/src/main/java/org/cliffc/high_scale_lib/NonBlockingHashMapLong.java#L843 or https://github.com/boundary/high-scale-lib/blob/high-scale-lib-1.0.6/src/main/java/org/cliffc/high_scale_lib/NonBlockingHashMapLong.java#L534 ... confusing :( @banker the next JRuby version (9.0.5) should have
So do they never stop executing, or do they stop after a longer period of time? |
@banker are you sure those threads are really stuck? With @kares those line nums could be valid:
Other than that, I had a look at that hashmap's slot relocation code; I think it is correct and shouldn't be able to get stuck the way @banker reports. |
I haven't seen this issue since I reported it, but I'll try to investigate further the next time it happens. What I can tell you is this:
|
@banker Ok, I'm going to close this since none of us can reproduce it right now. If it happens again please provide a complete stack dump (all threads) so we can investigate whether there's perhaps something else getting stuck. |
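As an aside on capturing the kind of complete stack dump requested above: `jstack <pid>` against the running JVM is the usual route. The snippet below is a minimal, hypothetical in-process alternative built on the standard `Thread.getAllStackTraces()` API, in case attaching jstack to a production process is inconvenient; it is only a sketch, not something from this thread.

```java
import java.util.Map;

// Minimal sketch: print every live thread, its state, and its stack trace,
// roughly the same information a `jstack <pid>` dump contains.
public class FullThreadDump {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            Thread t = entry.getKey();
            System.out.println("\"" + t.getName() + "\" state=" + t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```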
@fatum would you be willing to run your app with some debug output? Namely, uncommenting these lines in NonBlockingHashMapLong:

```diff
   // Do not shrink, ever
   if( newsz < oldlen ) newsz = oldlen;
-  //System.out.println("old="+oldlen+" new="+newsz+" size()="+sz+" est_slots()="+q+" millis="+(tm-_nbhml._last_resize_milli));
+  System.out.println("old="+oldlen+" new="+newsz+" size()="+sz+" est_slots()="+q+" millis="+(tm-_nbhml._last_resize_milli));
```

```diff
   _nbhml._last_resize_milli = System.currentTimeMillis(); // Record resize time for next check
-  //long nano = System.nanoTime();
-  //System.out.println(" "+nano+" Promote table "+oldlen+" to "+_newchm._keys.length);
+  long nano = System.nanoTime();
+  System.out.println(" "+nano+" Promote table "+oldlen+" to "+_newchm._keys.length);
   }
```

```diff
   if( CAS_newchm( newchm ) ) { // NOW a resize-is-in-progress!
     //notifyAll();             // Wake up any sleepers
-    //long nano = System.nanoTime();
-    //System.out.println(" "+nano+" Resize from "+oldlen+" to "+(1<<log2)+" and had "+(_resizers-1)+" extras" );
+    long nano = System.nanoTime();
+    System.out.println(" "+nano+" Resize from "+oldlen+" to "+(1<<log2)+" and had "+(_resizers-1)+" extras" );
     //System.out.print("["+log2);
   } else                      // CAS failed?
     newchm = _newchm;         // Reread new table
   return newchm;
```

Feel free to write to a file (instead of System.out). |
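On "write to a file instead of System.out": a minimal sketch of one way to do that without touching the printlns themselves is to redirect `System.out` to a file-backed `PrintStream` at startup. The log path below is just an example, not anything suggested in the thread.

```java
import java.io.FileOutputStream;
import java.io.PrintStream;

// Sketch (assumed log path): route everything printed via System.out,
// including the debug printlns uncommented above, to a file.
public class RedirectStdout {
    public static void main(String[] args) throws Exception {
        PrintStream fileOut = new PrintStream(new FileOutputStream("nbhml-resize-debug.log", true), true);
        System.setOut(fileOut);
        System.out.println("debug output now goes to nbhml-resize-debug.log");
    }
}
```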
Hi @thedarkone, nice to e-meet you. Sorry for the delay.
As you can see they are huge (33M) and continuously growing. This leads to high CPU consumption. Seems like we have a leak or something with JRuby? Thanks for your help |
@fatum hi, nice to meet you as well. Wow, 33m, yeah that is a problem for the current impl (its reprobing heuristics break down for huge tables). I have an idea how to improve on that, I'll try to have a PR ready by the end of the week. One more thing, can you please confirm that the 33m hashmap is the one attached to one of the |
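To make "reprobing heuristics break down for huge tables" concrete: in Cliff Click's map, the number of slots a lookup may probe before giving up grows with the table length, so a table with tens of millions of slots can scan an enormous number of entries on a single miss, which matches the high CPU usage reported above. The sketch below only illustrates that scaling; the constant and the `len >> 2` term are assumptions based on the upstream source, not necessarily the exact values in the high-scale-lib version bundled with JRuby.

```java
// Illustration only: a reprobe limit that scales with table length blows up
// for very large tables. REPROBE_LIMIT and (len >> 2) are assumed values.
public class ReprobeLimitDemo {
    static final int REPROBE_LIMIT = 10;

    static int reprobeLimit(int len) {
        return REPROBE_LIMIT + (len >> 2); // grows linearly with table size
    }

    public static void main(String[] args) {
        System.out.println("limit for ~1K slots:  " + reprobeLimit(1 << 10));
        System.out.println("limit for ~33M slots: " + reprobeLimit(33_000_000));
    }
}
```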
Thanks for your help @thedarkone
Yes, this is
Should we use the 1.7.x branch to avoid this issue? Thanks, |
For the record (might be related): there was a "leak" if anyone used Java integration's proc-to-interface feature "heavily" (has been resolved in 9.1.8, see #4494). You can rule out whether it's related if you try 9.1.8.0. |
There were also a couple other leaks (or simply excessive memory use) fixed in 9.1.8.0. I'd really like to see if there are still problems with that release. |
So yes, the cache associated with InstanceMethodInvoker (and others) looks like it can grow without bounds. For a normal application this shouldn't be a lot of data; there are only so many combinations of argument types you might use to call from Ruby to Java. However, if something is causing many new classes to be created (e.g. the proc/interface leak @kares mentioned, or our own code generation within JRuby) it's possible you might see this grow endlessly.
@kares What would you think about putting a configurable limit on the size of these caches? If we reached some suitable upper bound, we could either stop caching (just search every time) or empty the cache and start over. Most methods that would see lots of types will also have few parameters. Also, any cache will become slower than a linear search at some point... I just don't know what that point is for Java methods and the non-blocking hash. |
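A minimal sketch of the "empty the cache and start over" idea, assuming a plain ConcurrentHashMap and an illustrative size limit; the class name and constant are hypothetical, not JRuby's actual invoker cache API.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical bounded cache: once it reaches MAX_ENTRIES it is cleared and
// repopulated, so it can never grow without bounds. Names are illustrative only.
public class BoundedCallCache<K, V> {
    private static final int MAX_ENTRIES = 1024; // assumed configurable upper bound

    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();

    public V get(K key) {
        return cache.get(key);
    }

    public void put(K key, V value) {
        if (cache.size() >= MAX_ENTRIES) {
            cache.clear(); // start over instead of growing endlessly
        }
        cache.put(key, value);
    }
}
```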
@headius makes sense, but given ^^ ... would wait to confirm it's still an issue with 9.1.8.0. A bound would be needed, in the case of proc-to-iface, only if it kept passing in blocks with constantly changing (increasing/decreasing) arities - which is quite an extreme - not worth doing. |
Deployed 9.1.8 in one region and after ~20 hours of production load (~10-15K RPS) everything is fine:
Thank you for your help! |
@fatum where do I send the invoice then? Just kidding - glad to hear that, hopefully this is done for good. Thanks for testing this out. |
@kares 👍 PS: Before that we tested 9.1.7 and there was a leak - we got an OOM after a couple of hours |
I have a multi-threaded Ruby app running on jruby 9.0.4.0 (2.2.2) 2015-11-12 b9fb7aa Java HotSpot(TM) 64-Bit Server VM 25.45-b02 on 1.8.0_45-b14 +jit [linux-amd64].
This is the same app referenced in #3394.
We are still seeing "stuck" threads, although much less frequently than before (the issue occurs perhaps once a week, sometimes less often). Although the threads are RUNNABLE, they are not making progress. Here's the relevant portion of the jstack trace:
When this occurs, several threads are stuck on the help_copy_impl method.