Blowing Java stack can leave ObjectProxyCache deadlocked #3857

jmiettinen · 2016-05-06T08:49:39Z

Environment

All JRuby versions on GitHub

Expected Behavior

Running out of Java stack does not deadlock JRuby-Java interop

Actual Behavior

When a proxied object is needed deep in the call stack, the locks for ObjectProxyCache segments can stay unreleased, because the unlock calls in the finally blocks themselves throw StackOverflowError.

This eventually deadlocks any application that uses proxied objects.

Using monitor-based locking does not have this problem.
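
The difference between the two locking styles can be sketched in Java. This is a minimal illustration, not the actual ObjectProxyCache code: the class and method names are hypothetical, and the overflow is simulated by throwing a StackOverflowError explicitly where a real `unlock()` call could fail for lack of stack. The monitor variant does not leak because the JVM releases the monitor as the synchronized block exits, without an ordinary method call.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockLeakSketch {
    static final ReentrantLock LOCK = new ReentrantLock();
    static final Object MONITOR = new Object();

    // Hypothetical stand-in for unlock(): throws where a real call could
    // overflow because invoking unlock() needs a fresh stack frame.
    static void unlock(boolean simulateOverflow) {
        if (simulateOverflow) {
            throw new StackOverflowError("simulated");
        }
        LOCK.unlock();
    }

    // Explicit lock: the release depends on a method call in finally.
    static void explicitLock(boolean simulateOverflow) {
        LOCK.lock();
        try {
            // ... look up or create the proxy here ...
        } finally {
            unlock(simulateOverflow);
        }
    }

    // Monitor: the release happens when the block exits, even on error.
    static void monitorLock(boolean simulateOverflow) {
        synchronized (MONITOR) {
            if (simulateOverflow) {
                throw new StackOverflowError("simulated");
            }
        }
    }

    public static void main(String[] args) {
        try {
            explicitLock(true);
        } catch (StackOverflowError e) {
            // swallowed for the demonstration only
        }
        System.out.println("ReentrantLock still held: " + LOCK.isLocked());

        try {
            monitorLock(true);
        } catch (StackOverflowError e) {
            // swallowed for the demonstration only
        }
        System.out.println("monitor still held: " + Thread.holdsLock(MONITOR));
    }
}
```

In this simulation the ReentrantLock stays held after the error while the monitor does not, which is the leak described above. Any other thread that later calls `LOCK.lock()` would block forever.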

kares · 2016-05-06T10:20:12Z

Thanks. Would it be possible to have a reproduction test case (as I already asked on IRC)?

headius · 2016-05-09T21:12:44Z

StackOverflowError is generally considered to be a fatal error, since it can do all sorts of nasty things to runtime state. In general we don't make any guarantees about the robustness of JRuby after a stack overflow.

However, seems in this case we could at least try to unlock the lock, since without that we might have other threads stuck forever.

headius · 2016-05-09T21:15:55Z

Ahh, I see you mentioned that the SOE is raised in the finally. I'm not sure there's anything we can do about this. There are similar locks all over JRuby, and this is exactly why SOE is usually considered fatal.

jmiettinen · 2016-05-10T07:55:23Z

I guess this is a choice between using ReentrantLocks and monitor-based locking. The latter is not susceptible to this, but the runtime might still be broken by other code in finally blocks not being run.

headius · 2016-05-10T21:39:10Z

Monitor-based locking is also likely heavier than the ReentrantLock implementation. However I'm not sure it matters as much now that we only use ObjectProxyCache for objects that actually need idempotence. We might not be hitting it hard enough to matter anymore.

I also wish the code in ObjectProxyCache was a bit more approachable. Switching this all to monitors would be an interesting job.

jmiettinen · 2016-05-12T12:29:59Z

We've encountered this live, so this is not a totally theoretical problem.

However, in the cases where we've blown the stack, there seem to be some odd things happening.
Is there a way a JRuby internals stack frame would consume more than ~200 bytes (say, 16 * 8 bytes for registers and then still 10 variables more)?

We've had the stack blown with just ~70 Ruby frames, which based on this estimate should take around 70 kB (70 Ruby frames * 5 Java frames / Ruby frame * 200 bytes / Java frame).
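
The estimate can be checked with a few lines of Java. This is only a sketch of the arithmetic: the frame counts and sizes are the rough guesses given above, the 1 MB figure is an assumed thread stack size for comparison, and the class and method names are made up for illustration.

```java
public class StackEstimate {
    // Multiplies out the rough per-frame guesses from the estimate.
    static long estimateBytes(long rubyFrames, long javaFramesPerRubyFrame, long bytesPerJavaFrame) {
        return rubyFrames * javaFramesPerRubyFrame * bytesPerJavaFrame;
    }

    public static void main(String[] args) {
        long estimate = estimateBytes(70, 5, 200);
        long assumedStack = 1024L * 1024L; // assumed 1 MB thread stack

        System.out.println(estimate + " bytes estimated");               // 70000 bytes estimated
        System.out.println(assumedStack / estimate + "x stack headroom"); // 14x stack headroom
    }
}
```

With these numbers the estimated usage is about 7% of a 1 MB stack, so the guesses and the observed overflow cannot both be right.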

We have -Djruby.compile.invokedynamic=false so that should not play a role here.

Thus, I don't think this should happen in normal usage. We have something odd going on, either in Rails or in how JRuby reports the stack in the backtrace.

kares · 2016-05-12T15:42:02Z

Sounds like this could use some detailed examination. Knowing more about the problematic stack, and about the app in general, could help in resolving the issue.

headius · 2016-05-12T22:19:56Z

May I ask how you are deploying? JRuby's launcher at startup bumps the default JVM thread stack size up to 2MB. If you are deploying in an embedded scenario or launching JRuby without one of our launchers (e.g. java -jar ...), it will use the JVM's default (1MB on my system).

We bump it up because yes, JRuby does consume a fair bit of stack. We work to reduce this periodically, but having an interpreter means we'll always use more stack than other JVM languages.

The answer to this bug is most likely one of the following:

Some library is recursing too deeply (in error). Fix it.
JRuby itself is consuming more stack than the JVM provides. Bump up the JVM value.

I did ask around on Twitter and received pointers to OpenJDK9 JEP-270, which seeks to provide reserved stack space for operations like locking and unlocking, to help ensure they never blow the stack. However stack overflow is always going to be an issue, and as an asynchronous exception it's always going to be considered fatal.

We're happy to help you investigate the actual stack overflow. That's the problem we should be chasing here. Open an issue for that, please :-)

headius · 2016-05-12T22:23:44Z

I did think of one possible way we could improve this that's kinda silly and probably very fragile: make sure that lock and unlock deepen the stack exactly the same amount. That way, if we're at the end of the stack in the current method, lock will SOE and we'll never get to the deadlock.

I will look at that for a moment.

headius · 2016-05-12T22:24:22Z

Such a fix would be fragile because we don't control what happens at the JDK level, and they may not balance these methods properly at some point.

It's silly because it would work pretty well despite being fragile :-)

headius · 2016-05-12T22:43:08Z

Yeah, no dice I'm afraid. The best we can hope is that the JDK implementation of this logic is balanced (it appears to be, but there are many different paths) and that the JVM JITs them so they use balanced amounts of stack (which may be unrealistic to ever expect).

In any case I am back to thinking there's nothing we can do but look at the original SOE. Toss what you know in another issue and we'll look into it.

jmiettinen · 2016-05-13T07:16:40Z

Yeah, balancing the ReentrantLocks isn't really doable; only monitors work there.
And yeah, we're deploying a Warbler package to Tomcat, and I just noticed that we have default stack size (1 MB) there.

But still, is my arithmetic on the stack usage totally off? My estimate (~70 kB) is an order of magnitude smaller than the amount of memory that should be available.

headius · 2016-05-13T12:35:24Z

Your math isn't wrong, but the Ruby stack trace you see is likely misleading you. Normally we don't try to rethrow Java's StackOverflowError as a Ruby SystemStackError because it requires catching SOE all over the place. There are, I believe, a few places that still do it. What is likely happening is that you're getting an SOE fairly deep into Ruby, but it doesn't get reported until much further up the stack. Feel free to add more info about your env. Without getting a full Java trace from the SOE it is hard to know how deep it actually was getting. I can say that I was able to recurse a single-variable Ruby method about 1000 times with a 2MB stack, and with a 5MB stack it shoots up to 10k recursions (probably because more JIT starts kicking in).
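
A similar depth measurement can be sketched in plain Java, without JRuby, to see how recursion depth responds to stack size on a given JVM. The class and method names here are hypothetical. The sketch relies on the `Thread` constructor that takes a requested stack size, which the JDK documents as a hint that some platforms may ignore, so the numbers will vary by JVM, platform, and JIT state and will not match the Ruby figures above.

```java
public class RecursionDepthSketch {
    // Only touched by one probe thread at a time.
    static int depth;

    static void recurse() {
        depth++;
        recurse();
    }

    // Runs unbounded recursion on a thread with the requested stack size
    // and returns how deep it got before StackOverflowError.
    static int maxDepth(long requestedStackBytes) {
        final int[] result = new int[1];
        Thread probe = new Thread(null, () -> {
            depth = 0;
            try {
                recurse();
            } catch (StackOverflowError e) {
                // The stack has unwound here, so recording is safe.
                result[0] = depth;
            }
        }, "depth-probe", requestedStackBytes);
        probe.start();
        try {
            probe.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while probing", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("1 MB requested: " + maxDepth(1L << 20));
        System.out.println("2 MB requested: " + maxDepth(2L << 20));
        System.out.println("5 MB requested: " + maxDepth(5L * (1L << 20)));
    }
}
```

For a whole JVM, such as a servlet container, the thread stack size is normally set with the HotSpot `-Xss` option rather than per thread.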

kares added java integration JRuby 1.7.x JRuby 9000 labels May 6, 2016

headius closed this as completed May 12, 2016

headius added this to the Won't Fix milestone May 12, 2016

headius reopened this May 12, 2016

headius closed this as completed May 12, 2016
