Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-raising the cause of a Java exception can result in leaked memory and IllegalStateExceptions #4699

Closed
marshalium opened this issue Jun 30, 2017 · 15 comments

Comments

@marshalium
Copy link
Contributor

Environment

$ jruby -v
jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 Java HotSpot(TM) 64-Bit Server VM 25.121-b13 on 1.8.0_121-b13 +jit [darwin-x86_64]
$ uname -a
Darwin marshall-work-lappy.local 15.6.0 Darwin Kernel Version 15.6.0: Tue Apr 11 16:00:51 PDT 2017; root:xnu-3248.60.11.5.3~1/RELEASE_X86_64 x86_64

I only reproduced this using Guava, but I believe it is not a Guava specific bug. If you'd like me to reproduce without an external dependency I'm willing to try, but it was easier to just use Gauva.

Expected Behavior

Re-raising the cause of a Java exception should work without error or leaked memory.

This script reproduces part of the issue. It runs successfully on 1.7.27 but fails on 9.1.12.0.

To run, download the Guava jar https://repo1.maven.org/maven2/com/google/guava/guava/22.0/guava-22.0.jar, and place it this script:

#!/usr/bin/env jruby

NUM_THREADS = 5

puts RUBY_DESCRIPTION

require_relative 'guava-22.0.jar'

class CustomError < StandardError; end

class CacheLoader < com.google.common.cache.CacheLoader
  def load(key)
    raise CustomError
  end
end

class LoadingCache

  def initialize
    @cache = com.google.common.cache.CacheBuilder.newBuilder.build(CacheLoader.new)
  end

  def get(key)
    begin
      begin
        @cache.get(key)
      rescue com.google.common.util.concurrent.UncheckedExecutionException => e
        raise e.cause
      end
    rescue CustomError
      nil
    end
  end
end

shared_cache = LoadingCache.new

puts "Starting #{NUM_THREADS} threads"
threads = NUM_THREADS.times.map do
  Thread.new do
    begin
      10_000.times do
        shared_cache.get('asdf')
      end
      puts "#{Thread.current} finished successfully!\n"
    rescue => e
      puts "#{Thread.current} failed"
      puts "#{e.class}: #{e}\n\t#{e.backtrace.join("\n\t")}\n"
    end
  end
end
puts "Waiting for all threads to finish"

threads.map(&:join)

puts "DONE"

Actual Behavior

IllegalStateExceptions

Running the above script on 9.1.12.0 results in all threads except for one dying from IllegalStateExceptions. It does not fail if NUM_THREADS = 1.

$ jruby java_cause_bug.rb 
jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 Java HotSpot(TM) 64-Bit Server VM 25.121-b13 on 1.8.0_121-b13 +jit [darwin-x86_64]
Starting 5 threads
Waiting for all threads to finish
#<Thread:0x7648d86d> failed
Java::JavaLang::IllegalStateException: Can't overwrite cause with com.google.common.util.concurrent.UncheckedExecutionException: org.jruby.exceptions.RaiseException: (CustomError) CustomError
  java.lang.Throwable.initCause(Throwable.java:457)
  org.jruby.RubyKernel.maybeRaiseJavaException(RubyKernel.java:908)
  org.jruby.RubyKernel.raise(RubyKernel.java:841)
  org.jruby.RubyKernel$INVOKER$s$0$3$raise.call(RubyKernel$INVOKER$s$0$3$raise.gen)
  org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:204)
  org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:200)
  org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:338)
  org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:163)
  java_cause_bug.invokeOther52:raise(java_cause_bug.rb:28)
  java_cause_bug.RUBY$method$get$8(java_cause_bug.rb:28)
  org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:168)
  org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:161)
  java_cause_bug.invokeOther1:get(java_cause_bug.rb:43)
  java_cause_bug.RUBY$block$\=java_cause_bug\,rb$2(java_cause_bug.rb:43)
  org.jruby.runtime.CompiledIRBlockBody.yieldDirect(CompiledIRBlockBody.java:156)
  org.jruby.runtime.IRBlockBody.yieldSpecific(IRBlockBody.java:80)
  org.jruby.runtime.Block.yieldSpecific(Block.java:134)
  org.jruby.RubyFixnum.times(RubyFixnum.java:299)
  org.jruby.RubyFixnum$INVOKER$i$0$0$times.call(RubyFixnum$INVOKER$i$0$0$times.gen)
  org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:139)
  org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:145)
  java_cause_bug.invokeOther4:times(java_cause_bug.rb:42)
  java_cause_bug.RUBY$block$\=java_cause_bug\,rb$1(java_cause_bug.rb:42)
  org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:145)
  org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:71)
  org.jruby.runtime.Block.call(Block.java:124)
  org.jruby.RubyProc.call(RubyProc.java:289)
  org.jruby.RubyProc.call(RubyProc.java:246)
  org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:104)
  java.lang.Thread.run(Thread.java:745)
#<Thread:0xcdfc808> failed
[... removed stack since it's the same as above ...]
#<Thread:0x2b046212> failed
[... removed stack since it's the same as above ...]
#<Thread:0x552f9d5a> failed
[... removed stack since it's the same as above ...]

#<Thread:0x1275b47c> finished successfully!
DONE

Memory leaks

Regarding the memory leak, I have not yet been able to reproduce outside of production but running code similar to what is above resulted in several threads that each had a single 1.1GB RubyArrayOne object

leaked_obj

that contained many ConcreteJavaProxys that wrapped the exception being thrown:

leaked_concrete_java_proxy

I changed our prod code from

begin
  foo
rescue com.google.common.util.concurrent.UncheckedExecutionException => e
  raise e.cause
end

to

begin
  return foo
rescue com.google.common.util.concurrent.UncheckedExecutionException => e
end
raise e

and the memory leak went away. So I believe that re-raising a cause from inside the rescue in cases like this can also result in circular dependencies that leak memory.

The memory leak occurred on Linux with JRuby 9.1.7.0:

$ uname -a
Linux someserver 2.6.32-696.3.1.el6.x86_64 #1 SMP Tue May 30 19:52:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ ruby -v
jruby 9.1.7.0 (2.3.1) 2017-01-11 68056ae OpenJDK 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [linux-x86_64]

Let me know if there is any more useful detail that I should add!

@kares
Copy link
Member

kares commented Jun 30, 2017

interesting, so the leak was only relevant on 9.1.7.0 (am confused since the reports starts with 9.1.12) ?

@kovyrin
Copy link

kovyrin commented Jun 30, 2017

We @swiftype have 9.1.7.0 in production and that's where we've observed the leak. We've rolled back, updated the code and deployed the changed version to fix the leak. 9.1.12.0 is mentioned because it was one of the versions tested locally while trying to reproduce the issue locally in development.

@enebo
Copy link
Member

enebo commented Aug 30, 2017

@kovyrin so one theory would be we are storing the result of cause in a temporary variable in that rescue and then raising out the value of that temp. Since we are not niling out temporary variable it would leak. HOWEVER, this is unwinding the stack because we are raising so that temporary variable should no longer be in scope and freed. If I was to keep with this theory this thread basically dies here and does not unwind which would then leak...but then I think you would have lots of threads laying around. So probably not that? hmm

@headius
Copy link
Member

headius commented Aug 30, 2017

Ok, so there seems to be two issues here.

  1. The setCause error when raising a Java exception. This appears to be due to logic in Kernel#raise that, when given a Java exception, attempts to set its cause to whatever the current Ruby "cause" exception is, if that exception was also a Java exception.
  2. The leaking of memory. This may be a side effect of the first problem, since these exceptions could easily be happening from during resource finalization (ensures, etc).

The first problem is caused by two conditions in Throwable.initCause:

  1. The cause may not be initialized twice. This is the source of the reported stack trace.
  2. The cause for a throwable may not be set to that throwable. We do check this case.

I will fix the double-init issue. @marshalium would you be able to test a master or 9.1.13 HEAD build?

@headius
Copy link
Member

headius commented Aug 30, 2017

I've pushed a fix to master and jruby-9.1.13.0 branches.

@marshalium
Copy link
Contributor Author

@headius Thanks for looking into this! Unfortunately, I ran the script from from my original comment on the jruby-9.1.13.0 branch and am still seeing the same errors.

I dug into this a bit, and I think I discovered more of what's actually going on, but I don't know how to fix it.

I added code to RubyKernel.maybeRaiseJavaException(), right before the ex.initCause call, to log thread and object ids:

if (maybeThrowable instanceof Throwable && ex != maybeThrowable && ex.getCause() == null) {
    System.out.format("Setting cause; thread_id=%s, ex.exception=%s, ex.id=%s, ex.cause=%s\n", Thread.currentThread().getId(), ((RaiseException) ex).getException(), System.identityHashCode(ex), ex.getCause());
    ex.initCause((Throwable) maybeThrowable);
}

And what it logged showed that multiple threads are trying to set the cause on the exact same exception object:

Setting cause; thread_id=13, ex.exception=CustomError, ex.id=500827034, ex.cause=null
Setting cause; thread_id=16, ex.exception=CustomError, ex.id=500827034, ex.cause=null
Setting cause; thread_id=14, ex.exception=CustomError, ex.id=500827034, ex.cause=null
Setting cause; thread_id=17, ex.exception=CustomError, ex.id=500827034, ex.cause=null
Setting cause; thread_id=15, ex.exception=CustomError, ex.id=500827034, ex.cause=null

The check that got added didn't fix anything since this is race condition where multiple threads can see the cause as null but only one of them can win when calling initCause. Also, ex.getCause() == null was already being checked on line 904 so I believe the check added is unncessary even in single threaded situations.

The reason I was seeing this with Guava CacheLoader is that when multiple threads try to load the same key in parallel, threads after the first one block on a future. Once the loading thread sets an error on the future the waiting threads see it and raise an UncheckedExecutionException whose cause is set to the error instance from the future. When the JRuby code re-raises the cause of the UncheckedExecutionException multiple JRuby threads end up raising the exact same exception instance and so they get IllegalStateExceptions.

Here's a gist with a simpler method of reproducing the issue that does not depend on Guava: https://gist.github.com/marshalium/64e9ebb4bb4ac107eee1f5a629feee18

@headius
Copy link
Member

headius commented Sep 5, 2017

Ok, so this is a weird behavior of initCause that interacts badly with threads. Thanks for the extra investigation.

I think the options we have are:

  1. Don't set the Java cause to the Ruby exception.
  2. Set the cause, but guard it behind an "already set" check and exception handling for the concurrent case.
  3. Set the cause under lock to avoid setting it twice.

I am leaning toward (1) since Java typically wants you to init the cause yourself, rather than doing it automatically. I will implement (2) or (3) for now.

@headius
Copy link
Member

headius commented Sep 5, 2017

Another option would be not setting cause, but adding the Ruby exception to "suppressed".

@headius
Copy link
Member

headius commented Sep 5, 2017

New repro gist: confirmed broken on master.
With sync fix: confirmed fixed on master.

Still lots of debate in my mind about whether to sync, set suppressed, or not set anything.

@headius
Copy link
Member

headius commented Sep 5, 2017

sync patch pushed to master in 1f2e040

@headius
Copy link
Member

headius commented Sep 5, 2017

@kares I think you added the logic to set cause. What would you think about just dropping it? Java does not automatically set a cause for you unless you pass one into the constructor, so us setting it later based on a Ruby raise -- which might happen many times for the same throwable -- seems like the wrong behavior.

@headius
Copy link
Member

headius commented Sep 5, 2017

I vote for removal.

@kares
Copy link
Member

kares commented Sep 6, 2017

right, this is logic added in 9.1 to go along with with Ruby's (Ruby) exception.cause handling b8933ee

not sure how much its used/spec-ed (probably very little if any). seemed like a good fit to match JI the way Ruby's cause works - a passed or found in error-info gets propagated.

JRuby already has some places where the RaiseException's cause is init-ed (pbly all local - single threaded), since its often hard to tell a failure not seeing the native Java cause - that was the motivation to further help reduce the pain around JI although it makes much less sense now that I think about it.

obviously did not think too far :) ... feel free to remove for 9.2 we can always re-add later if someone complains he/she is not getting its Java cause set on a throwable raise from .rb land.

@headius
Copy link
Member

headius commented Sep 6, 2017

I will remove it and we can discuss further whether we want to restore this or something like it for 9.2.

My remaining thoughts on it:

Java only allows setting cause at construction time, or exactly once after that. Ruby sets it to ! f o r y o u w h e n r a i s i n g a n e x c e p t i o n i f t h a t e x c e p t i o n h a s n o c a u s e a n d ! is not that exception.

In Java, cause is never automatically set for you.

If someone is re-raising a Java exception in Ruby, they probably don't want to attach a cause (which might be the source of leaks, if those exceptions carry too much data and stick around).

If someone is raising a new Java exception, they might want to provide a cause. We can't easily do it automatically since we would have to modify the constructor call, and we probably wouldn't want to add something that magic. Unfortunately we also can't pass the exception, because the object in $! is a RubyException, not a subclass of Throwable.

I think we should open a new feature/bug to make it possible for a user to provide a cause to Java, but more and more I think doing it automatically is a mismatch.

@marshalium
Copy link
Contributor Author

I've confirmed that my tests pass on 9.1.13.0. I can't test for the memory leak until we get 9.1.13.0 into production, but I'll file another ticket (ideally with some kind of dev reproducible case) if we encounter that again.

Thanks for the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants