Possible ChannelFD leak in FilenoUtil? #4796
I also tried to dump the thread stack in the register and unregister methods, but it seems unregister is never invoked.
If it helps at all, running jruby-9.1.13.0 on my Mac gives me the following related warning message on every run:
A fully updated JRuby 9.1.13.0 on Windows 8.1 x64 with JDK 9.0.2 x64 gives me the following error:
@seanguo There certainly could be a leak, but your test may not be quite rigorous enough to prove it. The fake filenos acquired by the container would likely be attached to IO objects that need to finalize. Assuming the container and everything it references goes away, a couple of GC cycles may be needed to fully unregister those filenos. If, however, you can show that the fake filenos continue to pile up under heavy load across multiple GC cycles, then we may have a problem. We have (or used to have) tests for fileno leaks, but I'm not sure if they're running.
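As an illustration of the kind of check described above, here is a minimal Java sketch that polls the wrapper count across several GC cycles, using the same FilenoUtil accessor that appears in the test app below. The FilenoCheck class name and the workload placeholder are hypothetical.

```java
import org.jruby.Ruby;

public class FilenoCheck {
    public static void main(String[] args) throws InterruptedException {
        // ... run the embedding workload under test here ...

        // Poll the wrapper count across several GC cycles: a real leak
        // keeps the count growing, while wrappers merely awaiting
        // finalization should drop back down.
        for (int cycle = 0; cycle < 5; cycle++) {
            System.gc();
            System.runFinalization();
            Thread.sleep(1000); // give finalizers a chance to run
            System.out.println("cycle " + cycle + ": "
                + Ruby.getGlobalRuntime().getFilenoUtil().getNumberOfWrappers());
        }
    }
}
```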
@headius Actually we found this issue during our load test. The heap continues to grow and a full GC doesn't reclaim the memory, so we made a heap dump, which shows the file descriptor leak points. That's why I created this issue and attached the small program for you to reproduce.
Complete test app with imports:

```java
import org.jruby.Ruby;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleScriptContext;
import java.io.IOException;

public class Test {
    public static void main(String[] args) throws InterruptedException, ScriptException, ClassNotFoundException, IOException {
        ScriptEngineManager scriptEngineManager = new ScriptEngineManager();
        final ScriptEngine engine = scriptEngineManager.getEngineByName("jruby");
        final ScriptContext scriptContext = new SimpleScriptContext();
        scriptContext.setAttribute("org.jruby.embed.termination", true, ScriptContext.ENGINE_SCOPE);
        engine.eval("print 'test\n'", scriptContext);
        System.gc();
        Thread.sleep(5000);
        System.out.println(Ruby.getGlobalRuntime().getFilenoUtil().getNumberOfWrappers());
    }
}
```
Ok, so there are a few problems here. First off, the engine never goes out of scope, so I wouldn't expect it to go away. If the engine doesn't go away, the fake filenos registered by JRuby to serve that engine won't go away either. Second, because the JSR-223 scripting logic in JRuby uses the "global" runtime, that runtime never goes out of scope and remains referenced. As a result, the IO objects that wrap those fake filenos (in this case, for in/out/err) never get collected. I modified your script to loop 100 times over the script engine logic, creating and executing 100 engines. The fileno count does not go beyond 3, regardless of how many engines execute. I also made a Ruby script that generates hundreds of fake filenos. It does not appear to leak, and the filenos do get unregistered.

```ruby
loop do
  100.times do
    java.io.ByteArrayOutputStream.new.to_io
  end
  JRuby.gc
  p JRuby.runtime.fileno_util.number_of_wrappers
end
```

If you would like finer-grained control over scripting JRuby from Java, including the ability to use multiple, nonglobal runtimes and to shut down the instance when you're done, I would recommend you look at our embedding API in org.jruby.embed.
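A minimal sketch of that embedding API, assuming a ScriptingContainer with a single-threaded local context so the runtime is not the global one; exact configuration will vary with your application.

```java
import org.jruby.embed.LocalContextScope;
import org.jruby.embed.ScriptingContainer;

public class EmbedExample {
    public static void main(String[] args) {
        // SINGLETHREAD gives this container its own non-global runtime,
        // so its state is not pinned by the global runtime
        ScriptingContainer container =
                new ScriptingContainer(LocalContextScope.SINGLETHREAD);
        try {
            container.runScriptlet("puts 'test'");
        } finally {
            // tears down the runtime when you're done with it
            container.terminate();
        }
    }
}
```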
I'm closing this because there's no evidence that we're leaking fake filenos. If you are seeing leaks in your application, I suspect it's because you're using JSR-223 engines, registering a bunch of new IO objects, and then never cleaning them up. Your workarounds are:
This used to be fine with 1.7.27, so the implementation must have changed in 9.x. @headius Can you share more details on your first workaround:
Thanks.
@headius This has become a blocking issue for us now. I understand what you said: the JSR-223 engines share the same global runtime, which prevents the IO objects from being garbage collected. As for your second suggestion, it's also not an easy solution for us. We use the JSR-223 API as an abstraction layer to invoke multiple script engines, and using JRuby's embedding API would make the code much more complicated. Also, since you understand that part much better, is it possible for you to provide a solution at the JSR-223 level that lets us isolate the engines/runtimes? Please advise. Thanks.
@seanguo were you able to find the issue / a fix?
Hi @headius, in addition to @seanguo's comment: we find the leak issue only happens while using a pool to reuse the engines. We have been working on the leak issue for a couple of days and tried several solutions but still don't have any clue. Looking forward to your suggestions. Thanks.
We are seeing leaks similar to @oozzal's, and downgrading to 9.1.12.0 actually fixes them for us.
Two reporters show a leak appearing between 9.1.12 -> .13. Re-opening as regression. |
Clarification - we downgraded from 9.1.16.0 |
@fxposter so .13 might not be where the regression started. If you have a chance to see when it started, that would be helpful.
@fxposter If you can provide a heap dump of a badly-leaked application it would also help us sort out how the leak is happening. At the moment we have no good way to reproduce this, since all reports are based on user applications we don't have access to. @tcheng I did not notice your comment until now. This is good information, along with @seanguo's findings. I suspect the problem is due to reusing the same engine repeatedly but registering new Java streams for it to use (e.g. for stdio). I will investigate why we're not ever cleaning up abandoned streams.
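A hypothetical Java sketch of that suspected pattern: one engine reused across many evals, each handed a fresh Writer. The class name and loop count are illustrative, and whether each fresh stream registers a new fake fileno is exactly the suspicion being investigated here.

```java
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.SimpleScriptContext;
import java.io.StringWriter;

public class ReusedEngine {
    public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("jruby");
        for (int i = 0; i < 1000; i++) {
            ScriptContext context = new SimpleScriptContext();
            // each fresh Writer is a new Java stream the runtime has to
            // wrap for stdio, possibly registering another fake fileno
            context.setWriter(new StringWriter());
            engine.eval("puts 'hello'", context);
        }
    }
}
```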
Ok, after a bit more investigation I don't see anything obvious about fileno registration/unregistration. The fileno gets registered when we initialize an IO instance with a stream, and when the IO is closed or finalized it should be unregistered. We really need a reproduction.
I have created an app that recreates what I think is the same (or a related) problem. Unfortunately, the code causing the memory issue is buried in the AWS SDK v1 gem. I'm stepping into their code to try to get a cleaner repro but it's a little time-consuming so I thought I would share what I have, including a zipped heap dump: https://gist.github.com/mtjhax/92df4b277e16adfca951bf3d180c4f1a Note that I am seeing an issue in 9.1.15.0 and not in any earlier version, but the heap dump is showing retained objects similar to what @oozzal posted.
@mtjhax Thanks for the heap dump and efforts to simplify a reproduction! I'll have a look at the dump and see what it tells me.
I can definitely see the leaked ChannelFD but I do not see the reason for it yet.
I may have found a clue! It seems that almost all of the leaked ChannelFD are associated with either IOReadableByteChannel or IOWritableByteChannel, which simply wrap an IO-like object and simulate a readable or writable channel. These objects are constructed in two places in the Java extension to IO. I will investigate this and try to come up with a better option.
Oh, some clarification of this code, and a question. First off: the logic I pointed out is used if you are doing copy_stream with an object that is not an IO (an IO-like object that merely responds to read or write). Does this seem like something your code or libraries you call might be doing? Perhaps somewhere in the AWS libraries?
I believe I've reproduced the issue with the following code, which gobbles up memory very quickly for all the registered ChannelFD:

```ruby
io1 = File.open("pom.xml")
io2 = Object.new
def io2.write(stuff); stuff.size; end
loop { IO.copy_stream(io1, io2) }
```
In my case, whatever is going on is coming out of the AWS v1 SDK gem. Going to poke around the code paths I'm using -- should be possible to find something. |
It looks like we have a winner. I will land a patch shortly that avoids the use of a temporary IO object for "IO-like" streams, and so avoids the nasty little autoclose/unregister dance altogether. Given this patch, we can make the following comparison based on my script above. The before case continues to grow and grow while the after case levels off. There are also huge pauses in the before case every time it has to copy a multi-gigabyte heap to make room for the leak. |
Fixes #4796. The logic here was wrapping any non-IO, non-String, IO-like object with a readable or writable channel wrapper, itself wrapped in a new IO object. The outer IO object was set to not autoclose, to avoid closing the IO-like thing it contained. However, these semantics do not fit the fully-transient nature of the pseudo-channel, and caused the ChannelFD created to wrap it to leak in our FilenoUtil registry. The new logic cleans up some dead paths, renames some variables for clarity, and avoids using an IO wrapper when we simply need a pseudo-channel. This avoids leaking ChannelFD and fixes #4796.
Ok, I've pushed a fix to the jruby-9.1 branch, which will show up as a stable nightly build here: http://jruby.org/nightly Those of you having issues can check it out. I've just kicked off a new build right now, so it should be there soon. Still need to write a test that verifies that IO-like read/write streams passed to copy_stream do not get registered.
I thank you sir! And I dearly wish that I had time to actually help out by tracking this down in the sources myself but as you might suspect, I am currently engaged in battle with AWS gems. |
Environment
JRuby 9.1.13.0
Linux
Used as an embedded JSR-223 engine
We upgraded from 1.7.21 to 9.1.13.0 to try to address the issue in #4446. But after a simple load test we found that the map in FilenoUtil keeps accumulating entries during the test. So I wrote the following simple script to try to reproduce:
Sample Code Snippet
(the complete test app, with imports, is shown in the comments above)
Unexpected Behavior
I got the following output:

```
test
3
```
Shouldn't the number be 0?
Is this an issue with the new code, or is my usage incorrect for the new version?
Thanks.