Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ChannelDescriptor leak after close of ScriptingContainer #2333

Closed
camlow325 opened this issue Dec 18, 2014 · 10 comments
Closed

ChannelDescriptor leak after close of ScriptingContainer #2333

camlow325 opened this issue Dec 18, 2014 · 10 comments
Milestone

Comments

@camlow325
Copy link
Contributor

In doing some memory leak analysis, I found that some ChannelDescriptors allocated for a ScriptingContainer instance are not cleaned up when terminate is called. The following snippet of code was sufficient for me to reproduce the problem:

ScriptingContainer sc = new ScriptingContainer (LocalContextScope.SINGLETHREAD);
sc.runScriptlet("puts 'hello world'");
sc.terminate();

ChannelDescriptor has a static field called filenoDescriptorMap which is used to hold onto the various ChannelDescriptors that have been registered. The map has three extra descriptors registered in it after each iteration of a loop containing the lines of code above. I found that the descriptors were allocated from RubyGlobal.createGlobals (https://github.com/jruby/jruby/blob/1.7.17/core/src/main/java/org/jruby/RubyGlobal.java#L211-L213):

IRubyObject stdin = new RubyIO(runtime, STDIO.IN);
IRubyObject stdout = new RubyIO(runtime, STDIO.OUT);
IRubyObject stderr = new RubyIO(runtime, STDIO.ERR);

As a hack, I had tried adding the following three lines to the tearDown method in org.jruby.Ruby:

((RubyIO)getGlobalVariables().get("$stdin")).close();
((RubyIO)getGlobalVariables().get("$stderr")).close();
((RubyIO)getGlobalVariables().get("$stdout")).close();

With these changes in place, the ChannelDescriptor objects no longer were leaked, but the underlying STDIN, STDOUT, and STDERR streams were all closed. This doesn't, therefore, seem to be a workable solution to the problem. Seems like there would need to be a better way exposed to unregister those descriptors from the map without having the underlying streams be closed.

I have the complete code for this reproduction case at https://github.com/camlow325/ScriptingContainer_LeakRepro. I've been able to reproduce the issue with JRuby 1.7.15 and 1.7.17.

For reference, the Puppet Server project from Puppet Labs creates SINGLETHREAD ScriptingContainers. See https://github.com/puppetlabs/puppet-server/blob/master/src/clj/puppetlabs/services/jruby/jruby_puppet_core.clj#L148. The Puppet Server project recently added a feature which entails terminating and starting replacement ScriptingContainers periodically while the service is still running - to allow users to periodically free stale Ruby state for those containers based on changes to underlying Ruby code deployed on the server. Doing memory leak analysis around that feature allowed us to find this issue.

@headius
Copy link
Member

headius commented Dec 18, 2014

This was code I added years ago, and it continues to bite me in the ass. Statics are bad.

There is a workaround in environments where you control the whole classloader into which JRuby was loaded: clear the list yourself using reflection to access the list in ChannelDescriptor. I know that's not ideal, but a few other folks doing heavy embedding have done this.

The fix is to make this list per-runtime, as in JRuby 9k (which has largely reimplemented all of IO). I would have attempted this a few months ago...but I hoped we'd have 9k done sooner :-(

I'll see what I can do.

@camlow325
Copy link
Contributor Author

Thanks for the quick response, @headius. Yeah, I have confirmed that if we get the private filenoDescriptorMap and call .clear on it after a ScriptingContainer.terminate call that the ChannelDescriptor objects for STDIN, STDOUT, and STDERR are freed up without freeing the underlying streams.

Unfortunately, I'm not sure if that will be a workable solution for Puppet Server. A Puppet Server may have many ScriptingContainers active at a given time and there's not a defined point where calling clear on the static map would be safe -- i.e., without potentially blowing away descriptors that are still in use by one of the containers. Also, I'm not sure if callers of ScriptingContainer would have a way to uniquely identify which container was associated with the descriptor in the map for a solution that would involve surgically removing only some of the elements back out of the map.

It's good to know that JRuby 9k will have a good solution for this. We're definitely excited about JRuby 9k. Thanks for anything you can think of to address this in the 1.7.X timeframe!

@headius
Copy link
Member

headius commented Dec 18, 2014

Generally the fileno map is only used for IO.from_fd/new forms that take a fileno and expect an IO back. If you are able, it would be worth testing the clearing logic to see if it actually breaks anything.

I'm working on a regression right now but will put this issue next on my list.

@headius
Copy link
Member

headius commented Dec 18, 2014

I have a fix but I request review from relevant folks (@enebo, mostly).

diff --git a/core/src/main/java/org/jruby/Ruby.java b/core/src/main/java/org/jruby/Ruby.java
index e3ca02d..3d7b5d3 100644
--- a/core/src/main/java/org/jruby/Ruby.java
+++ b/core/src/main/java/org/jruby/Ruby.java
@@ -3223,6 +3223,11 @@ public final class Ruby {
             getJRubyClassLoader().tearDown(isDebug());
         }

+        // Unregister stdio ChannelDescriptors so they don't leak
+        ChannelDescriptor.unregisterDescriptor(getFilenoExtMap(0));
+        ChannelDescriptor.unregisterDescriptor(getFilenoExtMap(1));
+        ChannelDescriptor.unregisterDescriptor(getFilenoExtMap(2));
+
         if (config.isProfilingEntireRun()) {
             // not using logging because it's formatted
             ProfileCollection profileCollection = threadService.getMainThread().getContext().getProfileCollection();
diff --git a/core/src/main/java/org/jruby/util/io/ChannelDescriptor.java b/core/src/main/java/org/jruby/util/io/ChannelDescriptor.java
index 201665e..9b53220 100644
--- a/core/src/main/java/org/jruby/util/io/ChannelDescriptor.java
+++ b/core/src/main/java/org/jruby/util/io/ChannelDescriptor.java
@@ -863,7 +863,7 @@ public class ChannelDescriptor {
         filenoDescriptorMap.put(descriptor.getFileno(), descriptor);
     }

-    private static void unregisterDescriptor(int aFileno) {
+    public static void unregisterDescriptor(int aFileno) {
         filenoDescriptorMap.remove(aFileno);
     }

This makes my trivial test case and your repository behave as expected:

[INFO] --- exec-maven-plugin:1.3.2:java (default-cli) @ ScriptingContainer_LeakRepro ---
[WARNING] Warning: killAfter is now deprecated. Do you need it ? Please comment on MEXEC-6.
hello world
Size of filenoDescriptorMap: 0
hello world
Size of filenoDescriptorMap: 0
hello world
Size of filenoDescriptorMap: 0
hello world
Size of filenoDescriptorMap: 0
hello world
Size of filenoDescriptorMap: 0
hello world
Size of filenoDescriptorMap: 0
hello world
Size of filenoDescriptorMap: 0
hello world
Size of filenoDescriptorMap: 0
hello world
Size of filenoDescriptorMap: 0
hello world
Size of filenoDescriptorMap: 0

Concerns I have:

  • Because this now removes the fileno mapping, a terminated ScriptingContainer will break if used again (because it won't have proper mappings for stdio filenos). This behavior only ever worked because we leak, and there's no way to fix the leak without breaking that behavior. In 9k, we have also fixed the leak and introduced the issue with singleton instances, but 9k will attempt to start up a new global runtime when the old one has been terminated. We may need that change here too.
  • I feel like it might be possible for this to accidentally unregister some other runtime's filenos, but I feel even stronger that it's unlikely to happen.
  • I considered explicitly unregistering all mapped filenos in Ruby.java's fileno ext/int map, but other than stdio those should all get cleaned up during normal IO teardown logic.

@headius
Copy link
Member

headius commented Dec 18, 2014

This is the commit where I fixed handling teardown in SINGLETON containers: ef58d57

@camlow325
Copy link
Contributor Author

Because this now removes the fileno mapping, a terminated ScriptingContainer will break if used again (because it won't have proper mappings for stdio filenos).

I think that's a reasonable tradeoff. From our perspective at least, this shouldn't pose a problem since we do not attempt to use a ScriptingContainer again after calling .terminate on it.

I considered explicitly unregistering all mapped filenos in Ruby.java's fileno ext/int map, but other than stdio those should all get cleaned up during normal IO teardown logic.

That sounds reasonable and is consistent with what we've seen as well.

@enebo
Copy link
Member

enebo commented Dec 18, 2014

Talked with @headius about this on channel. This fix also needs a second fix @headius applied to master involving terminate with SINGLETON containers. Since we want to release in the next day or so we wil defer this for 1.7.19 and you will get a patch/branch with this fix on it now to give us a gold stamp of approval.

@enebo enebo added this to the JRuby 1.7.19 milestone Dec 18, 2014
@headius
Copy link
Member

headius commented Dec 18, 2014

I have pushed fix + test to test-fix-2333 branch. As @enebo mentioned, it's too late/risky to put in 1.7.18, but it's probably best that you verify it before we put it into a release.

@camlow325
Copy link
Contributor Author

@headius I tested with your commit locally merged into a clone of JRuby 1.7.17. Change looks good - leak is gone.

@headius
Copy link
Member

headius commented Dec 22, 2014

Merged to 1.7. Thanks for confirming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants