-
-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metaspace fills with java.lang.invoke.LambdaForms$ over time #4391
Comments
First confirm for me whether or not you are enabling invokedynamic. 9k still uses it for a few things even when off, which would narrow things down a bit. I'm not sure if "1000s" is a lot or not. Every chain of LambdaForm used internally by method handles could be a class eventually, and given that we use it for constants, globals, and a few other things it is easy to see how thousands of such sites might exist in a Rails app. I would suggest a few things:
|
Thanks for the feedback. We have invokedynamic off, it is too problematic. I will consider updating to a newer OpenJDK version. Is there a difference between OpenJDK & Oracle here? I have the heap dump, I will do more investigation and try and determine what has references to those classes. |
I can open the Heap Dump with Eclipse MAT. The issue is from what I can tell, all the heap dump tools are designed to show you the things holding references to the instances of a class. Each of these classes has zero instances. But something can still hold an instance to a class, which I suspect is happening. On JRuby 1.7.22 (same app) I can see the class unloaded count go up when the metaspace size go backs down. This is the expected behavior with CMS class unloading enabled. This never happens on 9k. I ended up parsing the Hprof output myself and running it against a heap dump from a trivial Do I just need to produce some sort of tree structure tracing those references back to an object in the |
OracleJDK and OpenJDK are 99% the same thing. LambdaForm classes should still be heap objects of type java.lang.Class, so that's how you'd track their reference chain. If they are not loaded into a normal classloader (I believe this is the case; they use an "AnonymousClassloader") the problem may actually be a OpenJDK bug. This is especially likely if you're seeing large numbers of LambdaForm classes with no instances. JRuby itself never even touches LambdaForm objects directly; we would hold references to MethodHandle objects, which themselves should hold a LambdaForm instance. There's no way to call a method against a Class reference. Is there sensitive information in your heap dump, or would be you able to share it with us and with OpenJDK engineers? It's starting to sound like it's not our bug. |
Philosophically, I agree that the class object should be an instance of Inspecting any java heap dump with jvisualvm shows me only 9 instances of So I'm not sure that is helpful. |
Ok, I tossed an email to some JVM friends, and they don't know why you'd see LF class counts this high. One of them thought perhaps there's a lot of caching happening and not enough GC pressure to force the soft links to clear. You might try passing You do mention that the metaspace expands enough that full GCs start happening so I wouldn't be surprised if that flag doesn't help. It would be very helpful if I could look at that heap, even if you just host a jhat instance I can poke around in for a while. |
I will put that on my list of options to try next time we run the application in production. |
Thanks for your help with this. Don't worry, if there's a problem we'll find it :-) |
We have options configured on the JVM to force hourly full GCs, additionally the JVM is literally out of metaspace so full GCs wind up happening frequently. I've had a chance to do some more investigation into this. Since I've established there are no instances of the problematic classes, I've done the following
Obviously there is more to it with various optimizations to permit this to run in minutes on a multi-gigabyte heap dump rather than days. The output of this produces a tree that traces back to JRuby objects that are thread contexts as far as I can tell. The tree is unfortunately too large to simply look at and try and look at without additional processing. So somehow the class definitions are still referenced. Notably, I found in the middle of this tree references to a JRuby class I did conclude that the only problematic class is I am leaning towards this being an OpenJDK issue, because for the first 3 levels of the tree I produced consists of class instances that are entirely internals of the JVM. In other words it appears that JRuby does not directly interact with this class ever. But I feel like it could still be caused by the breadth of our application. Is there a Java method inside of JRuby that can be called to flush the optimized code paths? This could be called with a long period to possibly prevent accumulation of these class definitions. I think the JVM has a similar concept in relation to its JIT engine that allows the flushing of JIT'd code. The first JRuby class I find in the tree is Are there further options in JRuby I could alter to troubleshoot this? If we think this is an OpenJDK problem, do we suspect the same problem exists in Oracle's JDK? |
Thank you for the very thorough investigation! I am also leaning toward this being a bug in OpenJDK, though I won't rule out something being amiss in JRuby. I'd actually prefer the latter, since I can fix JRuby :-)
CompiledIRMethod is basically what gets stored in a class's method table when we JIT a method defined in Ruby. You'll probably see a MixedModeIRMethod above that, which triggers that JIT to happen after 50 calls. Most code in a typical application will never JIT because it's never called (or not called enough to matter). Only code that JITs will produce a CompiledIRMethod (unless you're running in a mode that forces JIT to run earlier or immediately) which for a recent Rails app we investigated was only a couple thousand methods. IR refers to JRuby 9k's compiler, which has its own Intermediate Representation. DMH handles are basically the last stop before the target method. They're the direct function pointer. So in theory you should only ever see 2 * number of jitted methods in JRuby (2x because many methods will have a variable-arity and a specific-arity entry point; that's the "specific" you saw). So again, unless your system is huge AND all those methods are getting hit, I would not expect tens of thousands of DMH.
Well, you could try disabling JIT altogether. That would reduce our use of invokedynamic to almost nothing. But that's not really what you want, since it will run many times slower without JIT. You're also not running with invokedynamic "on" in JRuby, so we can't do any more there. JRuby will still use method handles for some types of variables, frame construction, and constants, but these should all resolve to a handful of back-end method handles that rarely change. So I guess I'm not sure.
My OpenJDK friends would like to know more about this. It has been suggested that this could be "method handle caching gone rogue" (unexplored) or that soft references are keeping too much alive (unlikely since you have tried full GCs). Outside that, we'd need to be able to provide them some way to reproduce, which may simply be providing them a heap dump or may require runnable code. I suggest you join the MLVM mailing list here: http://openjdk.java.net/projects/mlvm/ Jump on my thread about this issue and share whatever you can. The folks listening are JVM guys at Oracle who work on exactly this stuff. We'll leave this open for now. |
We ran with |
I'm still looking into this. Does it make sense that each use of string interpolation could emit one of these? I see lots of |
@hydrogen18 we don't use EvStrNode for execution in JRuby 9k. We translate to our own bytecode and then interpret or JIT that generated code. We will have ast nodes in memory in 9k but only because they have never been executed (we lazily generate IR on demand). |
@hydrogen18 @enebo Yeah that is surprising. I'll look around the area. |
I don't see anything special about EvStr or how it is used. It does not reference any method handles directly, but perhaps something downstream from it does. If it's at all possible for us to poke around a heap dump we might be able to find this more quickly. I'm willing to SSH or VPN to your server if you'd rather keep the data locally (a heap dump could contain sensitive information from your app). Alternatively if you can somehow provide examples of root reference chains that are keeping DMH alive, we could at least see a path. |
@headius I'm pretty close to being able to generate those chains. I need to parse the thread stacks and track down a few bugs I can't explain. |
Ok, we'll be standing by then. |
I got permission to share the heap dumps with you @headius for analysis purposes. What email can I send you the details at? You can contact me at ericu@spiceworks.com |
@hydrogen18 I have the dumps and will inspect them tonight or tomorrow. |
Tell me, do you make a lot of singleton classes in your application? What I'm seeing so far are a large number of DMH being held by MetaClass objects (our singleton class representation) for the "idTest" used to type-check a target module when caching a constant retrieved from it. Or at least, that's what it would be used for if that code were live. Instead, the handle is never used and we create one for every RubyModule, which includes all MetaClass. I am modifying the code to make idTest lazy but since it is not actually used by any live code all those instances should just disappear. In case we decide to use it again (it might be useful for other checks) I also modified it to reuse the same DirectMethodHandle rather than a new one for each RubyModule instance. I'll commit this and hopefully you will be able to test it. Investigation continues but these references seem to dominate the list of method handles in heap. |
The handle here was originally intended for use in checking that the target module for a constant cache is the same as when the value was cached. However the code that used it is no longer active, but we still created these handles for every module, class, singleton class, included module, etc in the system. This commit makes the acquisition of this handle lazy, and also modifies it to reuse the direct handle reference to the test method rather than recreating that every time. This should reduce the number of method handles in flight. For #4391.
When you say singleton class, do you mean Ruby's Singleton or something else? |
I mean lots of Also candidates: lots of anonymous modules or classes. Is it easy for you to test the change I made? |
As far as using We also call The only thing I have not found widespread usage of is Ruby's In any case, testing the change is not particularly easy but I should be able to figure out a way to do it. Previously someone linked me to a complete one-off build of JRuby that I basically just dropped over my existing one. Can we do that? |
Ugh, ok, so this is a very good candidate fix then. The dependency-injection you speak of is a particularly bad case, since it creates thousands of one-off throw-away classes that are each quite heavy (and in this case, referring to other things that are heavy). Eliminating this handle by making it lazy should also reduce the cost of those classes, at least a little bit. It should be possible to take a JRuby build from master and just drop lib/jruby.jar into any other installation of JRuby. Depending on how you're packaging it, the jruby-jars gem might need to be monkeyed with as well. If you're able to build master (need maven 3.1+, then clone, |
Also, thank you and the rest of the developers for taking time to investigate this particular issue. This is one of our blocking issues on the path to getting on a more modern Ruby implementation. |
@hydrogen18 No problem! This is a good find, especially since it doesn't appear that we are using this field at all. I'm auditing that now but may simply remove the field until we actually need it. |
I checked out the commit you mentioned here. It did not work, we got Null Pointer Exceptions from the IR classes. I checked out the |
@hydrogen18 Ok that's good and bad to hear :-) If you haven't already, please file bugs for the NullPointerException. Master should always work, even when things are in flux as they are now. Here's hoping the fix sticks! 🤞 |
I've got the patched build in prod presently. The performance seems a bit better. The class count is continuing to increase (which I expect). It should hit a large enough value for GC to start unloading old classes by tomorrow morning. |
@hydrogen18 Ok, thanks for the update. Hopefully when GC fires it cleans everything up that it's supposed to. |
We ran our app for approx. 24 hrs. It did not max out its metaspace this time. From the heap dump, I see the following anomalies
It does seem to be a bit better behaved since it did not grow to the maximum metaspace size. If you think inspecting the heap dumps again would be beneficial let me know. There are also a huge number of classes that correspond to RubyJIT method, but I expect that given our configuration. |
So we still have a lot of LambdaForm classes floating around. If you want to pull another couple heap dumps, I'd like to take a look at it. I'm glad we've resolved your base issue, at least for now. If there's no instances of the eval classes, they shouldn't survive a full GC (or two). If they do that's something worth looking into. The RubyParser classes are known; it's a very large parser and each production gets its own little function class to avoid the base code being too big for the JVM to load. 😆 The invoker classes are also expected; they're used to bind the Java-based core class methods. |
Thanks, I'll contact you via email with the heap dumps |
Ok, I'm looking at app4 now and I see the 20k of various LambdaForm classes. But the retained heap is under 5MB, so I'm not too worried about them. I surveyed the first few thousand instances and they all appear to be from jitted methods, which is to be expected since that's how we bind jitted methods back into our class structure. Given that we've drastically reduced the number of handles and LambdaForm floating around the system, and your situation seems to be resolved, I'm going to call this one fixed. BTW, there was a small additional fix to my fix that should reduce duplicate handles further: 24a14e4 Let us know if you have further metaspace problems. |
Environment
JRuby version: 9.1.60
java -version:
openjdk version "1.8.0_91" OpenJDK Runtime Environment (build 1.8.0_91-b14) OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
OS:
Linux community-app2.onr.spiceworks.com 2.6.32-573.18.1.el6.x86_64 #1 SMP Tue Feb 9 22:46:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
JRUBY_OPTS=
-J-XX:+UseCompressedOops -J-XX:-TieredCompilation -J-Xms6144m -J-Xmx6144m -J-XX:MaxMetaspaceSize=768m -J-XX:ReservedCodeCacheSize=512m -J-XX:InitialCodeCacheSize=256m -J-Djava.net.preferIPv4Stack=true -J-Dsun.rmi.dgc.server.gcInterval=3600000 -J-Djava.awt.headless=true -J-XX:+PerfDisableSharedMem -J-XX:+UseConcMarkSweepGC -J-XX:+CMSParallelRemarkEnabled -J-XX:+CMSClassUnloadingEnabled -J-XX:+UseParNewGC -J-XX:+CMSConcurrentMTEnabled -J-XX:ConcGCThreads=4 -J-XX:NewSize=2000m -J-XX:MaxNewSize=2000m -J-XX:-UseAdaptiveSizePolicy -J-XX:SurvivorRatio=5
Running a rails 4.1 application.
The application starts and works correctly. Over time the metaspace in use always goes up at a constant rate. Eventually performance starts to suffer and full GC seems to start taking place, degrading performance. A heap dump at this point reveals many 1000s of classes defined with a name of
java.lang.invoke.LambdaForms$
the part after the dollar sign isDMH
orBMH
. The rest of the heap is typical.This same application works without issue on 1.7.22.
My understanding is that this class is somehow related to
java.lang.invoke.MethodHandle
, with the classes actually being an anonymous class. Looking at the code, the prevalent usage of this is under theorg.jruby.ir.targets
. This has been extensively refactored since 1.7.22 it appears.Please let me know what additional troubleshooting steps I may perform to try and troubleshoot the cause of this issue. I cannot provide the heap dump but could provide further statistics from it.
The text was updated successfully, but these errors were encountered: