New generation exhausted rapidly #4535

hydrogen18 · 2017-03-17T00:23:53Z

Hello, as I've mentioned before we are trying to get our Ruby on Rails 4.1 app running on JRuby 9000.

When we initially ran on JRuby 9000 we found we had problems with new generation exhaustion. I set the new generation to 3000 megabytes with a survivor ratio of 5. This helped tremendously. Next I noticed promotion failures in the GC log, so I set the initiating occupancy fraction to 70%. This seems to have solved the problem with promotion failures.

The issue now is somehow we go from our new gen humming along nice with the ParNew GC cleaning it out almost entirely, to the ParNew GC running like mad. It still frees memory, but somehow the eden space seems to be exhausted again almost instantly. This does not happen on JRuby 1.7. Here is an example of normal running

2017-03-16T11:50:42.038-0500: 6914.420: [GC (Allocation Failure) 2017-03-16T11:50:42.039-0500: 6914.420: [ParNew: 1588212K->149212K(1755456K), 0.0419707 secs] 3649247K->2215629K(5998912K), 0.0421813 secs] [Times: user=0.31 sys=0.02, real=0.04 secs] 
2017-03-16T11:50:48.640-0500: 6921.022: [GC (Allocation Failure) 2017-03-16T11:50:48.641-0500: 6921.022: [ParNew: 1612124K->132017K(1755456K), 0.0382392 secs] 3678541K->2203583K(5998912K), 0.0384008 secs] [Times: user=0.28 sys=0.02, real=0.04 secs] 
2017-03-16T11:50:51.749-0500: 6924.131: [GC (Allocation Failure) 2017-03-16T11:50:51.749-0500: 6924.131: [ParNew: 1594929K->150922K(1755456K), 0.0403122 secs] 3666495K->2226902K(5998912K), 0.0404873 secs] [Times: user=0.29 sys=0.02, real=0.04 secs] 
2017-03-16T11:50:58.277-0500: 6930.659: [GC (Allocation Failure) 2017-03-16T11:50:58.277-0500: 6930.659: [ParNew: 1613834K->132380K(1755456K), 0.0373647 secs] 3689814K->2212454K(5998912K), 0.0375268 secs] [Times: user=0.26 sys=0.01, real=0.04 secs]

Here we have seconds between allocation failures. The actual pause time is very low as well, so this indicates acceptable performance.

Here is what happens when it goes south:

2017-03-16T13:43:24.292-0500: 13676.674: [GC (Allocation Failure) 2017-03-16T13:43:24.292-0500: 13676.674: [ParNew: 1561346K->137418K(1755456K), 0.0238239 secs] 3973197K->2552171K(5998912K), 0.0239886 secs] [Times: user=0.19 sys=0.01, real=0.02 secs] 
2017-03-16T13:43:24.586-0500: 13676.968: [GC (Allocation Failure) 2017-03-16T13:43:24.587-0500: 13676.968: [ParNew: 1600330K->156392K(1755456K), 0.0258289 secs] 4015083K->2571785K(5998912K), 0.0259732 secs] [Times: user=0.22 sys=0.01, real=0.03 secs] 
2017-03-16T13:43:24.947-0500: 13677.329: [GC (Allocation Failure) 2017-03-16T13:43:24.947-0500: 13677.329: [ParNew: 1619304K->133041K(1755456K), 0.0278718 secs] 4034697K->2555041K(5998912K), 0.0280170 secs] [Times: user=0.21 sys=0.02, real=0.03 secs] 
2017-03-16T13:43:25.790-0500: 13678.172: [GC (Allocation Failure) 2017-03-16T13:43:25.790-0500: 13678.172: [ParNew: 1595953K->133040K(1755456K), 0.0295906 secs] 4017953K->2564096K(5998912K), 0.0297337 secs] [Times: user=0.26 sys=0.02, real=0.03 secs] 
2017-03-16T13:43:26.352-0500: 13678.733: [GC (Allocation Failure) 2017-03-16T13:43:26.352-0500: 13678.733: [ParNew: 1595952K->111168K(1755456K), 0.0257734 secs] 4027008K->2547862K(5998912K), 0.0258915 secs] [Times: user=0.20 sys=0.02, real=0.02 secs]

Here we are seeing allocation failure and ParNew GC run multiple times per second. Performance degrades rapidly. Once it is in this state it never goes back to normal. Interestingly during this time period the old gen doesn't start filling up. So it is apparently able to actually free memory, somehow it is just getting exhausted almost instantly.

This doesn't happen on JRuby 1.7, but I don't know if this is a problem we've created in our app code or not. Is there a way I can profile object allocation while running? I would like to know what classes are being allocated and what sizes of objects are being allocated by JRuby. I can run a custom patched version of JRuby if needed. Any other suggestions for troubleshooting this would be appreciated.

Here is the gc log file: https://gist.github.com/hydrogen18/0ecf8aa94d7d5d73b25ff5a911d3466e

Environment

JRuby 9.1.8.0

JRUBY_OPTS:

-J-XX:+UseCompressedOops -J-XX:-TieredCompilation -J-Xms6144m -J-Xmx6144m -J-XX:MaxMetaspaceSize=512m -J-XX:ReservedCodeCacheSize=512m -J-XX:InitialCodeCacheSize=256m -J-Djava.net.preferIPv4Stack=true -J-Dsun.rmi.dgc.server.gcInterval=3600000 -J-Djava.awt.headless=true -J-XX:+PerfDisableSharedMem -J-XX:+UseConcMarkSweepGC -J-XX:+CMSParallelRemarkEnabled -J-XX:+CMSClassUnloadingEnabled -J-XX:+UseParNewGC -J-XX:+CMSConcurrentMTEnabled -J-XX:ConcGCThreads=4 -Xjit.max=65535 -J-XX:NewSize=2000m -J-XX:MaxNewSize=2000m -J-XX:-UseAdaptiveSizePolicy -J-XX:SurvivorRatio=5 -J-Xloggc:/tmp/gc.log -J-XX:+PrintGCDetails -J-XX:+PrintGCDateStamps -J-XX:+UseGCLogFileRotation -J-XX:NumberOfGCLogFiles=10 -J-XX:GCLogFileSize=128M -J-XX:+ScavengeBeforeFullGC -J-XX:+CMSScavengeBeforeRemark -J-XX:+UseCMSInitiatingOccupancyOnly -J-XX:CMSInitiatingOccupancyFraction=70

OS

Linux community-app4.onr.spiceworks.com 2.6.32-642.3.1.el6.x86_64 #1 SMP Tue Jul 12 18:30:56 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

The text was updated successfully, but these errors were encountered:

headius · 2017-03-17T15:34:07Z

Wow, this is very peculiar!

I would recommend hooking up a sampling profiler. You can do this with IntelliJ, and I believe also with NetBeans or VisualVM. It should not be too invasive but will give an idea of what's causing all this allocation.

One possibility is that you've got some code somewhere that's forcing GC to run, so you might audit for that. Another is that some code somewhere is allocating huge objects, like very large arrays, which are consuming all of eden space and forcing another GC.

Given the size of your new generation, we'd have to be doing a massive amount of allocation for force young GC to run this often.

hydrogen18 · 2017-03-17T17:05:50Z

We do have some logic that can result in the allocation of large arrays of small objects. However, those did not appear to the cause of the issue. We log stuff when those run and I couldn't find any indication of them running.

One possible idea I had was moving the new generation in the opposite direction, and manually setting the tenuring threshold to 1. Making the new generation smaller would result in the ParNew GC being faster. Setting the tenuring threshold down would result in a higher promotion rate into the Old Gen. Once in the old gen the concurrent garbage collectors can deal with stuff.

I've measured garbage generation on both 1.7.22 & 9.1.8.0. Our application generates significantly more garbage on 1.7 in terms of megabytes per second. So in theory we should have much better performance on JRuby 9.1.8.0 if GC was our bottleneck. It really isn't when we run on 1.7.22 however. It's also really not until JRuby 9.1.8.0 goes into whatever crazy state that causes the eden space to exhaust so quickly.

One observation is that I am seeing the class loader get hammered on each request to the Ruby on Rails app. This seems strange, especially since the classes are all org.jruby. Here is a gist of those calls to loadClass

https://gist.github.com/hydrogen18/b73ff5d79669805ed7feca595cd5877c

Also I see lots of JIT activity as our application warms up. Suprisingly some gems go from ruby code -> JRuby IR -> Java Bytecode -> JIT pipeline -> native code just during the startup process. We're also JIT'ing endless variants of java.lang.invoke.LambdaForm, so I am considering blocking those from the JIT engines. The same for java.lang.ThreadLocal.

I used visualvm to profile our memory allocation. It seems to be dominated by what I would expect, []byte, []char, RubyString & ByteList. Which are all really the same thing.

hydrogen18 · 2017-03-17T19:33:21Z

Looking at our logs some more, it is possible that some of our cache population logic does align with the issues we saw. However, what I can't figure out is why this is well behaved on JRuby 1.7 but so problematic on JRuby 9.1.8.0

hydrogen18 · 2017-03-18T10:58:17Z

I was dreaming up some elaborate benchmarking schemes for this, but realized I can just run the code with GC logging enabled in both versions and compare. The code in question takes the an SQL query that returns a single column. The rows get put into a list in Redis so they can be paginated quickly. Obviously when we do this to ~100 rows its a non issue, but at ~1 million rows its a case where neither Rails nor Ruby excels.

In any case this hasn't been problematic in 1.7. So in both JRuby versions I started our app, then called java.lang.system.gc, then flush our cache, then run the cache population logic. I visualized the GC log from doing this with both 1.7.22 & 9.1.8.0. As I previously noted, JRuby 9k seems to have much better performance from a memory standpoint. So I am leaning towards my GC issue being cause by buggy code, not by overaggressive allocation of memory.

I will have to try and reproduce the problem with NetBeans attached. My one question is, is it possible that the JIT engine of JRuby could produce code that has memory performance that is inferior to the interpreted Ruby code? This could explain a sudden performance degradation in our production environment, that I can't measure in my development environment.

jruby_9k_gc_log.txt
jruby_17_gc_log.txt

hydrogen18 · 2017-03-18T10:59:00Z

Visualization result

hydrogen18 · 2017-04-20T19:13:36Z

I have been able to collect some additional interesting observations on this. We have numerous code paths that allocate many objects(100000s) on a request thread. The average request thread completes in 100ms. So typical object lifetime is something like 100ms. There are long lived objects that are part of the framework but I am assuming that those move to the old gen and stay there.

In JRuby 1.7 we sometimes have those requests allocating 100000+ objects. This can cause a spike in the number of GC events and a spike in the GC pause. However, the application always recovers. Sometimes it can take 5 minutes, but it always recovers.

In JRuby 9000 we see the same requests, the spike in GC frequency, and the spike in GC pause time. However, JRuby never really recovers until we restart.

This is what bugs me, I can't explain why JRuby 9k acts differently than 1.7.

lambcr · 2017-09-20T15:01:03Z

@hydrogen18 I am seeing something similar when going from jruby 9.1.5.0 to jruby 9.1.8.0+. Do you see the same issue in jruby 9.1.7.0 or below? Just curious if it is something that was changed recently.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

New generation exhausted rapidly #4535

New generation exhausted rapidly #4535

hydrogen18 commented Mar 17, 2017

headius commented Mar 17, 2017

hydrogen18 commented Mar 17, 2017

hydrogen18 commented Mar 17, 2017

hydrogen18 commented Mar 18, 2017

hydrogen18 commented Mar 18, 2017

hydrogen18 commented Apr 20, 2017

lambcr commented Sep 20, 2017 •

edited

Loading

New generation exhausted rapidly #4535

New generation exhausted rapidly #4535

Comments

hydrogen18 commented Mar 17, 2017

Environment

headius commented Mar 17, 2017

hydrogen18 commented Mar 17, 2017

hydrogen18 commented Mar 17, 2017

hydrogen18 commented Mar 18, 2017

hydrogen18 commented Mar 18, 2017

hydrogen18 commented Apr 20, 2017

lambcr commented Sep 20, 2017 • edited Loading

lambcr commented Sep 20, 2017 •

edited

Loading