Unexpected memory usage when running with native.enabled=true #4651

Closed

ivoanjo opened this issue Jun 7, 2017 · 7 comments

ivoanjo (Contributor) commented Jun 7, 2017

Environment

  • JRuby: jruby 9.1.10.0 (2.3.3) 2017-05-25 fffffff OpenJDK 64-Bit Server VM 25.121-b13 on 1.8.0_121-cedar14-b13 +indy +jit [linux-x86_64]
  • OS: Linux d85df954-171a-4c14-ae8f-60cfaf0e9aec 3.13.0-112-generic #159-Ubuntu SMP Fri Mar 3 15:26:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Expected behavior

I'm trying to configure one of Talkdesk's internal web services that uses JRuby to fit on a 1GB Heroku 2x dyno.

My current JAVA_OPTS are:

  • -Xss300k -Xms450m -Xmx450m -XX:+PrintCodeCache -XX:+CrashOnOutOfMemoryError -XX:ReservedCodeCacheSize=200m -XX:InitialCodeCacheSize=200m -XX:+PrintGCDetails -XX:NativeMemoryTracking=detail -XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics -XX:MaxMetaspaceSize=120m -XX:CompressedClassSpaceSize=32m

and JRUBY_OPTS:

  • -Xcompile.invokedynamic=true -Xmanagement.enabled=true

My expectation is that this would keep total memory usage at or below roughly 1 GB of RAM.
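
Since -XX:NativeMemoryTracking=detail is already enabled above, one way to read its counters over time is via jcmd; a minimal sketch (the process lookup is an assumption about how the JVM shows up on the dyno):

    # Assumes a single JVM process on the dyno; adjust the lookup if needed
    PID=$(pgrep -f java | head -n1)

    # Record a baseline of the Native Memory Tracking counters...
    jcmd "$PID" VM.native_memory baseline

    # ...then, once the load test has been running for a while, print the delta
    jcmd "$PID" VM.native_memory summary.diff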

Actual behavior

[Screenshot: system memory usage graph, 2017-06-07 17:49]

The v253 label indicates a reboot of the system. A few minutes after a load test starts, system memory usage jumps abnormally, hitting more than 2 GB of RAM. After a few minutes it lowers to around 1.5 GB and stays there.

After trying to check everything I could think of (malloc_arena_max, heap dumps, native memory tracking, buffer pools, etc.) I ended up trying -Xnative.enabled=false, and the difference is noticeable. In the graph below I toggle between the two settings: it starts disabled and stable at 1 GB, then I enable it and memory jumps up, then I disable it for a while and it goes down, and then I enable it again and it jumps up once more.

[Screenshot: memory usage while toggling native.enabled, 2017-06-07 17:57]
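
For reference, the malloc_arena_max experiment above was just capping glibc's per-thread malloc arenas via an environment variable, roughly as follows (the app name is a placeholder):

    # Limit the number of glibc malloc arenas; 2 is a common starting value for this experiment
    heroku config:set MALLOC_ARENA_MAX=2 --app my-jruby-service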

The app itself is a rather simple web app using puma/grape/mongoid/mongo and it seems to work fine with either setting for native.enabled.

I've been trying to debug and create a simpler test case for this but I'm coming up short:

  • I've tried reproducing this on my local machine and the JVM RSS memory usage ends up at about 1.2gb, but never much more than that
  • I've tried the latest Oracle JVM on heroku, same behavior
  • I've tried with JRuby master (which includes the fix to FFI memory leak regression #4625) on heroku, same behavior
  • I've tried jemalloc (roughly as sketched below) but didn't get a lot out of it (may also be my noobness in using it)
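
For completeness, the jemalloc attempt looked roughly like this; the library path is from a local Ubuntu install and the puma invocation is just a placeholder for however the app boots:

    # Preload jemalloc under the JVM and have it print allocator stats on exit
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1   # path may differ per distro/package
    export MALLOC_CONF=stats_print:true
    bundle exec puma -C config/puma.rb                             # placeholder for the actual start command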

Can you help me debug this further?

headius (Member) commented Jun 9, 2017

So you can't reproduce this anywhere but on Heroku?

If you don't see a difference on the same app locally, then I would suspect there's something peculiar about the environment there and native library support.

I might expect that native=true would use some additional memory, since it will have a larger pool of native resources waiting to be collected, but this seems like too much.

Next step would be to acquire some information from the larger instance, like a full heap dump and OS-level memory stats, so we can see if there's something visible in the heap taking up this additional memory. I'm not sure how to do that on Heroku, so pinging @hone and @jkutner.

ivoanjo (Contributor) commented Jun 10, 2017

> So you can't reproduce this anywhere but on Heroku?

When I try it on my local (Linux) machine the results are a bit more inconclusive. I usually see 100-200 MB more when using native.enabled=true, but I never see the temporary jump to 2 GB, nor the steady ~1.6 GB I get on Heroku.

> Next step would be to acquire some information from the larger instance, like a full heap dump and OS-level memory stats, so we can see if there's something visible in the heap taking up this additional memory. I'm not sure how to do that on Heroku, so pinging @hone and @jkutner.

I had previously opened a ticket with Heroku and @jkutner was helping out on that side, but we also didn't reach any conclusions, so he suggested I report what I saw here.

Since I can reproduce this at will on Heroku with a test workload in an isolated environment, I can share any dumps needed to help pin this issue down.

What is the best way to get a relevant heap dump for this issue? I'm assuming one taken with java debugging tools such as VisualVM will probably not be enough as this really does not seem to be related to the regular Java heap.

Similarly, which OS stats would be helpful? A regular snapshot of smaps from proc? Something else? (It's trivial to get these now that heroku allows ssh'ing into a dyno).
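
To make that concrete, getting a resident-memory snapshot from inside the dyno could look roughly like this (the process lookup is an assumption):

    # Inside a `heroku ps:exec` session on the dyno: sum resident memory from smaps
    PID=$(pidof java)   # assumes a single java process
    awk '/^Rss:/ { sum += $2 } END { printf "total RSS: %d kB\n", sum }' /proc/$PID/smaps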


As an extra datapoint, I actually believe this issue may have existed for a while now.

At Talkdesk we've had a JRuby app in production for about 9 months, and I had already attempted to battle JVM memory usage on our staging and QA environments, where memory seemed to randomly jump up and never come down, even though we were using every JVM tunable I knew of to tightly control memory.

But this happened more rarely, and since we run in production with larger dynos we were always able to make the JVM fit, and thus we never spent a lot of time looking into it.

For this new system the plan is to use several smaller dynos in production (2x dynos with 1gb of ram) and thus the issue became relevant again.

kares (Member) commented Jun 12, 2017

@ivoanjo I would generate dumps (on the dyno) with and without native.enabled, especially at peak usage, and compare them ...
assuming the difference shows up in heap memory usage - which is not clear, maybe it's all native memory.
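
For example (pid and file path are placeholders), a live heap dump taken at peak with each setting, then compare the two:

    # one dump with native.enabled=true, one with native.enabled=false
    jmap -dump:live,format=b,file=/tmp/heap-native-on.hprof <pid>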

ivoanjo (Contributor) commented Jun 13, 2017

Sounds good, I'll try to see if there's something unexpected on the heap. I'll be off for a few days, but I'll get back with the info so we can get to the bottom of this asap :)

ivoanjo (Contributor) commented Jul 7, 2017

Thanks @kares and @headius for your input. I've picked this investigation up again this week, and I made some breakthroughs.

As described above, I'm testing a puma/grape/mongoid/mongo web application with a load tester, fully loading JRuby and the test machine.

First, I finally was able to reproduce this on my machine, so it's not something heroku-specific.

Furthermore, this is clearly a heisenbug, and I think that's what threw me onto the "it's the native code!" path. I have now been able to observe it with the native code disabled; it only became rarer, so I think we can scratch that one off.

After being able to reproduce the issue locally, I tried to generate dumps and look at all the JVM memory segments I knew of, but no dice: the extra memory does not show up anywhere other than in the RSS as measured by my OS (Linux).

Then, I'm not sure how I made the jump, but I tried disabling invokedynamic. And boom, I was no longer able to reproduce it.

I wondered if getting extra pressure on invokedynamic would make it trigger more often, and so I tried running with -Xcompile.invokedynamic=true -Xcompile.mode=FORCE and indeed this makes it happen every time.
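
Concretely, the repro setup is roughly this (the puma line is just a placeholder for however the app is normally started); once the load test starts, the RSS jump shows up every time:

    # Force JIT compilation plus invokedynamic to maximize MethodHandle churn
    export JRUBY_OPTS="-Xcompile.invokedynamic=true -Xcompile.mode=FORCE"
    bundle exec puma -C config/puma.rb   # placeholder start command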

So my working theory is: perhaps this is an invokedynamic/MethodHandle race/bug/leak? I searched around and such issues do seem to happen from time to time. For instance, the current stable Oracle JDK release still suffers from http://bugs.java.com/view_bug.do?bug_id=8162795. I've tried the latest Java 8 beta (1.8.0_152-ea-b05), which no longer reproduces the testcase in that bug, but my issue still remains, so it's not that one (that would've been too easy, right?).

So the good news is that disabling invokedynamic, while a bummer, is an acceptable workaround for the affected apps at Talkdesk.

The bad news is that I wasn't very successful in trimming this down to a nice small testcase, and the evidence seems to suggest it only shows up visibly in a codebase of moderate size.

Any suggestions on how we can further debug this? 😄

ivoanjo (Contributor) commented Sep 28, 2017

I'm closing this issue for now: I've left Talkdesk, so I won't really be able to debug it or try any more steps, and I'd rather not leave it rotting on the bug tracker :)

Hopefully, with increased use of invokedynamic, either this gets fixed (perhaps it already is in JDK 9; I did not get a chance to test), or eventually someone bumps into it again in a more easily debuggable state.

ivoanjo closed this as completed Sep 28, 2017
headius (Member) commented Nov 8, 2017

@ivoanjo Thanks for closing the loop here. We've improved indy support, including some memory use and potential leaks, since you originally reported this. Hopefully we managed to find the right fixes, or this has healed on its own.

headius added this to the Invalid or Duplicate milestone Nov 8, 2017