Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in jnr-ffi finalizer #4506

Open
headius opened this issue Feb 24, 2017 · 19 comments
Open

Segfault in jnr-ffi finalizer #4506

headius opened this issue Feb 24, 2017 · 19 comments

Comments

@headius
Copy link
Member

headius commented Feb 24, 2017

This is copied from #4312 where @shirosaki discovered jnr-ffi crashes in the finalizer for its generated assembly code stubs. Repro and output follows.

import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class TestReload {
    public static void main(String[] args) {
        while (true) {
            Thread t = new Thread() {
                public void run() {
                    try (URLClassLoader loader = new URLClassLoader(new URL[] { 
                                new URL("jar:file:./jruby-complete.jar!/")
                            })) {
                        // load class
                        Class<?> container = loader.loadClass("org.jruby.embed.ScriptingContainer");
                        Method runScriptlet = container.getMethod("runScriptlet", String.class);
                        Method setClassLoader = container.getMethod("setClassLoader", ClassLoader.class);
                        Method terminate = container.getMethod("terminate");
                        Object c = container.newInstance();
                        setClassLoader.invoke(c, loader);
                        runScriptlet.invoke(c, "puts \"Hello World!\"");
                        terminate.invoke(c);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    // unload class
                    System.gc();
                    System.runFinalization();
                    System.gc();
                }
            };
            t.start();
            try {
                t.join();
            } catch (Exception e) {
		e.printStackTrace();
            }
        }
    }
}
$ java -XX:+TraceClassUnloading TestReload
(snip)
[thread 9492 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000006e4c24b0, pid=11608, tid=0x00000000000031e4
#
# JRE version: Java(TM) SE Runtime Environment (8.0_111-b14) (build 1.8.0_111-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.111-b14 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C  0x000000006e4c24b0
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
(snip)

hs_err_pid11608.log contains:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000006e4c24b0, pid=11608, tid=0x00000000000031e4
#
# JRE version: Java(TM) SE Runtime Environment (8.0_111-b14) (build 1.8.0_111-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.111-b14 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C  0x000000006e4c24b0
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00000000213f6800):  JavaThread "Secondary finalizer" [_thread_in_native, id=12772, stack(0x00000000287c0000,0x00000000288c0000)]

siginfo: ExceptionCode=0xc0000005, ExceptionInformation=0x0000000000000008 0x000000006e4c24b0

Registers:
RAX=0x000000006e4c24b0, RBX=0x00000000209448f0, RCX=0x00000000213f69f8, RDX=0x00000000288befc0
RSP=0x00000000288bef08, RBP=0x00000000288bef90, RSI=0x00000000822c06f8, RDI=0x000000010035aca8
R8 =0x000000000111d470, R9 =0x00000000000015f1, R10=0x0000000002eb7f48, R11=0x000000005c86b580
R12=0x0000000000000000, R13=0x00000000209448f0, R14=0x00000000288befc0, R15=0x00000000213f6800
RIP=0x000000006e4c24b0, EFLAGS=0x0000000000010206

Top of Stack: (sp=0x00000000288bef08)
0x00000000288bef08:   0000000002eb7f74 00000000213f6800
0x00000000288bef18:   0000000000000000 00000000ffffffff
0x00000000288bef28:   0000000000000001 0000000000000000
0x00000000288bef38:   0000000000000000 0000000000000000
0x00000000288bef48:   000000001b36fed0 00000000288bef50
0x00000000288bef58:   0000000000000000 00000000288befc0
0x00000000288bef68:   000000002094ba90 0000000000000000
0x00000000288bef78:   00000000209448f0 0000000000000000
0x00000000288bef88:   00000000288befb0 00000000288bf008
0x00000000288bef98:   0000000002ea835d 0000000000000000
0x00000000288befa8:   0000000002eb19d8 000000000111d470
0x00000000288befb8:   000000000111d470 0000000081d6a758
0x00000000288befc8:   00000000288befc8 00000000229df541
0x00000000288befd8:   00000000288bf028 0000000022a0e208
0x00000000288befe8:   0000000000000000 00000000229df5e0
0x00000000288beff8:   00000000288befb0 00000000288bf030 

Instructions: (pc=0x000000006e4c24b0)
0x000000006e4c2490:   
[error occurred during error reporting (printing registers, top of stack, instructions near pc), id 0xc0000005]

Register to memory mapping:

RAX=0x000000006e4c24b0 is an unknown value
RBX={method} {0x00000000209448f8} 'freeCallContext' '(J)V' in 'com/kenai/jffi/Foreign'
RCX=0x00000000213f69f8 is an unknown value
RDX=0x00000000288befc0 is pointing into the stack for thread: 0x00000000213f6800
RSP=0x00000000288bef08 is pointing into the stack for thread: 0x00000000213f6800
RBP=0x00000000288bef90 is pointing into the stack for thread: 0x00000000213f6800
RSI=0x00000000822c06f8 is an oop
com.kenai.jffi.CallContext 
 - klass: 'com/kenai/jffi/CallContext'
RDI=0x000000010035aca8 is pointing into metadata
R8 =0x000000000111d470 is an unknown value
R9 =0x00000000000015f1 is an unknown value
R10=0x0000000002eb7f48 is at code_begin+840 in an Interpreter codelet
method entry point (kind = native)  [0x0000000002eb7c00, 0x0000000002eb8560]  2400 bytes
R11=0x000000005c86b580 is an unknown value
R12=0x0000000000000000 is an unknown value
R13={method} {0x00000000209448f8} 'freeCallContext' '(J)V' in 'com/kenai/jffi/Foreign'
R14=0x00000000288befc0 is pointing into the stack for thread: 0x00000000213f6800
R15=0x00000000213f6800 is a thread


Stack: [0x00000000287c0000,0x00000000288c0000],  sp=0x00000000288bef08,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x000000006e4c24b0

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.kenai.jffi.Foreign.freeCallContext(J)V+0
j  com.kenai.jffi.CallContext.finalize()V+17
J 4759 C1 java.lang.ref.Finalizer.runFinalizer(Lsun/misc/JavaLangAccess;)V (62 bytes) @ 0x000000000333d88c [0x000000000333d300+0x58c]
J 4758 C1 java.lang.ref.Finalizer.access$100(Ljava/lang/ref/Finalizer;Lsun/misc/JavaLangAccess;)V (6 bytes) @ 0x00000000035af55c [0x00000000035af500+0x5c]
j  java.lang.ref.Finalizer$2.run()V+36
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub
@headius
Copy link
Member Author

headius commented Feb 25, 2017

I have not found a solution for this, but I have a hunch. I believe the jffi library is getting unloaded by classloader finalization before all the memory allocated by it has been cleaned up.

Evidence based on my local dump:

  1. SIGSEGV says that the si_addr, the address that caused the fault, is 0x00007ff1ea1659b0.
  2. The native stack says the top frame, which should be the native half of freeCallContext, is also 0x00007ff1ea1659b0.
  3. RAX is pointing at 0x00007ff1ea1659b0, most likely indicating that was the address it attempted to branch to.

I suspect that the jffi library is getting unloaded by the classloader before the finalizers run to clear up data it has allocated, like the CallContext in my case and potentially any number of other memory addresses since finalization can happen in any order.

I've seen a couple different Java frames before the faulting frame:

  • Foreign.freeCallContext, which has a custom function in jffi.
  • PageManager$Unix.freePages, which has a custom function in jffi.

And we're operating under a situation where jffi could easily be getting unloaded before all finalizers have run, since there's no way to guarantee finalization ordering.

I've pinged a JVM friend to see if I'm on the right track with this. If I am, we have a bit of a conundrum:

  1. How can we make sure we've freed all custom data structures allocated by jffi before jffi's native library unloads?
  2. Is this even possible?

I'm researching. Any thoughts you might have are welcome.

@headius
Copy link
Member Author

headius commented Feb 25, 2017

My JVM friend says my theory is "highly plausible".

And I think I have proof now. I modified your example to extend URLClassLoader and log when it gets finalized, along with logging when these jffi methods attempt to free resources. The last three lines before the segfault:

...
classloader finalized 2014779152
freeing in 2014779152
freeing in 2014779152

So it does indeed seem like the jffi structures are attempting to free, via JNI functions, after those JNI functions have been unloaded.

@headius
Copy link
Member Author

headius commented Feb 25, 2017

Ok, so I think there is some silver lining here. Your particular example is an extreme case, in which JRuby and all its dependencies are getting unloaded, including jffi. If you were to have JRuby and friends in a higher-level classloader, so they don't unload, this would not be a problem.

Typically if people are embedding JRuby, they don't also isolated it to its own classloader. That's the key issue here; by allowing JRuby to completely unload between runs, you are also forcing the JNI library to get unloaded, while there are still resources from jffi remaining to be finalized.

The latter problem is a tricky one to solve, since there are very few hooks we can use to know that the JNI library has been unloaded. Even JNI_onUnload is problematic since we'd need to track all those native resources in the C library, and we'd need to do a release of jffi that rebuilds all our jffi backends across the dozen+ platforms we support.

This is still a serious problem, however, since a WAR-file deployment should also be unloading jffi and causing the native library to unmap, as in your example (which is the best reproduction of this complex problem I've seen, so kudos to you).

I'm going to have to pow-wow with @enebo and weigh some options.

If feasible, a solid workaround for you would be to disable native library support in JRuby via JVM property jruby.native.enabled=false, which will avoid loading (or at least using) the native backend of jffi, avoiding the chicken/egg problem we have here.

Actually solving the problem is going to require some creativity.

headius added a commit to jnr/jnr-ffi that referenced this issue Feb 28, 2017
@headius headius modified the milestones: JRuby 9.2.0.0, JRuby 9.1.8.0 Mar 1, 2017
@headius
Copy link
Member Author

headius commented Mar 1, 2017

This is not new and not easy to fix. It won't be in 9.1.8.0.

@zhanhb
Copy link

zhanhb commented Dec 12, 2017

According to the JLS, the Java language defines no order of the execution of finalize methods, so the actual order is an implementation detail.
JLS 12.6. Finalization of Class Instances:

@headius
Copy link
Member Author

headius commented Mar 1, 2018

@zhanhb Yes...unfortunately I have found no way to work around this problem.

@zhanhb
Copy link

zhanhb commented Mar 1, 2018

Register the finalize to native, when the native lib get finalized in method JNI_OnUnload, free all native memory.
Maybe it's a solution.
Limitation: all native memory need to be registered to the finalized method. Other ways to release native memory in method finalize are not permitted.

@headius
Copy link
Member Author

headius commented Mar 19, 2018

@zhanhb In order to do that we'd need to track in the jffi native code all memory allocated via any function so that it can be freed during unload rather than in a finalizer. The root issue here is that we use some native functions to free allocated memory, we call those functions in a finalizer, and the finalizer might run after the JNI library has already been unloaded. This leads us to try to call a JNI function that's no longer there.

@headius
Copy link
Member Author

headius commented Mar 26, 2018

@zhanhb Very nice! Perhaps we can merge our efforts? We'd be happy to cooperate with you.

Meanwhile I will look at utilizing your change in jnr-ffi.

@headius
Copy link
Member Author

headius commented Mar 26, 2018

What you have seems feasible; however there's a few improvements I might make. I do hope we can work together to bring your changes into jnr-ffi!

  1. The finalized field on the Free object needs to be volatile and checked in an atomic way.
  2. The runnables should not be strongly referenced; your current logic will prevent them from garbage collecting and the collection will continue to grow.
  3. This additional logic should be possible to disable, for single-classloader applications that do not need the extra care around native memory and JNI unload.

Otherwise, I thank you for coming up with an elegant solution! Please let me know how we can work together to improve jnr-ffi!

@headius
Copy link
Member Author

headius commented Mar 26, 2018

The finalized field on the Free object needs to be volatile and checked in an atomic way.

This likely applies to any other Runnable de-allocators you are registering.

@enebo enebo modified the milestones: JRuby 9.2.0.0, JRuby 9.2.1.0 May 24, 2018
@blindpirate
Copy link

We've also met this issue several times per week. Here's one of the heap dump:

Stack: [0x00007f081a4de000,0x00007f081a5df000],  sp=0x00007f081a5dc990,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [ld-linux-x86-64.so.2+0x9ef4]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 18780  java.lang.ClassLoader$NativeLibrary.find(Ljava/lang/String;)J (0 bytes) @ 0x00007f0822970d4c [0x00007f0822970d00+0x4c]
j  java.lang.ClassLoader.findNative(Ljava/lang/ClassLoader;Ljava/lang/String;)J+49
v  ~StubRoutines::call_stub
j  com.kenai.jffi.Foreign.freeMemory(J)V+0
j  com.kenai.jffi.MemoryIO.freeMemory(J)V+1
j  com.kenai.jffi.NativeMethods$ResourceHolder.finalize()V+8
J 24176% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f0822161a38 [0x00007f0822161780+0x2b8]
v  ~StubRoutines::call_stub

Is it possible to run jruby in a new process? Seems jruby.launch.inproc=false is ignored now.

@Stewori
Copy link

Stewori commented Aug 11, 2018

It looks like this issue popped up in the Jython tracker as well: http://bugs.jython.org/issue2701
Is there any progress on this?

@zhanhb I wanted to look at your links posted in #4506 (comment) but they are broken (404).

I was able to kind of repair them, at least to find the files. If someone else wants to have a look, here are the restored links:
https://github.com/java-native-call/jnc-build/blob/master/jnc-runtime/src/main/java/jnc/foreign/internal/AllocatedMemory.java
https://github.com/java-native-call/jnc-build/blob/master/jnc-runtime/src/main/java/jnc/foreign/internal/NativeMethods.java

@zhanhb
Copy link

zhanhb commented Aug 13, 2018

Link updated, it's ok now.

@headius headius modified the milestones: JRuby 9.2.1.0, JRuby 9.3.0.0 Oct 26, 2018
@headius
Copy link
Member Author

headius commented Oct 26, 2018

We need help here, like a PR for jffi or jnr-ffi :-( Going to have to bump.

@zhanhb
Copy link

zhanhb commented Jun 2, 2019

There's no issue to keep strong reference with the cleaner with. I rewrote the onUnload methods these days, maybe these code can be updated to jffi project.

@mike-bourgeous
Copy link

mike-bourgeous commented Sep 13, 2019

I'm having a possibly similar issue in a JRuby project running on Java 11. After upgrading to Java 11 I am getting SIGSEGV in several different native I/O routines during JRuby startup ~90% of the time, so it takes 10x+ as long to deploy the app.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f887098e9b8, pid=8753, tid=8759
#
# JRE version: OpenJDK Runtime Environment (11.0.4+11) (build 11.0.4+11-LTS)
# Java VM: OpenJDK 64-Bit Server VM (11.0.4+11-LTS, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64
# Problematic frame:
# C  0x00007f887098e9b8

Here are some of the locations of the crash:

  • fcntl
    C  0x00007f887098e9b8
    j  jnr.posix.BaseNativePOSIX.fcntl(ILjnr/constants/platform/Fcntl;)I+9
    j  jnr.posix.CheckedPOSIX.fcntl(ILjnr/constants/platform/Fcntl;)I+6
    j  jnr.posix.LazyPOSIX.fcntl(ILjnr/constants/platform/Fcntl;)I+6
    j  org.jruby.util.io.PosixShim.setCloexec(IZ)I+8
    j  org.jruby.util.io.PosixShim.pipe()[Ljava/nio/channels/Channel;+5
    j  org.jruby.util.io.PopenExecutor.pipeOpen(Lorg/jruby/runtime/Thre
    j  org.jruby.util.io.PopenExecutor.pipeOpen(Lorg/jruby/runtime/Thre
    j  org.jruby.RubyKernel.backquote(Lorg/jruby/runtime/ThreadContext;
    j  org.jruby.RubyKernel$INVOKER$s$1$0$backquote.call(Lorg/jruby/run
    
  • close
    C  0x00007fc0cba990b8
    j  jnr.enxio.channels.Native.close(I)I+4
    j  jnr.enxio.channels.NativeDeviceChannel.implCloseSelectableChannel()V+4
    j  java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel()V+1 java.base@11.0.4
    j  java.nio.channels.spi.AbstractInterruptibleChannel.close()V+23 java.base@11.0.4
    j  org.jruby.util.io.ChannelFD.finish()V+61
    j  org.jruby.util.io.ChannelFD.close()V+1
    j  org.jruby.util.io.PosixShim.close(Ljava/io/Closeable;)I+5
    j  org.jruby.util.io.PosixShim.close(Lorg/jruby/util/io/ChannelFD;)I+2
    j  org.jruby.util.io.OpenFile.finalizeFlush(Lorg/jruby/runtime/ThreadContext;Z)V+282
    

In each case it kind of looks like libc was loaded at the address in question, but somehow unloaded before the crashing method was called.

I tried putting the latest versions of jnr-ffi, jffi, jnr-posix, and jnr-enxio in the classpath with no improvement.

I opened this issue on jnr-ffi: jnr/jnr-ffi#194

Some crash logs are here: https://gist.github.com/mike-bourgeous/c24c225d86eb26db629e8fb09f57d6a2

Workaround

My final workaround was passing -Djruby.native.enabled=false to the JVM.

Some oddities:

  • The SIGSEGV only happens in our Kubernetes clusters, not on a Minikube node running the same exact container. There may be memory size differences between the two, so memory pressure might be a factor.

@headius
Copy link
Member Author

headius commented Feb 12, 2020

@mike-bourgeous I suspect you are right in thinking your issue is linked to this one. Unfortunately I'm not sure we have a preferred solution for this yet. The problem is core to how the JVM does classloading and native libraries.

I believe another workaround is to make sure jffi is in a higher-up classloader, so it doesn't go away and unload the library...but I have not confirmed that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants