Segfault in jnr-ffi finalizer #4506

headius · 2017-02-24T16:03:25Z

This is copied from #4312 where @shirosaki discovered jnr-ffi crashes in the finalizer for its generated assembly code stubs. Repro and output follows.

import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class TestReload {
    public static void main(String[] args) {
        while (true) {
            Thread t = new Thread() {
                public void run() {
                    try (URLClassLoader loader = new URLClassLoader(new URL[] { 
                                new URL("jar:file:./jruby-complete.jar!/")
                            })) {
                        // load class
                        Class<?> container = loader.loadClass("org.jruby.embed.ScriptingContainer");
                        Method runScriptlet = container.getMethod("runScriptlet", String.class);
                        Method setClassLoader = container.getMethod("setClassLoader", ClassLoader.class);
                        Method terminate = container.getMethod("terminate");
                        Object c = container.newInstance();
                        setClassLoader.invoke(c, loader);
                        runScriptlet.invoke(c, "puts \"Hello World!\"");
                        terminate.invoke(c);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    // unload class
                    System.gc();
                    System.runFinalization();
                    System.gc();
                }
            };
            t.start();
            try {
                t.join();
            } catch (Exception e) {
		e.printStackTrace();
            }
        }
    }
}

$ java -XX:+TraceClassUnloading TestReload
(snip)
[thread 9492 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000006e4c24b0, pid=11608, tid=0x00000000000031e4
#
# JRE version: Java(TM) SE Runtime Environment (8.0_111-b14) (build 1.8.0_111-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.111-b14 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C  0x000000006e4c24b0
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
(snip)

hs_err_pid11608.log contains:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000006e4c24b0, pid=11608, tid=0x00000000000031e4
#
# JRE version: Java(TM) SE Runtime Environment (8.0_111-b14) (build 1.8.0_111-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.111-b14 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C  0x000000006e4c24b0
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00000000213f6800):  JavaThread "Secondary finalizer" [_thread_in_native, id=12772, stack(0x00000000287c0000,0x00000000288c0000)]

siginfo: ExceptionCode=0xc0000005, ExceptionInformation=0x0000000000000008 0x000000006e4c24b0

Registers:
RAX=0x000000006e4c24b0, RBX=0x00000000209448f0, RCX=0x00000000213f69f8, RDX=0x00000000288befc0
RSP=0x00000000288bef08, RBP=0x00000000288bef90, RSI=0x00000000822c06f8, RDI=0x000000010035aca8
R8 =0x000000000111d470, R9 =0x00000000000015f1, R10=0x0000000002eb7f48, R11=0x000000005c86b580
R12=0x0000000000000000, R13=0x00000000209448f0, R14=0x00000000288befc0, R15=0x00000000213f6800
RIP=0x000000006e4c24b0, EFLAGS=0x0000000000010206

Top of Stack: (sp=0x00000000288bef08)
0x00000000288bef08:   0000000002eb7f74 00000000213f6800
0x00000000288bef18:   0000000000000000 00000000ffffffff
0x00000000288bef28:   0000000000000001 0000000000000000
0x00000000288bef38:   0000000000000000 0000000000000000
0x00000000288bef48:   000000001b36fed0 00000000288bef50
0x00000000288bef58:   0000000000000000 00000000288befc0
0x00000000288bef68:   000000002094ba90 0000000000000000
0x00000000288bef78:   00000000209448f0 0000000000000000
0x00000000288bef88:   00000000288befb0 00000000288bf008
0x00000000288bef98:   0000000002ea835d 0000000000000000
0x00000000288befa8:   0000000002eb19d8 000000000111d470
0x00000000288befb8:   000000000111d470 0000000081d6a758
0x00000000288befc8:   00000000288befc8 00000000229df541
0x00000000288befd8:   00000000288bf028 0000000022a0e208
0x00000000288befe8:   0000000000000000 00000000229df5e0
0x00000000288beff8:   00000000288befb0 00000000288bf030 

Instructions: (pc=0x000000006e4c24b0)
0x000000006e4c2490:   
[error occurred during error reporting (printing registers, top of stack, instructions near pc), id 0xc0000005]

Register to memory mapping:

RAX=0x000000006e4c24b0 is an unknown value
RBX={method} {0x00000000209448f8} 'freeCallContext' '(J)V' in 'com/kenai/jffi/Foreign'
RCX=0x00000000213f69f8 is an unknown value
RDX=0x00000000288befc0 is pointing into the stack for thread: 0x00000000213f6800
RSP=0x00000000288bef08 is pointing into the stack for thread: 0x00000000213f6800
RBP=0x00000000288bef90 is pointing into the stack for thread: 0x00000000213f6800
RSI=0x00000000822c06f8 is an oop
com.kenai.jffi.CallContext 
 - klass: 'com/kenai/jffi/CallContext'
RDI=0x000000010035aca8 is pointing into metadata
R8 =0x000000000111d470 is an unknown value
R9 =0x00000000000015f1 is an unknown value
R10=0x0000000002eb7f48 is at code_begin+840 in an Interpreter codelet
method entry point (kind = native)  [0x0000000002eb7c00, 0x0000000002eb8560]  2400 bytes
R11=0x000000005c86b580 is an unknown value
R12=0x0000000000000000 is an unknown value
R13={method} {0x00000000209448f8} 'freeCallContext' '(J)V' in 'com/kenai/jffi/Foreign'
R14=0x00000000288befc0 is pointing into the stack for thread: 0x00000000213f6800
R15=0x00000000213f6800 is a thread


Stack: [0x00000000287c0000,0x00000000288c0000],  sp=0x00000000288bef08,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x000000006e4c24b0

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.kenai.jffi.Foreign.freeCallContext(J)V+0
j  com.kenai.jffi.CallContext.finalize()V+17
J 4759 C1 java.lang.ref.Finalizer.runFinalizer(Lsun/misc/JavaLangAccess;)V (62 bytes) @ 0x000000000333d88c [0x000000000333d300+0x58c]
J 4758 C1 java.lang.ref.Finalizer.access$100(Ljava/lang/ref/Finalizer;Lsun/misc/JavaLangAccess;)V (6 bytes) @ 0x00000000035af55c [0x00000000035af500+0x5c]
j  java.lang.ref.Finalizer$2.run()V+36
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub

The text was updated successfully, but these errors were encountered:

headius · 2017-02-25T06:21:33Z

I have not found a solution for this, but I have a hunch. I believe the jffi library is getting unloaded by classloader finalization before all the memory allocated by it has been cleaned up.

Evidence based on my local dump:

SIGSEGV says that the si_addr, the address that caused the fault, is 0x00007ff1ea1659b0.
The native stack says the top frame, which should be the native half of freeCallContext, is also 0x00007ff1ea1659b0.
RAX is pointing at 0x00007ff1ea1659b0, most likely indicating that was the address it attempted to branch to.

I suspect that the jffi library is getting unloaded by the classloader before the finalizers run to clear up data it has allocated, like the CallContext in my case and potentially any number of other memory addresses since finalization can happen in any order.

I've seen a couple different Java frames before the faulting frame:

Foreign.freeCallContext, which has a custom function in jffi.
PageManager$Unix.freePages, which has a custom function in jffi.

And we're operating under a situation where jffi could easily be getting unloaded before all finalizers have run, since there's no way to guarantee finalization ordering.

I've pinged a JVM friend to see if I'm on the right track with this. If I am, we have a bit of a conundrum:

How can we make sure we've freed all custom data structures allocated by jffi before jffi's native library unloads?
Is this even possible?

I'm researching. Any thoughts you might have are welcome.

headius · 2017-02-25T06:33:17Z

My JVM friend says my theory is "highly plausible".

And I think I have proof now. I modified your example to extend URLClassLoader and log when it gets finalized, along with logging when these jffi methods attempt to free resources. The last three lines before the segfault:

...
classloader finalized 2014779152
freeing in 2014779152
freeing in 2014779152

So it does indeed seem like the jffi structures are attempting to free, via JNI functions, after those JNI functions have been unloaded.

headius · 2017-02-25T07:08:47Z

Ok, so I think there is some silver lining here. Your particular example is an extreme case, in which JRuby and all its dependencies are getting unloaded, including jffi. If you were to have JRuby and friends in a higher-level classloader, so they don't unload, this would not be a problem.

Typically if people are embedding JRuby, they don't also isolated it to its own classloader. That's the key issue here; by allowing JRuby to completely unload between runs, you are also forcing the JNI library to get unloaded, while there are still resources from jffi remaining to be finalized.

The latter problem is a tricky one to solve, since there are very few hooks we can use to know that the JNI library has been unloaded. Even JNI_onUnload is problematic since we'd need to track all those native resources in the C library, and we'd need to do a release of jffi that rebuilds all our jffi backends across the dozen+ platforms we support.

This is still a serious problem, however, since a WAR-file deployment should also be unloading jffi and causing the native library to unmap, as in your example (which is the best reproduction of this complex problem I've seen, so kudos to you).

I'm going to have to pow-wow with @enebo and weigh some options.

If feasible, a solid workaround for you would be to disable native library support in JRuby via JVM property jruby.native.enabled=false, which will avoid loading (or at least using) the native backend of jffi, avoiding the chicken/egg problem we have here.

Actually solving the problem is going to require some creativity.

See jruby/jruby#4506

headius · 2017-03-01T18:43:52Z

This is not new and not easy to fix. It won't be in 9.1.8.0.

zhanhb · 2017-12-12T02:23:32Z

According to the JLS, the Java language defines no order of the execution of finalize methods, so the actual order is an implementation detail.
JLS 12.6. Finalization of Class Instances:

headius · 2018-03-01T00:03:58Z

@zhanhb Yes...unfortunately I have found no way to work around this problem.

zhanhb · 2018-03-01T09:56:02Z

Register the finalize to native, when the native lib get finalized in method JNI_OnUnload, free all native memory.
Maybe it's a solution.
Limitation: all native memory need to be registered to the finalized method. Other ways to release native memory in method finalize are not permitted.

headius · 2018-03-19T21:58:07Z

@zhanhb In order to do that we'd need to track in the jffi native code all memory allocated via any function so that it can be freed during unload rather than in a finalizer. The root issue here is that we use some native functions to free allocated memory, we call those functions in a finalizer, and the finalizer might run after the JNI library has already been unloaded. This leads us to try to call a JNI function that's no longer there.

zhanhb · 2018-03-20T06:58:37Z

I rewrote jnr-ffi myself. When native block get finalized. All allocated native memory get finalized before native lib.

When allocate memory or dlopen a native lib, register a cleaner. Memory will be cleaned if the reference(Allocated Memory/Native Library) is not reachable.
https://github.com/java-native-call/jnc/blob/cfce88117ff2e428c9b59cc1887fa8f34695a510/jnc-runtime/src/main/java/jnc/foreign/internal/AllocatedMemory.java#L9-#L18
https://github.com/java-native-call/jnc/blob/cfce88117ff2e428c9b59cc1887fa8f34695a510/jnc-runtime/src/main/java/jnc/foreign/internal/AllocatedMemory.java#L37
https://github.com/java-native-call/jnc/blob/cfce88117ff2e428c9b59cc1887fa8f34695a510/jnc-runtime/src/main/java/jnc/foreign/internal/Cleaner.java#L32

When classloader get unloaded, clear all allocated native memories.
https://github.com/java-native-call/jnc/blob/cfce88117ff2e428c9b59cc1887fa8f34695a510/jnc-runtime/src/main/java/jnc/foreign/internal/NativeMethods.java#L12-#L20
https://github.com/java-native-call/jnc/blob/cfce88117ff2e428c9b59cc1887fa8f34695a510/jnc-runtime/src/main/java/jnc/foreign/internal/Cleaner.java#L118

headius · 2018-03-26T21:38:34Z

@zhanhb Very nice! Perhaps we can merge our efforts? We'd be happy to cooperate with you.

Meanwhile I will look at utilizing your change in jnr-ffi.

headius · 2018-03-26T21:45:05Z

What you have seems feasible; however there's a few improvements I might make. I do hope we can work together to bring your changes into jnr-ffi!

The finalized field on the Free object needs to be volatile and checked in an atomic way.
The runnables should not be strongly referenced; your current logic will prevent them from garbage collecting and the collection will continue to grow.
This additional logic should be possible to disable, for single-classloader applications that do not need the extra care around native memory and JNI unload.

Otherwise, I thank you for coming up with an elegant solution! Please let me know how we can work together to improve jnr-ffi!

headius · 2018-03-26T21:47:45Z

The finalized field on the Free object needs to be volatile and checked in an atomic way.

This likely applies to any other Runnable de-allocators you are registering.

blindpirate · 2018-07-06T09:23:52Z

We've also met this issue several times per week. Here's one of the heap dump:

Stack: [0x00007f081a4de000,0x00007f081a5df000],  sp=0x00007f081a5dc990,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [ld-linux-x86-64.so.2+0x9ef4]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 18780  java.lang.ClassLoader$NativeLibrary.find(Ljava/lang/String;)J (0 bytes) @ 0x00007f0822970d4c [0x00007f0822970d00+0x4c]
j  java.lang.ClassLoader.findNative(Ljava/lang/ClassLoader;Ljava/lang/String;)J+49
v  ~StubRoutines::call_stub
j  com.kenai.jffi.Foreign.freeMemory(J)V+0
j  com.kenai.jffi.MemoryIO.freeMemory(J)V+1
j  com.kenai.jffi.NativeMethods$ResourceHolder.finalize()V+8
J 24176% C2 java.lang.ref.Finalizer$FinalizerThread.run()V (55 bytes) @ 0x00007f0822161a38 [0x00007f0822161780+0x2b8]
v  ~StubRoutines::call_stub

Is it possible to run jruby in a new process? Seems jruby.launch.inproc=false is ignored now.

Stewori · 2018-08-11T17:16:03Z

It looks like this issue popped up in the Jython tracker as well: http://bugs.jython.org/issue2701
Is there any progress on this?

@zhanhb I wanted to look at your links posted in #4506 (comment) but they are broken (404).

I was able to kind of repair them, at least to find the files. If someone else wants to have a look, here are the restored links:
https://github.com/java-native-call/jnc-build/blob/master/jnc-runtime/src/main/java/jnc/foreign/internal/AllocatedMemory.java
https://github.com/java-native-call/jnc-build/blob/master/jnc-runtime/src/main/java/jnc/foreign/internal/NativeMethods.java

zhanhb · 2018-08-13T02:35:43Z

Link updated, it's ok now.

headius · 2018-10-26T21:39:49Z

We need help here, like a PR for jffi or jnr-ffi :-( Going to have to bump.

zhanhb · 2019-06-02T15:00:17Z

There's no issue to keep strong reference with the cleaner with. I rewrote the onUnload methods these days, maybe these code can be updated to jffi project.

mike-bourgeous · 2019-09-13T00:33:15Z

I'm having a possibly similar issue in a JRuby project running on Java 11. After upgrading to Java 11 I am getting SIGSEGV in several different native I/O routines during JRuby startup ~90% of the time, so it takes 10x+ as long to deploy the app.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f887098e9b8, pid=8753, tid=8759
#
# JRE version: OpenJDK Runtime Environment (11.0.4+11) (build 11.0.4+11-LTS)
# Java VM: OpenJDK 64-Bit Server VM (11.0.4+11-LTS, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64
# Problematic frame:
# C  0x00007f887098e9b8

Here are some of the locations of the crash:

fcntl

C  0x00007f887098e9b8
j  jnr.posix.BaseNativePOSIX.fcntl(ILjnr/constants/platform/Fcntl;)I+9
j  jnr.posix.CheckedPOSIX.fcntl(ILjnr/constants/platform/Fcntl;)I+6
j  jnr.posix.LazyPOSIX.fcntl(ILjnr/constants/platform/Fcntl;)I+6
j  org.jruby.util.io.PosixShim.setCloexec(IZ)I+8
j  org.jruby.util.io.PosixShim.pipe()[Ljava/nio/channels/Channel;+5
j  org.jruby.util.io.PopenExecutor.pipeOpen(Lorg/jruby/runtime/Thre
j  org.jruby.util.io.PopenExecutor.pipeOpen(Lorg/jruby/runtime/Thre
j  org.jruby.RubyKernel.backquote(Lorg/jruby/runtime/ThreadContext;
j  org.jruby.RubyKernel$INVOKER$s$1$0$backquote.call(Lorg/jruby/run

close

C  0x00007fc0cba990b8
j  jnr.enxio.channels.Native.close(I)I+4
j  jnr.enxio.channels.NativeDeviceChannel.implCloseSelectableChannel()V+4
j  java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel()V+1 java.base@11.0.4
j  java.nio.channels.spi.AbstractInterruptibleChannel.close()V+23 java.base@11.0.4
j  org.jruby.util.io.ChannelFD.finish()V+61
j  org.jruby.util.io.ChannelFD.close()V+1
j  org.jruby.util.io.PosixShim.close(Ljava/io/Closeable;)I+5
j  org.jruby.util.io.PosixShim.close(Lorg/jruby/util/io/ChannelFD;)I+2
j  org.jruby.util.io.OpenFile.finalizeFlush(Lorg/jruby/runtime/ThreadContext;Z)V+282

In each case it kind of looks like libc was loaded at the address in question, but somehow unloaded before the crashing method was called.

I tried putting the latest versions of jnr-ffi, jffi, jnr-posix, and jnr-enxio in the classpath with no improvement.

I opened this issue on jnr-ffi: jnr/jnr-ffi#194

Some crash logs are here: https://gist.github.com/mike-bourgeous/c24c225d86eb26db629e8fb09f57d6a2

Workaround

My final workaround was passing -Djruby.native.enabled=false to the JVM.

Some oddities:

The SIGSEGV only happens in our Kubernetes clusters, not on a Minikube node running the same exact container. There may be memory size differences between the two, so memory pressure might be a factor.

headius · 2020-02-12T15:35:37Z

@mike-bourgeous I suspect you are right in thinking your issue is linked to this one. Unfortunately I'm not sure we have a preferred solution for this yet. The problem is core to how the JVM does classloading and native libraries.

I believe another workaround is to make sure jffi is in a higher-up classloader, so it doesn't go away and unload the library...but I have not confirmed that.

headius added ffi JRuby 9000 labels Feb 24, 2017

headius added this to the JRuby 9.1.8.0 milestone Feb 24, 2017

headius added a commit to jnr/jnr-ffi that referenced this issue Feb 28, 2017

Ensure frees happen atomically and exactly once.

b5f5f32

See jruby/jruby#4506

headius added a commit to jnr/jffi that referenced this issue Feb 28, 2017

Ensure memory is freed exactly once. See jruby/jruby#4506.

d1aabb6

headius modified the milestones: JRuby 9.2.0.0, JRuby 9.1.8.0 Mar 1, 2017

DJRickyB mentioned this issue Mar 1, 2018

update jffi and release version for JRuby 1.7 #5072

Closed

sandeepsrkr mentioned this issue Mar 5, 2018

JVM segmentation fault in logstash 6.2.0 while building rake artifacts:rpm in Docker Container elastic/logstash#9209

Closed

enebo modified the milestones: JRuby 9.2.0.0, JRuby 9.2.1.0 May 24, 2018

wolfs mentioned this issue Aug 1, 2018

Using the asciidoctor task causes the Gradle daemon to crash asciidoctor/asciidoctor-gradle-plugin#253

Closed

headius modified the milestones: JRuby 9.2.1.0, JRuby 9.3.0.0 Oct 26, 2018

mike-bourgeous mentioned this issue Sep 13, 2019

SIGSEGV on Java 11 jnr/jnr-ffi#194

Open

headius mentioned this issue Aug 5, 2020

jruby-9.0.5.0-complete.jar: leaves stale jffi*.dll files in temp folder #3657

Open

headius modified the milestones: JRuby 9.3.0.0, JRuby 9.4.0.0 Jul 1, 2021

headius modified the milestones: JRuby 9.4.0.0, JRuby 9.5.0.0 Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

Segfault in jnr-ffi finalizer #4506

Segfault in jnr-ffi finalizer #4506

headius commented Feb 24, 2017

headius commented Feb 25, 2017

headius commented Feb 25, 2017

headius commented Feb 25, 2017

headius commented Mar 1, 2017

zhanhb commented Dec 12, 2017

headius commented Mar 1, 2018

zhanhb commented Mar 1, 2018 •

edited

Loading

headius commented Mar 19, 2018

zhanhb commented Mar 20, 2018 •

edited

Loading

headius commented Mar 26, 2018

headius commented Mar 26, 2018

headius commented Mar 26, 2018

blindpirate commented Jul 6, 2018

Stewori commented Aug 11, 2018

zhanhb commented Aug 13, 2018

headius commented Oct 26, 2018

zhanhb commented Jun 2, 2019

mike-bourgeous commented Sep 13, 2019 •

edited

Loading

headius commented Feb 12, 2020

Segfault in jnr-ffi finalizer #4506

Segfault in jnr-ffi finalizer #4506

Comments

headius commented Feb 24, 2017

headius commented Feb 25, 2017

headius commented Feb 25, 2017

headius commented Feb 25, 2017

headius commented Mar 1, 2017

zhanhb commented Dec 12, 2017

headius commented Mar 1, 2018

zhanhb commented Mar 1, 2018 • edited Loading

headius commented Mar 19, 2018

zhanhb commented Mar 20, 2018 • edited Loading

headius commented Mar 26, 2018

headius commented Mar 26, 2018

headius commented Mar 26, 2018

blindpirate commented Jul 6, 2018

Stewori commented Aug 11, 2018

zhanhb commented Aug 13, 2018

headius commented Oct 26, 2018

zhanhb commented Jun 2, 2019

mike-bourgeous commented Sep 13, 2019 • edited Loading

Workaround

headius commented Feb 12, 2020

zhanhb commented Mar 1, 2018 •

edited

Loading

zhanhb commented Mar 20, 2018 •

edited

Loading

mike-bourgeous commented Sep 13, 2019 •

edited

Loading