Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File.stat sometimes fails with SystemCallError in JRuby 1.7.21 #3145

Closed
rsim opened this issue Jul 16, 2015 · 33 comments
Closed

File.stat sometimes fails with SystemCallError in JRuby 1.7.21 #3145

rsim opened this issue Jul 16, 2015 · 33 comments

Comments

@rsim
Copy link

rsim commented Jul 16, 2015

We have a downloadable application where we embed jruby-complete jar and in the latest version we updated JRuby from 1.7.20 to 1.7.21. After this update for several customers bundler initialization started to fail with

SystemCallError: Unknown error - Unknown Error (0) - LICENSE
                   stat at org/jruby/RubyFile.java:894
   validate_permissions at classpath:/META-INF/jruby.home/lib/ruby/shared/rubygems/specification.rb:2716

As I checked in this line File.stat on a "LICENSE" file from a gem is done and in RubyFile.java the line

return runtime.newFileStat(f, false);

raises this SystemCallError. When we change jruby-complete from 1.7.21 to 1.7.20 then the application starts without an error.

We are investigating what is common on these customer servers that might cause this error (as majority of customers do not have this problem). All these customers are using different Linux versions and different Java 7 or 8 versions. From one customer we learned that the application is installed on a NAS storage so I suspect that maybe the error appears when File.stat is done on a file that is located on a NAS mounted file system. Will updated this issue as I get more information.

I am trying to find what changes might cause this error. I checked that jnr-posix version 3.0.12 is the same in JRuby 1.7.20 and 1.7.21.

@mkristian I was checking the commit log and was wondering that maybe one of these commits 0137667, a50803b or edac27b could cause this error in File.stat execution.

@enebo
Copy link
Member

enebo commented Jul 16, 2015

In talking to @rsim a little on irc I can see we have not updated jnr-posix since 1.7.20 and also this is dying trying to stat within jnr-posix. My only thought is we must be passing in filename differently? I know we changed how we canonicalize requred files and there were also changes mentioned in the previous comments.

I think running with JRUBY_OPTS='-d -Xbacktrace.style=raw' might pick up some more info but since the only single extra data point is this might be happening accessing a file on a netapp appliance mounted to a linux machine. If I had to make a wild wild guess I would say we end up passing in the wrong pathname to stat which happens to cause the error.

@enebo enebo added this to the JRuby 1.7.22 milestone Jul 16, 2015
@mkristian
Copy link
Member

the file to test comes from a Specification#files - @rsim any chance to see the original gemspec file which causes this error ?

@rsim
Copy link
Author

rsim commented Jul 16, 2015

@mkristian I suspect that it is just the first gemspec and the first "LICENSE" file there and that File.stat would fail on any file on that file system.

If I won't be able to repeat this issue then I will try to ask this one customer if they could allow me to connect to their server and experiment with JRuby 1.7.21 and see if I can reproduce the issue with simple File.stat call.

@rsim
Copy link
Author

rsim commented Jul 17, 2015

@enebo Digging more into File.stat implementation I suspect that -Xbacktrace.style=raw won't give more information about this error as I think that SystemCallError is raised in https://github.com/jruby/jruby/blob/1.7.21/core/src/main/java/org/jruby/RubyFileStat.java#L125

if (stat == null) throw runtime.newErrnoFromInt(file.errno(), filename);

So it seems that file.stat() does not raise an exception but returns null, and based on the error message file.errno() is 0 and filename is "LICENSE".

So I am wondering how to debug why file.stat() is null for some customers.

Also I got more details from a different customer that had this issue and they do not have any network mounted filesystem - everything is on local filesystem. So my initial guess that this might be caused by a network mounted filesystem is wrong.

I will try to add some debug messages to find out the current directory for the "LICENSE" file to see if it will help to find something strange.

@mkristian
Copy link
Member

the currentDirectory should give some hints since https://github.com/jruby/jruby/blob/1.7.21/core/src/main/java/org/jruby/RubyFileStat.java#L122 where things did change between the two releases.

@rsim
Copy link
Author

rsim commented Jul 17, 2015

Debugged more details and found out that the issue was not caused by changes in JRuby 1.7.21 but by changes in bundler between versions 1.9.x and 1.10.x and due to jnr-possix file stat error code for missing files. Here are the details.

Found out that this "LICENSE" file was referenced from one private gem but during the build this file was not included in the application package. When we built the application with JRuby 1.7.21 we also used the latest bundler version 1.10.5 (but previously we used bundler 1.9.6). The latest bundler version during setup is calling validate method on this private gem (see https://github.com/bundler/bundler/blob/master/lib/bundler/rubygems_integration.rb#L50-L54) but the previous bundler version 1.9.6 was not doing this. And bundler is ignoring Errno::ENOENT exception during this validate method.

And so the root cause is that File.stat on majority of systems is correctly raising Errno::ENOENT exception. But on some system file.stat() on a missing file is returning file.errno() as 0 and therefore general SystemCallError error is raised which is not rescued by bundler.

In our application we will fix this issue by removing this invalid reference to "LICENSE" file in the gemspec for this private gem.

Should this be tracked further as a jnr-possix bug that file.stat() is not returning correct error code for a missing file?

@mkristian
Copy link
Member

in the recent past there were quite a few issue with file.stat() when native did not get loaded and jnr-posix switch over to a java implementation. somehow I do rememeber that the java impl behaves quite differently.

so my guess here is that on some linux the native ffi does not load for example when /tmp is not writable. jruby-complete.jar comes with the ".so" file packed but needs to copy it onto the filesystem before it can load it. I know some linux distributions (sabayon) make their /tmp read-only-root or so.

could this explain the random behaviour here ?

@enebo
Copy link
Member

enebo commented Jul 20, 2015

@rsim @mkristian I have seen weird returns from stat on Windows with native enabled when hitting UNC mounted filesystems. In that case the error return was not 0 but had really weird return values. I do not think we have figured out the best way to address this. If we fix it in jnr-posix it means we are deciding for all other projects that use it how we should error. If we fix it in JRuby then it means all other projects need to do mapping hacks if they do not want to cope with all the exotic return values.

In any case, your problem is even stranger (at least to me)...

@rsim I am really curious how this ends up as errno() 0. If as @mkristian is considering that native does not load I can actually see a code path where we can fail and return 0 from stat. In JavaLibCHelper any IOException thrown setting up the stat leads to a return 0 (note the TODOs in that part of the source). This looks super wrong, but I would love to get a reproduction on this before we change it so we can try and make some more tests.

@rsim Could you open an issue on jnr-posix. I think @mkristian idea of jnr-posix native not loading might be worth investigating on the customer system. Having them type:

jruby -Xnative.verbose=true -e 1

Should end up printing:

 Successfully loaded native POSIX impl.

If it doesn't then we definitely have a problem with our pure-Java version. It also means there may be some additional issues to work through since they are failing to load native.

@rsim
Copy link
Author

rsim commented Jul 27, 2015

I have one customer who reproduced this issue from the irb console. I asked them to execute

java -jar jruby-complete-1.7.21.jar -S irb

and then they got SystemCallError from File.stat:

irb(main):019:0> File.stat("missing file")
SystemCallError: Unknown error - Unknown Error (0) - missing file
from org/jruby/RubyFile.java:894:in `stat'
from (irb):19:in `evaluate'

But now I don't know how to debug it further. When I tried to execute jruby -Xnative.verbose=true -e 1 both on Mac OS X and on Ubuntu 12.04, then I got no output. Should JRuby 1.7.21 on some other systems print out Successfully loaded native POSIX impl.? In that case pure-Java version is not the cause as on our Linux server and on Mac OS X we get

irb(main):001:0> File.stat("missing file")
Errno::ENOENT: No such file or directory - missing file
        from org/jruby/RubyFile.java:894:in `stat'
        from (irb):1:in `evaluate'

Any other suggestions what can we try on the customer server to identify the cause for this? And it would be better if we could use java -jar jruby-complete-1.7.21.jar -S irb command line as downloading just one jar file on that server is easier than extracting all JRuby installation.

@enebo
Copy link
Member

enebo commented Jul 27, 2015

@rsim ah from jruby-complete I don't think we support -X directly. Try:

java -Djruby.native.verbose=true  -jar jruby-complete.jar -e 1

@rsim
Copy link
Author

rsim commented Jul 27, 2015

@enebo When I execute

java -Djruby.native.verbose=true  -jar jruby-complete-1.7.21.jar -e 1

both on my Mac OS X and on my Ubuntu 12.04 then I do not see any output. Does it mean that I have pure-Java version of jnr-possix? Then pure-Java version File.stat is working correctly with missing files.

@rsim
Copy link
Author

rsim commented Jul 27, 2015

@enebo Ah, when I execute

$ java -Djruby.native.verbose=true -jar jruby-complete-1.7.21.jar -e "File.stat 'missing file'"
Successfully loaded native POSIX impl.
Errno::ENOENT: No such file or directory - missing file
    stat at org/jruby/RubyFile.java:894
  (root) at -e:1

then I see that native POSIX is loaded. Will ask to execute this also on our customer server.

@mkristian
Copy link
Member

just realized we can switch off native via a cli:

$ java -Djruby.native.enabled=false -jar jruby-complete-1.7.22-SNAPSHOT.jar -e "File.stat 'missing file'"
SystemCallError: Unknown error - Unknown Error (0) - missing file
    stat at org/jruby/RubyFile.java:894
  (root) at -e:1

so it is the native not loaded !

@rsim still wondering why native does not load for those customers. is /tmp writable for their java process ? (just thinking)

@rsim
Copy link
Author

rsim commented Jul 28, 2015

@mkristian @enebo Got the response from the customer and as I see the reason for failing to load native POSIX implementation is failing to load missing libcrypt.so library. Also checked that they have write access rights to /tmp. Where should this libcrypt.so shared library be located?

$ /opt/java/jdk1.7u51/bin/java -Djruby.native.verbose=true -jar jruby-complete-1.7.21.jar -e "File.stat
'missing file'"
Failed to load native POSIX impl; falling back on Java impl. Stacktrace
follows.
java.lang.UnsatisfiedLinkError: libcrypt.so: cannot open shared object
file: No such file or directory
at
jnr.ffi.provider.jffi.NativeLibrary.loadNativeLibraries(NativeLibrary.java:87)
at
jnr.ffi.provider.jffi.NativeLibrary.getNativeLibraries(NativeLibrary.java:70)
at
jnr.ffi.provider.jffi.NativeLibrary.getSymbolAddress(NativeLibrary.java:49)
at
jnr.ffi.provider.jffi.NativeLibrary.findSymbolAddress(NativeLibrary.java:59)
at
jnr.ffi.provider.jffi.AsmLibraryLoader.generateInterfaceImpl(AsmLibraryLoader.java:117)
at
jnr.ffi.provider.jffi.AsmLibraryLoader.loadLibrary(AsmLibraryLoader.java:59)
at
jnr.ffi.provider.jffi.NativeLibraryLoader.loadLibrary(NativeLibraryLoader.java:43)
at jnr.ffi.LibraryLoader.load(LibraryLoader.java:287)
at jnr.ffi.Library.loadLibrary(Library.java:127)
at
jnr.posix.POSIXFactory$DefaultLibCProvider$SingletonHolder.<clinit>(POSIXFactory.java:271)
at
jnr.posix.POSIXFactory$DefaultLibCProvider.getLibC(POSIXFactory.java:275)
at jnr.posix.BaseNativePOSIX.<init>(BaseNativePOSIX.java:36)
at jnr.posix.LinuxPOSIX.<init>(LinuxPOSIX.java:18)
at jnr.posix.POSIXFactory.loadLinuxPOSIX(POSIXFactory.java:143)
at jnr.posix.POSIXFactory.loadNativePOSIX(POSIXFactory.java:121)
at jnr.posix.POSIXFactory.loadPOSIX(POSIXFactory.java:90)
at jnr.posix.LazyPOSIX.loadPOSIX(LazyPOSIX.java:35)
at jnr.posix.LazyPOSIX.posix(LazyPOSIX.java:31)
at jnr.posix.LazyPOSIX.allocateStat(LazyPOSIX.java:49)
at
org.jruby.util.RegularFileResource.stat(RegularFileResource.java:125)
at org.jruby.RubyFileStat.setup(RubyFileStat.java:123)
at org.jruby.RubyFileStat.newFileStat(RubyFileStat.java:97)
at org.jruby.Ruby.newFileStat(Ruby.java:3342)
at org.jruby.RubyFile.stat(RubyFile.java:894)
at
org.jruby.RubyFile$INVOKER$s$1$0$stat.call(RubyFile$INVOKER$s$1$0$stat.gen)
at
org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:326)
at
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:170)
at ruby.__dash_e__.__file__(-e:1)
at ruby.__dash_e__.load(-e)
at org.jruby.Ruby.runScript(Ruby.java:867)
at org.jruby.Ruby.runScript(Ruby.java:860)
at org.jruby.Ruby.runNormally(Ruby.java:729)
at org.jruby.Ruby.runFromMain(Ruby.java:578)
at org.jruby.Main.doRunFromMain(Main.java:395)
at org.jruby.Main.internalRun(Main.java:290)
at org.jruby.Main.run(Main.java:217)
at org.jruby.Main.main(Main.java:197)
SystemCallError: Unknown error - Unknown Error (0) - missing file
stat at org/jruby/RubyFile.java:894
(root) at -e:1

@rsim
Copy link
Author

rsim commented Jul 28, 2015

@mkristian @enebo Got more information from the customer which might provide a hint what is wrong. I asked them to list all libraries which have crypt in their name and I got from them:

$ ls -l /lib/lib*crypt* /usr/lib/lib*crypt* /usr/local/lib*crypt*
ls: cannot access /usr/lib/lib*crypt*: No such file or directory
ls: cannot access /usr/local/lib*crypt*: No such file or directory
lrwxrwxrwx 1 root root 18 Nov 5 2008 /lib/libxcrypt.so.1 -> libxcrypt.so.1.2.4
-rw-r--r-- 1 root root 51472 Nov 5 2008 /lib/libxcrypt.so.1.2.4

When I execute this on my Ubuntu 12.04 then I get

$ ls -l /lib/lib*crypt* /usr/lib/lib*crypt* /usr/local/lib*crypt*
ls: cannot access /usr/local/lib*crypt*: No such file or directory
lrwxrwxrwx 1 root root     22 Apr 14  2012 /lib/libcryptsetup.so.4 -> libcryptsetup.so.4.0.0
-rw-r--r-- 1 root root  97072 Apr 14  2012 /lib/libcryptsetup.so.4.0.0
lrwxrwxrwx 1 root root     20 Feb  5  2013 /usr/lib/libecryptfs.so.0 -> libecryptfs.so.0.0.0
lrwxrwxrwx 1 root root     20 Feb  5  2013 /usr/lib/libecryptfs.so.0.0 -> libecryptfs.so.0.0.0
-rw-r--r-- 1 root root 148648 Feb  5  2013 /usr/lib/libecryptfs.so.0.0.0

So as I see they have libxcrypt library but I have libcryptsetup library (which both as I understand are replacement for libcrypt library). Could it be so that native POSIX implementation fails to load when libxcrypt library is used? Or probably just the old version of libxcrypt are causing this (based on the date and version number it seems quite old)?

@mkristian
Copy link
Member

@rsim this is very similar to #2913 (comment)

I guess we should be able to load the libcrypt more reliable (IMO)

@rsim
Copy link
Author

rsim commented Jul 28, 2015

@mkristian Yes, this is the same root cause and the same workaround worked. They have the following libcrypt* files:

$ ls -l /lib/*/libcrypt* /usr/lib/*/libcrypt*
-rw-r--r-- 1 root root 43288 Sep 30 2013 /lib/x86_64-linux-gnu/libcrypt-2.15.so
-rw-r--r-- 1 root root 1612544 Apr 24 2012 /lib/x86_64-linux-gnu/libcrypto.so.0.9.8
-rw-r--r-- 1 root root 1930616 May 2 2014 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
lrwxrwxrwx 1 root root 16 Sep 30 2013 /lib/x86_64-linux-gnu/libcrypt.so.1 -> libcrypt-2.15.so
lrwxrwxrwx 1 root root 40 Apr 24 2012 /usr/lib/x86_64-linux-gnu/libcrypto.so.0.9.8 -> /lib/x86_64-linux-gnu/libcrypto.so.0.9.8

After creating a symbolic link

 ln -s /lib/x86_64-linux-gnu/libcrypt.so.1 /usr/lib/x86_64-linux-gnu/libcrypt.so

native POSIX implementation is loaded and the correct Errno::ENOENT exception is raised:

$ /opt/java/jdk1.7u51/bin/java -Djruby.native.verbose=true -jar jruby-complete-1.7.21.jar -e "File.stat 'missing file'"
Successfully loaded native POSIX impl.
Errno::ENOENT: No such file or directory - missing file
stat at org/jruby/RubyFile.java:894
(root) at -e:1

So probably the solution would be to look not just for libcrypt.so file but also for libcrypt.so.* or libcrypt-*.so files.

mkristian added a commit to mkristian/jnr-ffi that referenced this issue Aug 6, 2015
using the tar archive of oracle jdk on ubuntu does not
setup the load library path for finding libcrypt (used by jnr-posix).
this results in failed native support for jnr-posix. installing the
jdk through some debian/ubuntu repositories do not show this problem.

this patch also looks at /etc/ld.so.conf and /etc/ld.so.conf.d/* to
setup the internal search path. with this the above combination of
oracle jdk and ubuntu does find the native libraries for jnr-posix.

fixes jruby/jruby#2913
and jruby/jruby#3145
probably also elastic/logstash#3127 (comment)

Sponsored by Lookout Inc.
@enebo enebo modified the milestones: JRuby 1.7.22, JRuby 1.7.23 Aug 20, 2015
@enebo enebo modified the milestones: JRuby 1.7.23, JRuby 1.7.24 Nov 24, 2015
@enebo
Copy link
Member

enebo commented Jan 19, 2016

@mkristian is this still broken? I think we have updated jnr-ffi since then (or lib which deps on it)...

@enebo enebo removed this from the JRuby 1.7.24 milestone Jan 20, 2016
@enebo enebo modified the milestones: JRuby 1.7.25, JRuby 1.7.24 Jan 20, 2016
@atambo
Copy link
Member

atambo commented Feb 12, 2016

Did this fix make it into jruby 1.7.24? I've recently got this following error:

https://gist.github.com/atambo/acf070000c74f91466e7

When trying to run:

./java -Djruby.native.verbose=true -jar jruby-complete-1.7.24.jar -e 'load "META-INF/jruby.home/bin/jirb"'

on a redhat 7.1 machine.

@atambo
Copy link
Member

atambo commented Feb 12, 2016

When I run ls -l /lib64/libcrypt*
on redhat 7.1 I get:

-rwxr-xr-x. 1 root root   40816 Jan 19  2015 /lib64/libcrypt-2.17.so
lrwxrwxrwx. 1 root root      19 Oct 21 13:35 /lib64/libcrypto.so.10 -> libcrypto.so.1.0.1e
-rwxr-xr-x. 1 root root 2012800 Jun 23  2015 /lib64/libcrypto.so.1.0.1e
lrwxrwxrwx. 1 root root      22 Feb 24  2015 /lib64/libcryptsetup.so.4 -> libcryptsetup.so.4.6.0
-rwxr-xr-x. 1 root root  162512 Dec 18  2014 /lib64/libcryptsetup.so.4.6.0
lrwxrwxrwx. 1 root root      16 Feb 24  2015 /lib64/libcrypt.so.1 -> libcrypt-2.17.so

@headius
Copy link
Member

headius commented Feb 13, 2016

@atambo A hunch: try -J-Djava.library.path=<path to lib64> and see if that fixes it? Someone else reported an issue finding native libs and this fixed it for them. We may still have some issues with finding lib paths.

@atambo
Copy link
Member

atambo commented Feb 15, 2016

@headius, when I run ./java -Djruby.native.verbose=true -Djava.library.path=/lib64 -jar jruby-complete-1.7.24.jar -e 'load "META-INF/jruby.home/bin/jirb"'

I get:

Successfully loaded native POSIX impl.

So it looks like jruby just doesn't look in /lib64.

@enebo
Copy link
Member

enebo commented Feb 15, 2016

@headius (and anyone else interested) This is not the right place for this comment but I am adding it just so we start to think about it. Perhaps this path loading list we hardcode in jnr-ffi should be made to be separately releasable? I know it is in a maven artifact and I do not really want another artifact but every OS and version of OS seems to move this stuff around or have wrinkles. It is also temporal in nature. Where will a new Ubuntu store something? If we could just gem release (I am less interested in solving this on Java side) updated location info then we would not need to release the planet of jnr-* artifacts neccesarily.

@mkristian
Copy link
Member

@enebo google says that redhat is using /etc/ld.so.conf and /etc/ld.so.conf.d/* as ubuntu/debian/mint are doing. there are no hardcoded paths in the code. so it is hard to say what is going wrong here not finding /lib64

would need to install redhat or centos for debugging

@atambo
Copy link
Member

atambo commented Apr 11, 2016

@mkristian, when I look at /etc/ld.so.conf in the latest official centos docker image (https://hub.docker.com/r/library/centos/) I see:

include ld.so.conf.d/*.conf

and when I look at the /etc/ld.so.conf.d/ directory I see nothing inside that directory.

@mkristian
Copy link
Member

@atambo thanx for this info and then I am not surprised we can not find "any" library :P

@enebo enebo modified the milestones: JRuby 1.7.26, JRuby 1.7.25 Apr 11, 2016
@enebo
Copy link
Member

enebo commented Apr 11, 2016

This probably should have been made into a second issue since @atambo issue is I am guessing different in how we process ld.so.conf files but I will just leave it open and advance to 1.7.26,

@mkristian
Copy link
Member

@atambo when I use following Dockerfile

FROM centos

RUN yum update -y
RUN yum install -y java-1.8.0-openjdk-devel

VOLUME /projects

ENV JAVA_HOME /usr/lib/jvm/jre

WORKDIR /projects/active/jruby

CMD bash -l

and execute

java -Djruby.native.verbose=true -jar jruby-complete-1.7.24.jar -e 'load "META-INF/jruby.home/bin/jirb"'

and do see following java.load.path

irb(main):003:0> ENV_JAVA['java.library.path']
=> "/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib"

how did you install java ? I see something like ./java which looks odd (or not).

@atambo
Copy link
Member

atambo commented Apr 14, 2016

I was using the IBM JRE so that may be the underlying reason.

@mkristian
Copy link
Member

@atambo ok - let me see how to install the IBM JRE then :)

@enebo enebo modified the milestones: JRuby 1.7.26, JRuby 1.7.27 Aug 26, 2016
@jhyoty
Copy link

jhyoty commented Sep 8, 2016

We had a similar problem (JRuby 1.7.25 and 1.7.26) on RHEL 7 and after some debugging found out that on CentOS/RHEL /etc/ld.so.conf.d does not contain references to the standard library paths (e.g. /lib64, apparently the standard paths are hard-coded in the dynamic linker). Therefore, if one sets the java.library.path system property to some custom value, on RHEL JNR will only search that path for libraries, failing to find any and eventually resorting to the Java System.mapLibraryName call (see jnr.ffi.Platform and jnr.ffi.LibraryLoader). If the system property is not overridden, at least OpenJDK will populate it with the system default library paths and everything works.

The posix interface tries to load "libc.so.6" and "crypt" -libraries, latter of which is present in the system with the name libcrypt.so.1. JNR search algorithm would normally find it, but in this case the mapLibraryName fall-back will produce the name "libcrypt.so" and loading the library and the native posix interface fails.

@headius
Copy link
Member

headius commented Mar 15, 2017

I'm not sure what there is to fix here. Since the last reports, the libraries in question have been updated to do a better job of searching for needed libraries. I'm going to close this as invalid. If someone's still affected by this on 1.7 (or 9k) please open a new issue with your reproduction.

@splashing-atom
Copy link

For posterity, I resolved this same issue by removing the 'noexec' flag from a tmpdir mount.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants