Simultaneous client reads / writes to the same socket from different threads can deadlock #4854

camlow325 · 2017-11-13T22:43:17Z

Environment

Provide at least:

JRuby version (jruby -v) and command line (flags, JRUBY_OPTS, etc)
9.1.13.0, 9.1.14.0, maybe others?
Operating system and platform (e.g. uname -a)
Windows 10, x86_64

Expected Behavior

Describe your expectation of how JRuby should behave, perhaps by showing how CRuby/MRI behaves.

When a client interleaves blocking calls to read data and write data to the same TCP socket from different threads, the client is able to eventually complete the read and write calls with no errors. This is what I've seen with the script below with MRI Ruby 2.4.2, running the script indefinitely. On JRuby 9k, however, the process deadlocks after a few seconds, never seeming to recover.

Provide an executable Ruby script or a link to an example repository.

Here's the script I've used to reproduce this problem. I just ran jruby.exe .\socket_sample.rb to launch the script. The script opens a server socket and accepts a single connection from a client. The script performs interleaved read/write calls on the client side of the socket (with no synchronization). On the server side, reads and writes are performed as well, but all from a single thread.

require 'socket'

server = TCPServer.open(50000)

Thread.start(server) do |server|
  Thread.current.name = "ServerSocket"
  client = server.accept
  server_loops = 1
  while true
    begin
      client.read_nonblock(20000)
      server_write = "Server write #{server_loops}"
      puts(server_write)
      client.puts(server_write)
      server_loops += 1
      sleep(0.1)
    rescue IO::WaitReadable
      puts("Server: waiting for data to read...")
      IO.select([client])
    end
  end
end

client = TCPSocket.open('localhost', 50000)

Thread.start(client) do |client|
  Thread.current.name = "ClientReader"
  client_read_loops = 1
  while true
    data_read = client.gets
    puts("Client received #{client_read_loops}: #{data_read}")
    client_read_loops += 1
  end
end

Thread.current.name = "ClientWriter"
client_write_loops = 1
while true
  client_write = "Client write #{client_write_loops}"
  if client_write_loops % 10000 == 0
    puts(client_write)
  end
  client.puts(client_write)
  client_write_loops += 1
end

Actual Behavior

Describe or show the actual behavior.

Running with MRI Ruby 2.4.2, the script appears to run fine indefinitely, with messages like the following in the output:

Client received 901: Server write 901
Client write 950000
Server write 902
Client received 902: Server write 902
Server write 903
Client received 903: Server write 903

Running with JRuby 9.1.14.0, however, the client read/write threads appear to deadlock one another after only a few seconds of running. The console output looks like the following:

Client write 50000
Server write 9
Client received 9: Server write 9
Server write 10
Server write 11
Server write 12
...
Server write 54
Server: waiting for data to read...

Running jstack against the Java process for a JRuby 9.1.14.0 run after the 'waiting for data to read...' message above, I see the following:

Found one Java-level deadlock:
=============================
"Ruby-0-Thread-2@ClientReader: socket_sample.rb:1":
  waiting for ownable synchronizer 0x00000000fa52ca88, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "main"
"main":
  waiting to lock monitor 0x00000000034e9bf8 (object 0x00000000fa52c920, a java.lang.Object),
  which is held by "Ruby-0-Thread-2@ClientReader: socket_sample.rb:1"

Java stack information for the threads listed above:
===================================================
"Ruby-0-Thread-2@ClientReader: socket_sample.rb:1":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000fa52ca88> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at org.jruby.util.io.OpenFile.lock(OpenFile.java:2672)
        at org.jruby.util.io.OpenFile.removeBlockingThread(OpenFile.java:2606)
        - locked <0x00000000fa528df8> (a org.jruby.util.io.OpenFile)
        at org.jruby.RubyThread.select(RubyThread.java:1811)
        - locked <0x00000000fa52c920> (a java.lang.Object)
        at org.jruby.RubyThread.select(RubyThread.java:1701)
        at org.jruby.util.io.OpenFile.readInternal(OpenFile.java:1368)
        at org.jruby.util.io.OpenFile.fillbuf(OpenFile.java:1264)
        at org.jruby.util.io.OpenFile.getlineFast(OpenFile.java:1594)
        at org.jruby.RubyIO.getlineInner(RubyIO.java:725)
        at org.jruby.RubyIO.getline(RubyIO.java:695)
        at org.jruby.RubyIO.getline(RubyIO.java:683)
        at org.jruby.RubyIO.gets(RubyIO.java:2233)
        at org.jruby.RubyIO$INVOKER$i$gets.call(RubyIO$INVOKER$i$gets.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:125)
        at C_3a_.socket_sample.invokeOther21:gets(C:\socket_sample.rb:30)
        at C_3a_.socket_sample.RUBY$block$\=C\-socket_sample\,rb$1(C:\socket_sample.rb:30)
        at java.lang.invoke.LambdaForm$DMH/468121027.invokeStatic_L8_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/1537458459.invokeExact_MT(LambdaForm$MH)
        at org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:145)
        at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:71)
        at org.jruby.runtime.Block.call(Block.java:124)
        at org.jruby.RubyProc.call(RubyProc.java:289)
        at org.jruby.RubyProc.call(RubyProc.java:246)
        at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:104)
        at java.lang.Thread.run(Thread.java:748)
"main":
        at org.jruby.RubyThread.select(RubyThread.java:1757)
        - waiting to lock <0x00000000fa52c920> (a java.lang.Object)
        at org.jruby.util.io.OpenFile.ready(OpenFile.java:511)
        at org.jruby.util.io.OpenFile.waitWritable(OpenFile.java:452)
        at org.jruby.util.io.OpenFile.waitWritable(OpenFile.java:464)
        at org.jruby.util.io.OpenFile.binwrite(OpenFile.java:2200)
        at org.jruby.util.io.OpenFile.fwrite(OpenFile.java:2056)
        at org.jruby.RubyIO.write(RubyIO.java:1456)
        at org.jruby.RubyIO.write(RubyIO.java:1428)
        at org.jruby.RubyIO$INVOKER$i$1$0$write.call(RubyIO$INVOKER$i$1$0$write.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:153)
        at org.jruby.RubyIO.write(RubyIO.java:2584)
        at org.jruby.RubyIO.putsSingle(RubyIO.java:2562)
        at org.jruby.RubyIO.puts1(RubyIO.java:2478)
        at org.jruby.RubyIO.puts(RubyIO.java:2451)
        at org.jruby.RubyIO$INVOKER$i$puts.call(RubyIO$INVOKER$i$puts.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:153)
        at C_3a_.Users.socket_sample.invokeOther47:puts(C:\socket_sample.rb:43)
        at C_3a_.Users.socket_sample.RUBY$script(C:\socket_sample.rb:43)
        at java.lang.invoke.LambdaForm$DMH/1450495309.invokeStatic_L7_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$BMH/242355057.reinvoke(LambdaForm$BMH)
        at java.lang.invoke.LambdaForm$MH/171497379.invoker(LambdaForm$MH)
        at java.lang.invoke.LambdaForm$MH/434176574.invokeExact_MT(LambdaForm$MH)
        at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
        at org.jruby.ir.Compiler$1.load(Compiler.java:95)
        at org.jruby.Ruby.runScript(Ruby.java:828)
        at org.jruby.Ruby.runNormally(Ruby.java:747)
        at org.jruby.Ruby.runNormally(Ruby.java:765)
        at org.jruby.Ruby.runFromMain(Ruby.java:578)
        at org.jruby.Main.doRunFromMain(Main.java:417)
        at org.jruby.Main.internalRun(Main.java:305)
        at org.jruby.Main.run(Main.java:232)
        at org.jruby.Main.main(Main.java:204)

Found 1 deadlock.

The reader thread has taken a lock on the channel within the RubyThread.select method but is waiting to take a lock (which the writer holds) within the OpenFile.lock() method. The writer thread has taken the lock held by the OpenFile object but is waiting to take the lock on the channel (which the reader holds) within the RubyThread.select method.

I've seen this occur not just in cases where the writing thread encounters a waitWritable condition but also in cases where the writer thread is blocked on a call to the synchronized OpenFile.unread method, which cannot proceed because the reader thread is in the middle of executing the synchronized OpenFile.removeBlockingThread method. In this case, the reader thread is unable to obtain the reentrant lock within OpenFile, though, because the writer thread already holds it - leading to another form of deadlock. I haven't yet been able to get a more narrow code sample to reproduce this second issue as frequently, though.

If a mutex is synchronized around the client read and write calls, the deadlocks no longer occur (even for JRuby):

require 'thread'
...
socket_mutex = Mutex.new
...
  socket_mutex.synchronize do
    data_read = client.gets
  end
...
  socket_mutex.synchronize do
    client.puts(client_write)
  end

This seems like a reasonable workaround to the problem, although maybe with some performance implications. It may be that MRI Ruby is more immune to this kind of problem because of the GIL, not sure?

The text was updated successfully, but these errors were encountered:

headius · 2017-11-14T23:15:22Z

I suspect this is bad lock management by the socket subsystem. Most operations in IO have been audited to lock and unlock all appropriate locks, in the same order all the time. Socket has not had such a touch recently.

Diagnosis: Straight-up deadlock in the OpenFile read/write lock, probably due to improper locking in socket library.

headius · 2017-11-14T23:16:38Z

Workaround for you for now would be to lock around these IO objects. It may be enough to require 'jruby/synchronized' and then include JRuby::Synchronized into whatever classes or objects you're using.

There are two changes here, with the first the more likely cause of #4854: 1. While leaving the OpenFile lock locked, we proceeded to attempt to lock the Selector being selected. If the order of these locks happens differently elsewhere, it will cause a deadlock. Fixed by unlocking the OpenFile lock before locking the Selector lock. 2. OpenFile.removeBlockingThread synchronized on both OpenFile's lock and the OpenFile instance itself. I could find only a handful of other places that lock OpenFile and they do not appear to lock anything else, but there's no reason to lock OpenFile here anyway.

headius · 2017-12-05T23:37:54Z

Easy enough problem to find: we were locking two locks that might get locked in a different order elsewhere. I made sure to unlock one them first, so it should at least avoid the deadlock you found.

headius · 2017-12-05T23:38:52Z

I am not sure how best to create a test for this, given the complexity of the original script and the difficulty of testing for deadlocks in CI. If you have ideas, please feel free to submit a PR against spec/ruby or test/jruby.

headius added this to the JRuby 9.1.15.0 milestone Nov 14, 2017

camlow325 mentioned this issue Nov 15, 2017

Synchronize writes and blocking socket reads njh/ruby-mqtt#105

Closed

headius closed this as completed Dec 5, 2017

jonathanswenson mentioned this issue Apr 24, 2019

IO.read_nonblock raises EOFError with concurrent reads #5706

Closed

gerald-dotcom mentioned this issue Dec 22, 2020

Simple Hello World HTTP Server Example? zeromq/cppzmq#453

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

Simultaneous client reads / writes to the same socket from different threads can deadlock #4854

Simultaneous client reads / writes to the same socket from different threads can deadlock #4854

camlow325 commented Nov 13, 2017 •

edited

Loading

headius commented Nov 14, 2017

headius commented Nov 14, 2017

headius commented Dec 5, 2017

headius commented Dec 5, 2017

Simultaneous client reads / writes to the same socket from different threads can deadlock #4854

Simultaneous client reads / writes to the same socket from different threads can deadlock #4854

Comments

camlow325 commented Nov 13, 2017 • edited Loading

Environment

Expected Behavior

Actual Behavior

headius commented Nov 14, 2017

headius commented Nov 14, 2017

headius commented Dec 5, 2017

headius commented Dec 5, 2017

camlow325 commented Nov 13, 2017 •

edited

Loading