Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errno::EAGAIN thrown by SSLSocket#connect_nonblock #93

Closed
mohamedhafez opened this issue May 29, 2016 · 11 comments · Fixed by jruby/jruby#3952
Closed

Errno::EAGAIN thrown by SSLSocket#connect_nonblock #93

mohamedhafez opened this issue May 29, 2016 · 11 comments · Fixed by jruby/jruby#3952

Comments

@mohamedhafez
Copy link
Contributor

mohamedhafez commented May 29, 2016

A few times a day I'm getting Errno::EAGAIN thrown here in Net::HTTP#connect, I'm not sure if this is a fault of the ruby code not handling Errno::EAGAIN, or if MRI ruby's SSLSocket#connect_nonblock handles Errno::EAGAIN internally.

I'm running a patch that catches the Errno::EAGAIN, sleeps 0.01, and then retries, and it works just fine, if the case is that the ruby code should really be handling this then I can submit a fix to ruby and we can close this issue. If that's not the case though and this should be handled internally by jruby's SSLSocket implementation to match MRI's behavior, then the stack trace points to this error being thrown at:

org/jruby/ext/openssl/SSLSocket.java:225:in `connect_nonblock'

(I have no idea how to reproduce Errno::EAGAIN to see if MRI handles it or not, sorry!)

@kares
Copy link
Member

kares commented May 30, 2016

Thanks Mohamed, unfortunately I'm not sure someone would run into trying to fix this without reproducing.

@mohamedhafez
Copy link
Contributor Author

Now that I think about it, how would SSLSocket#connect_nonblock handle Errno::EAGAIN with out blocking anyway? That error literally means "wait for a bit and then retry", and a nonblocking method shouldn't be sleeping and then retrying... I think the correct fix should be dealing with the error in Net::HTTP#connect. I'm going to ask on the main ruby mailing list real quick and then probably close this issue

@mohamedhafez
Copy link
Contributor Author

Jeez ignore that last one, I obviously know nothing about nonblocking io coding:/ After reading a bunch and looking at the code, I'm guessing that if a socket would block, it should always either be wait readable or wait writable, and there shouldn't ever be a generic "this socket would block on something but I can't tell you if its reading or writing", correct? I'm trying to go through the jruby code myself and see if I can't spot where those EAGAINs might happen, since I also have another couple errors I'm encountering once every few days in production that I don't know how to reproduce, and I'm hoping maybe there is a chance I could figure it out and fix them.

@mohamedhafez
Copy link
Contributor Author

mohamedhafez commented May 30, 2016

I think I might have found what is causing this: If the InterruptedException is thrown here in SSLSocket#waitSelect, it would cause the behavior I'm seeing. If blocking is set to false, as it is in this case, the run method of the RubyThread.BlockingTask will execute very quickly, and you'd expect interrupts to be extremely rare, but I'm guessing there is some high-load condition or race condition or something where an interrupt can get set before the method completes.

Do we really need to execute the code for when blocking is false inside the RubyThread.BlockingTask in the first place, since all the code in this case will be non blocking and execute quickly? Unless I'm missing something, wouldn't moving this code outside of the RubyThread.BlockingTask make things a tiny bit more efficient, in addition to hopefully solving this issue? I'd be happy to put together a pull request if so, just let me know if my reasoning sounds correct.

@mohamedhafez
Copy link
Contributor Author

I ended up implementing what I suggested in the previous comment and am currently running this code in production. However, I'm still seeing the EAGAIN errors, because waitSelect is still returning false here...

I must be completely misunderstanding what key and selectedKeys() are if that is happening... I'll read up and see if I can't figure out what is going there.

@mohamedhafez
Copy link
Contributor Author

So I'm guessing the selector that is retrieved with getSelectorPool might also be used by other threads, watching other channels? So that when selectNow is called, it might return non-zero if some other channel is ready, but not ours? In that case, couldn't we execute this code to return :wait_readable or :wait_writable instead of causing EAGAIN to be raised?

If that sounds reasonable to someone who knows this code, I'd be willing to try it out and report back. (A little hesitant to try it out in production until then since I'm taking a shot in the dark here)

@mohamedhafez
Copy link
Contributor Author

Looking through jruby code, it looks like runtime.getSelectorPool().get() will result in a synchronized checkout from the pool (or the creation of a new selector), so there goes my theory in the previous comment. But then how the hell do we ever end up in a situation where right after selectNow our key isn't in selectedKeys?? We're only checking the first key in selectedKeys actually, maybe there is a another, old key still in there that didn't get cleaned up somehow from a previous use of the selector?

@kares
Copy link
Member

kares commented Jun 2, 2016

very najs session!
your theory makes sense - would be interesting to debug this (since you already run with patches) ...
would log the key and set contents if ( result[0] > 1 ) (more than one key) to see if its included.
also set -Djruby.openssl.debug=true might reveal some stack-traces (e.g. on key cleanup)

@mohamedhafez
Copy link
Contributor Author

mohamedhafez commented Jun 3, 2016

Ok so I've been running code with the extra logging, and been getting some really weird results from this line.

Many times there are indeed two keys in the set returned by selectedKeys(). Sometimes ours is the first key and then we are all good, sometimes ours is the second key and currently the current code doesn't see it because it only checks the first, doing keySet.iterator().next() == key. That would be an easy enough fix, but I don't think there should be any other keys in there in the first place. Sometimes also, there is only one key returned by selectedKeys(), and its not ours.

So I think its safe to say that we are indeed getting selectors that already have keys in them... haven't been able to see where that would happen despite looking for a couple hours... @headius it looks like you were the original author of the SelectorPool class, any thoughts on what might cause this behavior?

One really weird thing I've seen is that I'm only getting these errors in my frontend Puma processes. My backend daemon, which makes a hell of a lot more http/s requests, hasn't shown any unexpected results at all so far (i.e. no output at all from here). Maybe there is some code in Puma or something it relies on that returns selectors to the pool without cancelling its keys? (I can't find anything in Puma itself though) In any case, it would probably be best to have runtime.getSelectorPool().put(selector) cancel the keys and clean up the selector internally instead of depending on whatever library or gem is calling it to have done it... though I only see two places in JRuby code and jruby-openssl code that call this, and both do the same cleanup process...

Oh, I haven't seen any stacktraces from key cleanup, so I don't think its that (it doesn't look like key.cancel() throws anything anyway, I don't think any of the other cleanup stuff going wrong would cause this effect). The only other debug output I've seen are a lot of stuff from the Cipher class, and also when closing a socket and there is some error there for some reason, like the connection was reset by a peer or something, which I don't think has anything to do with this.

@mohamedhafez
Copy link
Contributor Author

I started running this patch in production to clean up keys in SelectorPool#put internally as I suggested in my last comment, I'll report back how it goes...

@mohamedhafez
Copy link
Contributor Author

mohamedhafez commented Jun 3, 2016

Looks like that did the trick! Looking through jruby code some more, I think this line might be the culprit: its the only place that checks if a key is valid before cancelling it, but the key can be invalid because the channel was closed, and perhaps in that case if it doesn't get cancelled it still can be reported by select.

Anyway, I think the correct fix is to ensure the key-cleaning takes place, correctly, in SelectorPool#put internally instead of having the key-cleaning code in (at least) 3 different places. I'll put together a pull request out of the patch from my previous comment and submit it to jruby.

@kares unless you object I think I'll also send in a pull request for a cleaned-up version of my patch to move waitSelect's nonblocking code out of the RubyThread.BlockingTask: I don't think it has anything to do with this bug, but still it would probably save a bit of overhead on a common task I'd imagine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants