Some data can cause String#encode to hang #2856
Comments
I tried reproducing this, but both JRuby 1.7.19 and master produced results almost immediately. There is a bit of a disparity in the resulting data length, however. MRI 2.2 and JRuby master report 1121, whereas JRuby 1.7.19 reports 2242 (mirroring your results for MRI 2.0.0p353).
I've been able to consistently reproduce on OSX and Linux. I also had someone else reproduce it on their OSX machine. I'm on OSX 10.10.3 and Java 7. JRuby is installed with rbenv.
What else can I do to help find out what's causing this?
Not that it's a great answer, but if you could open a gist with a few of these strings that are problematic, I'd be happy to check them out. I'll try a few Java versions, too.
In this gist there are 12 base64 encoded strings that have this problem. It also includes a script I created that randomly generates strings and runs encode on them until it finds one that hangs.
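For readers without access to the gist, a random search of this kind might look roughly like the sketch below. This is not the script from the gist: the helper names `try_encode` and `find_hanging_input`, the buffer size, and the time limit are all assumptions made for illustration. It also assumes `Timeout` can interrupt the stuck call, which may not hold if the loop spins inside Java code on JRuby.

```ruby
require "timeout"
require "securerandom"

# Hypothetical helper (not from the gist): attempts the UTF-8 -> UTF-16
# conversion and reports whether it finished within the time limit.
def try_encode(bytes, limit_seconds)
  Timeout.timeout(limit_seconds) do
    bytes.dup.force_encoding(Encoding::UTF_8)
         .encode(Encoding::UTF_16BE, invalid: :replace, replace: "")
  end
  :ok
rescue Timeout::Error
  :hung
end

# Hypothetical helper (not from the gist): generates random byte strings
# until one makes try_encode time out. Returns that string, or nil if
# every attempt completed.
def find_hanging_input(attempts: 1000, size: 1024, limit_seconds: 5)
  attempts.times do
    candidate = SecureRandom.random_bytes(size)
    return candidate if try_encode(candidate, limit_seconds) == :hung
  end
  nil
end
```

A candidate returned by `find_hanging_input` could then be base64 encoded (for example with `[candidate].pack("m0")`) so it can be shared as text.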
I could not get master to fail in any case, but this isn't surprising...the transcoder is now identical to MRI's. JRuby 1.7 worked ok on Java 8u40, but on Java 7u67 I was able to reproduce your results with both your original script and on the longer-running random search. So it seems there may be a Java bug here. I will investigate a bit to see if I can improve my transcoder to avoid this problem.
Java 7 appears to raise different errors for some cases, and these cases were not handled in the encoder loop. As a result, they could trigger an infinite loop on bad input. This appeared to take the form of cleaved UTF-16 surrogate pairs leading to underflow, whereas in Java 8 those pairs do not get cleaved. Fixes #2856.
I've fixed the CharsetTranscoder so it should properly handle bad input as well as Java 7's unusual errors.
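To make the term "cleaved surrogate pair" concrete, here is a small Ruby illustration at the byte level. It is only a sketch of what a split pair looks like, not the CharsetTranscoder code or the Java 7 behavior described above; the helper names `hex_bytes` and `cleave_after_high_surrogate` are made up for this example.

```ruby
# Hypothetical helper: formats a string's bytes as uppercase hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

# Hypothetical helper: keeps only the first two bytes of a UTF-16BE
# string, which for a surrogate pair leaves a lone high surrogate.
def cleave_after_high_surrogate(utf16)
  utf16.byteslice(0, 2)
end

# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
# surrogate pair for it: D83D (high) followed by DE00 (low).
pair = "\u{1F600}".encode(Encoding::UTF_16BE)
puts hex_bytes(pair)

# Cutting the buffer between the two halves leaves input that is not
# valid UTF-16 on its own.
high_only = cleave_after_high_surrogate(pair)
puts high_only.valid_encoding?
```

A decoder that receives only the first half has to either wait for more input or report an error; a loop that handles neither outcome can spin forever.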
Awesome. Thanks for taking the time to fix this!
I have code that attempts to remove invalid characters by converting an input (supposedly) UTF-8 string to UTF-16 and back to UTF-8.
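A minimal sketch of that round trip, for context. The original code is not shown in this issue, so the method name `strip_invalid_utf8`, the choice of UTF-16BE, and the empty replacement string are assumptions:

```ruby
# Hypothetical helper (the real code is not shown in this issue).
# Drops byte sequences that are not valid UTF-8 by round-tripping
# through UTF-16.
def strip_invalid_utf8(input)
  # Tag a copy of the bytes as UTF-8 without changing them.
  utf8 = input.dup.force_encoding(Encoding::UTF_8)

  # UTF-8 -> UTF-16: invalid sequences are replaced with an empty
  # string, which removes them. This is the call that can hang.
  utf16 = utf8.encode(Encoding::UTF_16BE, invalid: :replace, replace: "")

  # UTF-16 -> UTF-8: the result now contains only valid characters.
  utf16.encode(Encoding::UTF_8)
end
```

For example, `strip_invalid_utf8("abc\xFFdef")` should give back `"abcdef"`.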
If given a string of random binary data instead of mostly valid UTF-8, the first call to encode (UTF-8 -> UTF-16) can hang and appears to never return. I wrote a test script that demonstrates a case where this happens consistently. It does not happen with all random data, but it's pretty easy to find a case that does this by just randomly generating bytes.
Here's me running it on MRI Ruby. It took less than 1 second:
Here's me running it on the latest stable JRuby. I gave up after 4 minutes:
Here's what that thread was doing before I killed it: