Some data can cause String#encode to hang #2856

Closed
marshalium opened this issue Apr 20, 2015 · 7 comments

@marshalium
Contributor

I have code that attempts to remove invalid characters by converting an input (supposedly) UTF-8 string to UTF-16 and back to UTF-8.
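A minimal sketch of the round trip described above, for readers who want to see the whole idiom in one place. The method name `strip_invalid_utf8` is hypothetical and is not taken from the reporter's code; the `encode` options mirror the ones used in the test script in this report.

```ruby
# Hypothetical helper: drop bytes that are not valid UTF-8 by transcoding
# to UTF-16 and back. Invalid or undefined sequences are replaced with an
# empty string on the way out, so they simply disappear.
def strip_invalid_utf8(input)
  utf8 = input.dup.force_encoding("UTF-8")
  utf16 = utf8.encode("UTF-16", :undef => :replace, :invalid => :replace, :replace => '')
  utf16.encode("UTF-8")
end

cleaned = strip_invalid_utf8("abc\xFFdef")
puts cleaned.valid_encoding?
```

The first `encode` call in this sketch is the one that hangs on the affected JRuby and Java combination.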

If given a string of random binary data instead of mostly valid UTF-8, the first call to encode (UTF-8 -> UTF-16) can hang and appears to never return. I wrote a test script that demonstrates a case where this happens consistently. It does not happen with all random data, but it's pretty easy to find a case that does this by just randomly generating bytes.

#!/usr/bin/env ruby

require 'base64'

b64_str = <<-EOS
1UflCDMEvBUpiI0dm73ApUSGAZBpWnk3HnW2mt68zj5LtNPJJTJ7eZi3Fr8Y
zhIb3lANSO4OMtTCHZLzPREbFS4i42tNbpO49AeFnBF963sDMUzQCSXFNlKx
6zZlAz342wWs90mJkcJEvr6SyqzXyA98NNa3K6G1DOjWhTYb52pkAEB5CF3i
mL5FKPHrvT+grhY3fgUeSlWU/Ozux0vwtD6iaO5l/aGr9yzBD8mDyZASxNOe
g+Bq+q8DUuh97W5O6wMYs2/X+p10/mXwT87MvOk+hSFk8EUVPesaY4T306eq
o6eJbCsBVIrQRqKGnh5vKkzWoXdg+f7bdqhTn4Ut3L/8gRsTB0/dAp6p1S6F
YvZVqRiW75pIYhcrHmtJFAcV49fmyjgvyuSwfcVXuIzELA3scTXQoPjJZqLS
t8YZeHm2fsQgEX28QqbW27SBMPIF7juznZhge2c8TmNN3AWdZOBFY3qFYYKH
aMG20j9XMXxEY2dqrw33XBcc2D/czO0rdmzK/XBdotMWqpZAgyE0k86FfG35
Weqd/PcTOCbwHFLZUHcSO5+2bFeTGHlf77ThLkNGYM7+RuWY2IY9X3f3PMa7
bWo0UaFFrEzfq23DVz32Sm5YDt0ygwgLV7XhnJ8B7a5gNtg8c/WZvNnzYdbN
KWfaU79yQmgbIda8PO33hXxCygAQMsPON5lWqXrulxjge+LhlDiIQvKkZA4P
pDL9k6MI81YzkBWjFeOkK4LCWsdiQylk0w0xNR9/PhwNIWCWsatsR4YK9kEF
4dKF4OooLO1xw/4P0BkdvHSP1EQm/CvAMdi7ZW48/j4yJbwNnwl9OBAizout
ZJ4dSun5hEnuRovB1OeyOz7E6Nc17F8Qrq0WRgaR0KzMxuyK9teHIu5GSoYg
e57sUEmDxrD0h1f1Vq4uFgplW504zSXCl1An7F/FGOXihNStze+KP+gnqq2T
iQ5OlcVorCIur8RRQwzRGayBiSjmDVlmpzclnqw6TytWBeY/q6rtMmBLoV3g
OqAOgiRiM5DexPw0ZxLnuJA2TACkn3Yeikka72YONTgZeU9ZPdgYM+d4y+hX
AvwaiIdOJBAoOHizAFBs4RcrRst88DftnZZAZh5CenV12nn0YF1eaoLeHMdH
XWkBxn5Ihi6mMTRlZ2tviHMalkIyJtXrwoeqHuWqq2N3r+FGWIJAqbmE1wIr
GvbObjMyGsLs3KJd0qtDxamU+gicMpf4gzW/dbFLuFhOiR8asrFvEZETssBW
YzwBN+i6izcmhbo8BzFBf9T9jQl56XQHrR1sn4GmEaemJQuXRSxovu8pMgYR
SW9cG4sF2JepxwSkuXyFx1gp02R0CBHE3D6oul4k1PZ+/qs6DvoOL+RE87BI
0o7LttNbp9hOrGhx6g7m9NoIEkemmj805G7MlDzyR9PuDygcC8Qzf4a8aCku
fs7ayAW7+hZxyxryE7V6V1caVHiCi6XQPw5RLuH6ukDttb4QltTguIch/qNS
AUKtWMJxjKLJxkKWv+W9GLpbhiXsfQAu1Z6Tkln7x3rsCiJNZPv76UhmiPcP
buor2LV7nbw6bjFkYKMSNwdJJG5iVV54zao7EZ0boxVzXba+iQFCUesDSaDg
auBBx5N9bXQRsidr+oi9RHwu9KE5RtPtACXgFrvnqtw7CPd4UbuyXOe+GZUP
66FHyBQUHhW21xvkcLb1L2E52RQuleJu71SYVVqsob5d+SH2QMbUjUsbTmiZ
OBvsMDl62sVSKfGMZL41LcMYmfxA4fEXZehIYgaTwpTPFo/Opme/nbcBoszT
OZzb96IUnXh9ej+9qFC+fB6yB5FDHQd/LFiP7ULaDlIoAXZYXjidwFWFGXld
PG4oqarLnLAxqy31olu39HERJq7+Fxw97++CejMvuOkWYjGCdYcOkZ70CNl1
40jimz00oACiv3/QqxcRyDluXzbs9iweJ+KbdISwxrOqC17l11LENa9Wybeg
tAIkR5OzNKEDM7cYdnDjnt4JXGo/3OiNo06W4AcNH+I0a/gu6SwozDc+htyP
gfhpGwBaeL+KPOEvn6JXLyvE6DBGGPWBr0JXewgOqx/gyOIp1zlqB1peBNKl
gOx8jd+PoYvEAqTjmDKzXHICUE8s07YKbyHlmxI9srv9Ffhw88+EuHuHAhfr
91JFOeWo49nwXA8tKUVOckypJHgb0475NkLeAII5275RYXBU4EO9jjdtY8px
YpkhgP6UTXHA+rDbutlp9sDXL+iftBXfH8bsyJqbBSYcgYMPgF15I702CQuB
xiG3xlw7Uq0Xi/2DiWGn14ibS4zW5QMaQWTWVlEYRQVO7/AWMGWU6/Blg3yM
vCAwhIFY59ZbLbHoNABI1jXatNYIM7Q3W0kIsfqaxZXqAHDzW5YvXo16r1L2
mWuNFGmZBVMG/qTfXDk9boORvxG0eLnfVmou9orNSke6eWX3L8/EIuYrEyFx
d6o8q96iN/zbn720NNj0C3wCEvitk9+2RhMqAbT2VcXz/HcZV2GNnI0FP03r
0XyqgBXo4snc/BOcO/9FdJ5kkAORHCdlTETZJydyI1oVaatIZoRZrqqi/jg3
9tYuEIHyBavZ3n8V8lupuWwv3dQcDJAz9QgqYZ4Lv3FbulLWEIPwyxsK8+1O
BhrOQScB1WdstQNCg+3Ihqo+RxaCK6nDZZrDQMCXTPtfS58n+b7+5WUpPVN4
8y/sOrYIDt2tAeaprRUnIZMupHEh4dE=
EOS

data = Base64.decode64(b64_str)
data.force_encoding("UTF-8")

puts "Before calling encode: data.length=#{data.length}"
data = data.encode("UTF-16", :undef => :replace, :invalid => :replace, :replace => '')
puts "After calling encode: data.length=#{data.length}"

Here's me running it on MRI Ruby. It took less than 1 second:

$ ruby --version
ruby 2.0.0p353 (2013-11-22 revision 43784) [x86_64-darwin14.0.0]
$ time ~/tmp/encoding_test.rb 
Before calling encode: data.length=1965
After calling encode: data.length=2242

real  0m0.095s
user  0m0.054s
sys 0m0.040s

Here's me running it on the latest stable JRuby. I gave up after 4 minutes:

$ ruby --version
jruby 1.7.19 (1.9.3p551) 2015-01-29 20786bd on Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13 +jit [darwin-x86_64]
$ time ~/tmp/encoding_test.rb 
Before calling encode: data.length=1965
^C
real  4m33.090s
user  4m33.700s
sys 0m0.770s

Here's what that thread was doing before I killed it:

$ jstack 69020
"main" prio=5 tid=0x00007fd3aa802800 nid=0xf07 runnable [0x0000000109060000]
   java.lang.Thread.State: RUNNABLE
        at org.jruby.util.encoding.CharsetTranscoder$TranscoderEngine.encode(CharsetTranscoder.java:660)
        at org.jruby.util.encoding.CharsetTranscoder$TranscoderEngine.transcode(CharsetTranscoder.java:530)
        at org.jruby.util.encoding.CharsetTranscoder.primitiveConvert(CharsetTranscoder.java:333)
        at org.jruby.util.encoding.CharsetTranscoder.transcode(CharsetTranscoder.java:294)
        at org.jruby.util.encoding.CharsetTranscoder.transcode(CharsetTranscoder.java:236)
        at org.jruby.util.io.EncodingUtils.transcodeLoop(EncodingUtils.java:874)
        at org.jruby.util.io.EncodingUtils.strTranscode0(EncodingUtils.java:802)
        at org.jruby.util.io.EncodingUtils.strTranscode(EncodingUtils.java:737)
        at org.jruby.util.io.EncodingUtils.strEncode(EncodingUtils.java:708)
        at org.jruby.RubyString.encode(RubyString.java:7619)
        at org.jruby.RubyString$INVOKER$i$encode.call(RubyString$INVOKER$i$encode.gen)
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:346)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:204)
        at Users.mscorcio.tmp.encoding_test.__file__(/Users/mscorcio/tmp/encoding_test.rb:58)
        at Users.mscorcio.tmp.encoding_test.load(/Users/mscorcio/tmp/encoding_test.rb)
        at org.jruby.Ruby.runScript(Ruby.java:866)
        at org.jruby.Ruby.runScript(Ruby.java:859)
        at org.jruby.Ruby.runNormally(Ruby.java:728)
        at org.jruby.Ruby.runFromMain(Ruby.java:577)
        at org.jruby.Main.doRunFromMain(Main.java:395)
        at org.jruby.Main.internalRun(Main.java:290)
        at org.jruby.Main.run(Main.java:217)
        at org.jruby.Main.main(Main.java:197)
@nirvdrum
Contributor

I tried reproducing this, but both JRuby 1.7.19 and master produced results almost immediately. There is a bit of a disparity in the resulting data length, however. MRI 2.2 and JRuby master report 1121, whereas JRuby 1.7.19 reports 2242 (mirroring your results for MRI 2.0.0p353).

@marshalium
Contributor Author

I've been able to consistently reproduce it on OSX and Linux. I also had someone else reproduce it on their OSX machine.

I'm on OSX 10.10.3 and Java 7. JRuby is installed with rbenv.

$ rbenv --version
rbenv 0.4.0
$ java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

What else can I do to help find out what's causing this?

@nirvdrum
Contributor

Not that it's a great answer, but if you could open a gist with a few of these strings that are problematic, I'd be happy to check them out. I'll try a few Java versions, too.

@marshalium
Contributor Author

In this gist there are 12 base64 encoded strings that have this problem. And it also includes a script I created to randomly create strings and run encode on them until it finds one that hangs.

https://gist.github.com/marshalium/e6bf93b06949890c695d
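For readers who cannot open the gist, here is a rough sketch of what such a random search could look like. It is not the script from the gist: the method names, the 5-second limit, and the default sizes are all assumptions made for illustration. It also assumes that `Timeout` can interrupt the stuck call, which may not hold when the loop is spinning inside Java code on JRuby; in that case an external watchdog process would be needed instead.

```ruby
require 'timeout'
require 'base64'
require 'securerandom'

# Hypothetical helper: returns true if encode does not finish within the
# limit, false if it returns normally.
def hangs_on_encode?(bytes, limit_seconds = 5)
  str = bytes.dup.force_encoding("UTF-8")
  Timeout.timeout(limit_seconds) do
    str.encode("UTF-16", :undef => :replace, :invalid => :replace, :replace => '')
  end
  false
rescue Timeout::Error
  true
end

# Hypothetical search loop: try random byte strings until one hangs, and
# return it base64-encoded so it can be pasted into a report. Returns nil
# if no candidate hung within the given number of attempts.
def find_hanging_input(attempts = 100, size = 1965)
  attempts.times do
    candidate = SecureRandom.random_bytes(size)
    return Base64.encode64(candidate) if hangs_on_encode?(candidate)
  end
  nil
end
```

On a runtime that is not affected, `find_hanging_input` should simply return `nil`.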

@headius
Member

headius commented Apr 29, 2015

I could not get master to fail in any case, but this isn't surprising...the transcoder is now identical to MRI's.

JRuby 1.7 worked ok on Java 8u40, but on Java 7u67 I was able to reproduce your results with both your original script and on the longer-running random search.

So it seems there may be a Java bug here. I will investigate a bit to see if I can improve my transcoder to avoid this problem.

headius added a commit that referenced this issue May 5, 2015
Java 7 appears to raise different errors for some cases, and these
cases were not handled in the encoder loop. As a result, they
could trigger an infinite loop on bad input. This appeared to be
in the form of cleaved UTF-16 surrogate pairs leading to underflow
where in Java 8 those pairs do not get cleaved.

Fixes #2856.
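The commit message above describes a conversion loop that spun forever because some result codes were never handled. The sketch below is not the JRuby fix, which lives in the Java `CharsetTranscoder` class. It only illustrates the same general idea in Ruby using `Encoding::Converter#primitive_convert`: every result is either handled explicitly or treated as an error, so an unexpected result cannot cause silent looping. The method name and the choice of UTF-16BE are assumptions.

```ruby
# Hypothetical helper: transcode while dropping bad input, with every
# converter result accounted for.
def transcode_dropping_bad_bytes(bytes, from = "UTF-8", to = "UTF-16BE")
  converter = Encoding::Converter.new(from, to)
  source = bytes.dup.force_encoding(from)
  output = String.new

  loop do
    result = converter.primitive_convert(source, output)
    case result
    when :finished
      break
    when :invalid_byte_sequence, :undefined_conversion, :incomplete_input
      # The offending bytes have already been consumed by the converter,
      # so calling it again resumes after them.
      next
    else
      # Fail loudly instead of retrying on a result we do not understand.
      raise "unexpected converter result: #{result.inspect}"
    end
  end

  output.force_encoding(to)
end
```

The `else` branch is the important part: a loop that retries on a result it does not recognise is the kind of loop that can fail to terminate.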
@headius headius closed this as completed May 5, 2015
@headius
Member

headius commented May 5, 2015

I've fixed the CharsetTranscoder so it should properly handle bad input as well as Java 7's unusual errors.

headius added a commit that referenced this issue May 5, 2015
@marshalium
Contributor Author

Awesome. Thanks for taking the time to fix this!
