Invalid byte sequence in UTF-8 for StringIO objects #5309
Comments
@prashantvithani Any chance you can make a script showing this problem? open is a private method which expects an argument. I am not sure how to reproduce this.
It looks like open-uri?
require 'open-uri'

url = "https://lh3.googleusercontent.com/hGq42jW-gwJc8hWmzbseAsvnEKMSe9ukj8drAp0M8T0NY_Ya4ibxERW5eICoZt0WbpQnsF4=s64"
# Kernel#open handled URLs when this was written; on Ruby 3.0+ use URI.open(url).
file = open(url)
if file.is_a?(StringIO)
  begin
    puts file.external_encoding # UTF-8 on JRuby
    image = file.read
    image.match(//) # Throws Invalid byte sequence in UTF-8
  rescue => e
    puts e.message
    # Workaround: relabel the buffer as binary and read it again.
    file.rewind
    file.set_encoding(Encoding::ASCII_8BIT)
    image = file.read
    image.match(//) # Works this time
  end
end
@enebo This script should help in reproducing. The thing to note here is that the error occurs only if the return type of
@prashantvithani Confirmed! The full trace for the error follows (irrelevant bits removed):
There's a chance this has been reported in MRI. Our StringIO attempts to mimic theirs closely, and our open-uri is identical. Failing that, we should audit their C code for StringIO and see if there's encoding negotiation logic we're missing. Just to clarify: the data coming back is binary, but StringIO is still calling it UTF-8. The workaround is obviously to force it to
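The workaround from the reproduction script (rewind, relabel as binary, read again) can be tried without a network call. This is only a minimal sketch, not open-uri's code: the helper name read_as_binary and the sample bytes are hypothetical, and the StringIO is mislabeled as UTF-8 by hand to imitate what JRuby returned.

```ruby
require 'stringio'

# Hypothetical helper (not part of open-uri): re-read a StringIO as raw
# bytes, regardless of the encoding it currently reports.
def read_as_binary(io)
  io.rewind
  io.set_encoding(Encoding::ASCII_8BIT)
  io.read
end

# The first bytes of a JPEG, which are not valid UTF-8. Labeling them as
# UTF-8 by hand imitates the mislabeled StringIO described in this issue.
bytes = "\xFF\xD8\xFF\xE0".b
mislabeled = StringIO.new(bytes.dup.force_encoding(Encoding::UTF_8))

# The UTF-8 label does not fit the bytes.
puts mislabeled.read.valid_encoding?

image = read_as_binary(mislabeled)
puts image.encoding
puts image.valid_encoding?
image.match(//) # matching works once the data is labeled as binary
```

Relabeling does not change the bytes themselves, so the image data is untouched. Only the encoding tag that the regexp engine checks is different.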
Isn't that a case where MRI tends to mark everything as ASCII when there's only 7-bit content, like in #5086?
So I figured something out, but I am done for today. The StringIO is extended with a module, Meta, which does some meta handling of HTTP headers and ends up calling @io.string.force_encoding. We do this, and @io.string is ASCII-8BIT, but @io.ptr.encoding is still set to UTF-8. I half wonder whether string.encoding and ptr.encoding in MRI share the same pointer. Otherwise our impls are nearly identical (the differences should not affect this bug). I've nearly figured this out though.
Will add a spec for this behavior, as it is very unintuitive that io.string.force_encoding will somehow change the result of calling io.external_encoding.
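A rough sketch of what such a check could look at, under stated assumptions: the StringIO is built with a bare StringIO.new (assumed here to resemble how open-uri buffers a small response), and the helper name encoding_views is hypothetical. It prints both encodings rather than asserting one, because the value reported by external_encoding is exactly what differs between the implementations discussed in this thread.

```ruby
require 'stringio'

# Hypothetical helper: report the encoding of the backing string next to
# the encoding the StringIO itself claims.
def encoding_views(io)
  { string: io.string.encoding, external: io.external_encoding }
end

io = StringIO.new
io.write("\xFF\xD8\xFF".b) # a small binary payload

before = encoding_views(io)

# The same kind of call described above: only the backing string is relabeled.
io.string.force_encoding(Encoding::ASCII_8BIT)
after = encoding_views(io)

puts "before: #{before}"
puts "after:  #{after}"
# Per this thread, MRI reports ASCII-8BIT for :external at this point,
# while the affected JRuby build kept reporting UTF-8.
```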
Expected Behavior
In MRI, file = URI.parse(url).open gives a StringIO object, which has external_encoding set to ASCII-8BIT. Doing file.read.match(//) works fine as expected.
Actual Behavior
In JRuby, URI.parse(url).open gives a StringIO object, which has external_encoding set to UTF-8. Doing file.read.match(//) throws the error Invalid byte sequence in UTF-8.
The actual encoding of the downloaded content is ASCII-8BIT. If we set the encoding of the StringIO object to ASCII-8BIT and do file.read.match(//), it works fine again.