Invalid byte sequence in UTF-8 for StringIO objects #5309

Closed
prashantvithani opened this issue Sep 11, 2018 · 8 comments

@prashantvithani
Contributor

Environment

Provide at least:

  • jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 10.0.2+13-Ubuntu-1ubuntu0.18.04.1 on 10.0.2+13-Ubuntu-1ubuntu0.18.04.1 +indy +jit [linux-x86_64]

Expected Behavior

In MRI,

file = URI.parse(url).open gives a StringIO object whose external_encoding is set to ASCII-8BIT. Doing file.read.match(//) works as expected.

Actual Behavior

In JRuby

URI.parse(url).open gives a StringIO object whose external_encoding is set to UTF-8. Doing file.read.match(//) raises ArgumentError: invalid byte sequence in UTF-8.

The actual encoding of the downloaded content is ASCII-8BIT. If we set the encoding of the StringIO object to ASCII-8BIT and do file.read.match(//), it works fine again.
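
The mislabeling described above can be looked at without any network access. The following is a minimal sketch, not the reporter's script: it assumes a few hand-picked bytes (the start of a JPEG header) stand in for the downloaded image, and it builds the UTF-8-labeled StringIO by hand rather than through open-uri. It only checks whether the bytes are valid in the encoding the StringIO reports.

```ruby
require 'stringio'

# Illustrative stand-in for downloaded binary data (start of a JPEG header).
# The byte 0xFF can never appear in well-formed UTF-8.
raw = "\xFF\xD8\xFF\xE0".b

# Simulate the reported state: binary content held by a StringIO
# whose string is labeled UTF-8.
io = StringIO.new(raw.dup.force_encoding(Encoding::UTF_8))

mislabeled = io.read
puts mislabeled.encoding == Encoding::UTF_8   # labeled UTF-8 ...
puts mislabeled.valid_encoding?               # ... but the bytes are not valid UTF-8

# The workaround from the report: rewind and relabel as ASCII-8BIT.
io.rewind
io.set_encoding(Encoding::ASCII_8BIT)

relabeled = io.read
puts relabeled.encoding == Encoding::ASCII_8BIT
puts relabeled.valid_encoding?                # any byte sequence is valid as binary
```

The bytes themselves never change between the two reads; only the encoding label does, which is why relabeling is enough to make regexp matching accept the string.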

@enebo
Member

enebo commented Sep 11, 2018

@prashantvithani Any chance you can make a script showing this problem? open is a private method which expects an argument. I am not sure how to reproduce this.

@ahorek
Contributor

ahorek commented Sep 11, 2018

it looks like open-uri?

require 'open-uri'
url = "http://xxx"
open(url).string.encoding

@prashantvithani
Contributor Author

prashantvithani commented Sep 12, 2018

require 'open-uri'
url = "https://lh3.googleusercontent.com/hGq42jW-gwJc8hWmzbseAsvnEKMSe9ukj8drAp0M8T0NY_Ya4ibxERW5eICoZt0WbpQnsF4=s64"
file = open(url)
if file.is_a?(StringIO)
  begin
    puts file.external_encoding # #<Encoding:UTF-8>
    image = file.read
    image.match(//) # Throws Invalid byte sequence in UTF-8
  rescue => e
    puts e.message
    file.rewind
    file.set_encoding(Encoding::ASCII_8BIT)
    image = file.read 
    image.match(//) # Works this time
  end
end

@enebo This script should help in reproducing. The thing to note here is that the error occurs only if the return type of open(url) is StringIO. I believe the content is saved to a Tempfile or a StringIO depending on the size of the downloaded content. It works fine for larger files, which get saved through Tempfile. As shown in the script, it works after setting the encoding to ASCII-8BIT.
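
Because the result may be either a StringIO or a Tempfile, a caller that wants raw bytes can apply the same workaround to both. The helper below is a hypothetical sketch (the name read_binary is not part of open-uri); it relies only on rewind, set_encoding, and read, which both StringIO and Tempfile respond to. Note that on Ruby 3.0 and later the download itself would need URI.open(url) instead of open(url).

```ruby
require 'stringio'

# Hypothetical helper, not part of open-uri: read the whole body as binary,
# regardless of whether the IO-like object is a StringIO or a Tempfile.
def read_binary(io)
  io.rewind                              # start from the beginning
  io.set_encoding(Encoding::ASCII_8BIT)  # relabel; the bytes are untouched
  io.read
end

# Usage with a hand-built StringIO standing in for a small download.
small = StringIO.new("\xFF\xD8\xFF".b.force_encoding(Encoding::UTF_8))
data = read_binary(small)
puts data.valid_encoding?
puts data.bytesize
```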

@headius
Member

headius commented Sep 12, 2018

@prashantvithani Confirmed!

The full trace for the error follows (irrelevant bits removed):

[] ~/projects/jruby $ jruby -Xbacktrace.style=full blah.rb
UTF-8
invalid byte sequence in UTF-8
ArgumentError: invalid byte sequence in UTF-8
...
     newArgumentError at org/jruby/Ruby.java:3613
      prepareEncoding at org/jruby/RubyRegexp.java:437
       preparePattern at org/jruby/RubyRegexp.java:460
               search at org/jruby/RubyRegexp.java:1293
             matchPos at org/jruby/RubyRegexp.java:1195
          matchCommon at org/jruby/RubyRegexp.java:1163
              match_m at org/jruby/RubyRegexp.java:1129
                 call at org/jruby/RubyRegexp$INVOKER$i$match_m.gen:-1
                 call at org/jruby/internal/runtime/methods/JavaMethod.java:399
         cacheAndCall at org/jruby/runtime/callsite/CachingCallSite.java:344
                 call at org/jruby/runtime/callsite/CachingCallSite.java:170
              match19 at org/jruby/RubyString.java:1679
                 call at org/jruby/RubyString$INVOKER$i$match19.gen:-1
                 call at org/jruby/internal/runtime/methods/DynamicMethod.java:202
         cacheAndCall at org/jruby/runtime/callsite/CachingCallSite.java:344
                 call at org/jruby/runtime/callsite/CachingCallSite.java:170
  invokeOther10:match at blah.rb:8
               <main> at blah.rb:8

@headius
Member

headius commented Sep 12, 2018

There's a chance this has been reported in MRI. Our StringIO attempts to mimic theirs closely, and our open-uri is identical. Failing that, we should audit their C code for StringIO and see if there's encoding negotiation logic we're missing.

Just to clarify...the data coming back is binary but StringIO is still calling it UTF-8. The workaround is obviously to force it to BINARY or ASCII-8BIT encoding, as @prashantvithani did here.

@lopex
Member

lopex commented Sep 12, 2018

Isn't that a case where MRI tends to mark everything as ASCII when there's only 7-bit content, like in #5086?

@enebo
Member

enebo commented Sep 12, 2018

So I figured something out but I am done for today. The StringIO is extended with a module, Meta, which does some meta handling of HTTP headers and ends up calling @io.string.force_encoding. We do this and @io.string is ASCII-8BIT, but @io.ptr.encoding is still set to UTF-8. I half wonder if string.encoding and ptr.encoding in MRI share the same pointer. Otherwise our impls are nearly identical (not in a way which should affect this bug).

I've nearly figured this out though.

enebo added this to the JRuby 9.1.18.0 milestone Sep 13, 2018
enebo closed this as completed in 074ca31 Sep 13, 2018

@enebo
Member

enebo commented Sep 13, 2018

Will add a spec for this behavior as it is very unintuitive that io.string.force_encoding somehow will change the result of calling io.external_encoding.
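
The interaction mentioned here can be probed with a few lines. This is only an exploratory sketch, not the spec that was added: it prints what io.external_encoding reports before and after io.string.force_encoding, and deliberately states no expected value for the second call, since the result may depend on the Ruby implementation and the stringio version in use.

```ruby
require 'stringio'

# +"" yields an unfrozen empty string for the StringIO to own.
io = StringIO.new(+"")
io.write("abc")

before = io.external_encoding

# Relabel the underlying string directly, as the open-uri Meta handling
# is described as doing in this thread.
io.string.force_encoding(Encoding::ASCII_8BIT)

after = io.external_encoding

puts "string encoding:   #{io.string.encoding}"
puts "external_encoding: #{before} -> #{after}"
```

If the two external_encoding values differ, the StringIO is deriving its encoding from the string it wraps; if they match, it is tracking its own copy.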
