Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding issue after v1.7.6 #2419

Closed
mrbrdo opened this issue Jan 4, 2015 · 6 comments
Closed

Encoding issue after v1.7.6 #2419

mrbrdo opened this issue Jan 4, 2015 · 6 comments

Comments

@mrbrdo
Copy link
Contributor

mrbrdo commented Jan 4, 2015

So I've been having problems with encodings after upgrading from JRuby 1.7.6 (I think the next version I tried after that was 1.7.8, but surely 1.7.11).

Reproduce in 1.7.18:

require 'nokogiri'
require 'open-uri'

Nokogiri::HTML(open("https://www.youtube.com/results?search_query=bears+in+exile+guppy+gmma+indicator+part+2")).to_html
.rvm/gems/jruby-1.7.18/gems/nokogiri-1.6.5-java/lib/nokogiri/xml/node.rb:834:in `native_write_to':
""\xC4\x8D"" from UTF-8 to US-ASCII (Encoding::UndefinedConversionError)

It works fine with 1.7.6 (also works on MRI 2.1.2). I'm not sure if 1.7.11 had the exact same issue, I just know that I had issues with encodings.
Note: .to_html is crucial to trigger the error.

@mrbrdo mrbrdo changed the title Encoding issue since 1.7.6 Encoding issue after v1.7.6 Jan 4, 2015
@cheald
Copy link
Contributor

cheald commented Jan 4, 2015

I've reproduced this on 1.7.18, but it works as intended on HEAD as of fb9199f.

@headius
Copy link
Member

headius commented Jan 4, 2015

This is an ugly one to fix in 1.7.

The logic for StringIO#write in MRI calls rb_str_conv_env, which calls rb_str_conv_enc_opts, which calls rb_econv_convert. Given the result from rb_econv_convert, it either increases the buffer size for the target buffer (if the result is "destination buffer full"), or it returns the newly-transcoded buffer (if the result is "finished"), or it returns the original string and ignores the result altogether (all other results, including errors). This results in behavior that seems incorrect on MRI, but which raises an error on JRuby:

Code:

sio = StringIO.new('123'.force_encoding('US-ASCII'))
p sio.string.encoding
sio.write('øøø')
p sio.string
p sio.string.encoding

Output on MRI:

#<Encoding:US-ASCII>
"\xC3\xB8\xC3\xB8\xC3\xB8"
#<Encoding:US-ASCII>

Clearly this string is no longer US-ASCII because of the multibyte UTF-8 characters, but this is how MRI handles the situation.

The basic problem is that our transcoder is set up to always raise an error if there's a result from transcoding. Our StringIO#write calls roughly equivalent logic in Transcoder.strConvEnc/Opts, but the logic differs in two important ways:

  • It never sees the result of the encoding and therefore does not fall back on the input string as in MRI.
  • The downstream code it calls in CharsetTranscoder always raises an error for failed conversion.

The challenge of fixing this is that most of the transcoding consumers in our codebase depend on the transcoder raising errors on its own. I tried to just remove the throw lastError in CharsetTranscoder.transcode, but many tests failed because they no longer errored.

The safest way to fix this would be to make a separate path for our strConvEncOpts that handles the error appropriately outside CharsetTranscoder, and make sure all paths using strConvEncOpts are using it properly.

headius added a commit that referenced this issue Jan 4, 2015
When there's an error from encoding, MRI's version of this
function just returns the incoming bytes as its result. Ours used
a path that always raises an error when transcoding fails. The new
version uses a path more in line with MRI that requires an out
buffer and returns a result, so we can handle it appropriately.

Fixes #2419.
@headius
Copy link
Member

headius commented Jan 4, 2015

I have pushed a fix to branch test-fix-2419 for you to try out. It appears to pass MRI tests and RubySpec locally.

@headius headius added this to the JRuby 1.7.19 milestone Jan 4, 2015
@headius headius self-assigned this Jan 4, 2015
@mrbrdo
Copy link
Contributor Author

mrbrdo commented Jan 5, 2015

I am testing it now, I'll know tomorrow if there are still any issues, will let you know.
Thanks!

@mrbrdo
Copy link
Contributor Author

mrbrdo commented Jan 5, 2015

So far so good.

@mrbrdo
Copy link
Contributor Author

mrbrdo commented Jan 6, 2015

@headius that worked, no more errors. You can close this.
Thanks!

@enebo enebo closed this as completed Jan 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants