ScriptEngine mangling the string encoding when writing to stdout #1617

Closed
hakanai opened this issue Apr 10, 2014 · 3 comments

hakanai commented Apr 10, 2014

I found that JRuby mangles the string encoding when writing to $stdout. I first hit the issue in v1.7.4, and it still exists in v1.7.11, which I just updated to in an attempt to fix it.

I'm still trying to put together a small test that reproduces the issue, which is a bit challenging because all our existing small tests for this sort of issue pass.

Filling in what I have so far, though, my script is pretty simple:

puts "Copyright \u00A9"

My writer is essentially a StringWriter. Actually it's a DocumentWriter, but that probably doesn't matter. In any case, it's a Writer, so I definitely don't expect to be subject to encoding issues like this.

What I actually get on the writer is:

Copyright ��

Adding #encoding: UTF-8 to the top makes no difference. This also only occurs when file.encoding is set to something other than UTF-8 (so for practical purposes, only Windows is affected).

What I can see in the debugger:

  • RubyIO#write():1408 has str set correctly. It then calls getByteList() and receives the correct UTF-8 bytes.
  • The bytes travel unharmed through ChannelStream, ChannelDescriptor, Channels$WritableByteChannelImpl, arriving at a PrintStream.
  • This PrintStream contains a WriterOutputStream with the encoding set to US-ASCII.

So somewhere in JRuby, a WriterOutputStream is being created with the wrong encoding, thus mangling my bytes on the way back to characters.
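
The mangling described above can be reproduced with the JDK alone. This is a minimal sketch, not JRuby code: the class and method names are made up for illustration, and it assumes the bridge effectively decodes the UTF-8 bytes as US-ASCII, as the debugger suggested.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MangledDecodeDemo {

    // Encode as UTF-8 (what RubyIO hands over), then decode with the given
    // charset (what the bridge back to the Writer appears to do).
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String original = "Copyright \u00A9";

        // U+00A9 is two bytes in UTF-8 (0xC2 0xA9). Neither byte is valid
        // US-ASCII, so each one becomes the replacement character U+FFFD.
        String wrong = roundTrip(original, StandardCharsets.US_ASCII);
        String right = roundTrip(original, StandardCharsets.UTF_8);

        System.out.println(wrong.equals("Copyright \uFFFD\uFFFD"));
        System.out.println(right.equals(original));
    }
}
```

The two replacement characters match the two garbled characters seen on the writer.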

Edit:

The issue seems to be that Utils.getRubyIO is creating the WriterOutputStream without specifying the encoding. So even though the correct bytes are written to the OutputStream, this writer then corrupts them on the way back to the Writer I passed in.
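
To illustrate why passing the encoding matters, here is a rough stand-in for an OutputStream-to-Writer bridge that takes the charset explicitly. It is not JRuby's WriterOutputStream; DecodingOutputStream and pipe are hypothetical names, and the sketch only assumes that the bytes arriving at the bridge are UTF-8.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetBridge {

    /** Hypothetical bridge: decodes incoming bytes with a caller-chosen charset. */
    static final class DecodingOutputStream extends OutputStream {
        private final Writer target;
        private final CharsetDecoder decoder;
        private final ByteBuffer pending = ByteBuffer.allocate(8);
        private final CharBuffer decoded = CharBuffer.allocate(8);

        DecodingOutputStream(Writer target, Charset charset) {
            this.target = target;
            this.decoder = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
        }

        @Override
        public void write(int b) throws IOException {
            // Keep incomplete multi-byte sequences in 'pending' until the
            // remaining bytes arrive, then pass the decoded chars on.
            pending.put((byte) b);
            pending.flip();
            decoder.decode(pending, decoded, false);
            pending.compact();
            decoded.flip();
            target.append(decoded);
            decoded.clear();
        }

        @Override
        public void flush() throws IOException {
            target.flush();
        }
        // A trailing incomplete sequence is silently dropped on close;
        // acceptable for a sketch, not for production code.
    }

    // Writes UTF-8 bytes through the bridge and returns what the Writer received.
    static String pipe(String text, Charset bridgeCharset) throws IOException {
        StringWriter writer = new StringWriter();
        try (OutputStream out = new DecodingOutputStream(writer, bridgeCharset)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "Copyright \u00A9";
        System.out.println(pipe(text, StandardCharsets.UTF_8).equals(text));
        System.out.println(pipe(text, StandardCharsets.US_ASCII).equals(text));
    }
}
```

With the charset matching the bytes, the Writer gets the original string back; with a mismatched charset the same bridge corrupts it.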

Test cases were simpler than expected.

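// Imports are assumed (JUnit 4 and Hamcrest); the snippet as posted omitted them.
import java.io.StringWriter;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.SimpleScriptContext;
import org.junit.Test;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.is;
import static org.junit.Assume.assumeTrue;

// The class name is a placeholder so the methods compile on their own.
public class ScriptEngineEncodingTest {
    // The copyright sign written as raw UTF-8 bytes in the Ruby source.
    @Test
    public void test_Utf8ViaXEscapes() throws Exception {
        assumeTrue(!defaultCharsetCanEncode("\u00A9"));
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = new SimpleScriptContext();
        StringWriter writer = new StringWriter();
        context.setWriter(writer);
        engine.eval("#encoding: utf-8\n puts \"\\xC2\\xA9\"", context);
        assertThat(writer.toString().trim(), is(equalTo("\u00A9")));
    }

    // The same character written as a Ruby \u escape.
    @Test
    public void test_UnicodeViaUEscape() throws Exception {
        assumeTrue(!defaultCharsetCanEncode("\u00A9"));
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = new SimpleScriptContext();
        StringWriter writer = new StringWriter();
        context.setWriter(writer);
        engine.eval("#encoding: utf-8\n puts \"\\u00A9\"", context);
        assertThat(writer.toString().trim(), is(equalTo("\u00A9")));
    }

    // True if the string survives an encode/decode round trip in the default charset.
    private static boolean defaultCharsetCanEncode(String str) {
        Charset charset = Charset.defaultCharset();
        ByteBuffer encoded = charset.encode(str);
        CharBuffer decoded = charset.decode(encoded);
        return str.equals(decoded.toString());
    }
}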
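
As a side note, the round-trip helper above could probably be replaced with the JDK's own encodability check. A small sketch, with a made-up class name, that takes the charset as a parameter so it can be tried without changing file.encoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CanEncodeCheck {

    // Asks the charset's encoder directly whether it can represent the string.
    static boolean canEncode(Charset charset, String str) {
        return charset.canEncode() && charset.newEncoder().canEncode(str);
    }

    public static void main(String[] args) {
        String copyright = "\u00A9";
        System.out.println("US-ASCII: " + canEncode(StandardCharsets.US_ASCII, copyright));
        System.out.println("UTF-8:    " + canEncode(StandardCharsets.UTF_8, copyright));
        System.out.println("default:  " + canEncode(Charset.defaultCharset(), copyright));
    }
}
```

In the tests this would be called with Charset.defaultCharset(), which keeps the same skip condition as the helper above.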

hakanai commented Apr 10, 2014

Now that I have looked at our existing tests carefully, I see that the passing test we had was really an expected-failing test. So there might be a report about this, perhaps even from me, on one of the trackers already.

kares added this to the Won't Fix milestone Jun 12, 2018

kares commented Jun 12, 2018

Closing since it's reported against 1.7, which is EOL. If you feel like this is an issue in the latest 9K, let us know.

kares closed this as completed Jun 12, 2018

hakanai commented Jun 13, 2018

This eventually got re-reported (I had forgotten about this one somehow!) as #2403 and fixed.

Unicode is still broken for ScriptContainer, but we don't use it.
