ScriptEngine mangling the string encoding when writing to stdout #1617

hakanai · 2014-04-10T00:06:40Z

JRuby mangles the string encoding when writing to $stdout. I discovered the issue in v1.7.4, and it still exists in v1.7.11, which I just updated to in an attempt to fix it.

I'm still trying to put together a small test that reproduces the issue, which is a bit challenging because all our existing small tests for this sort of issue pass.

Filling in what I have so far, though, my script is pretty simple:

puts "Copyright \u00A9"

My writer is essentially a StringWriter. Actually it's a DocumentWriter, but that probably doesn't matter. In any case, it's a Writer, so I definitely don't expect to be subject to encoding issues like this.

What I actually get on the writer is:

Copyright ��

Adding #encoding: UTF-8 to the top of the script makes no difference. The problem only occurs when file.encoding is set to something other than UTF-8 (so for practical purposes, only Windows is affected).

What I can see in the debugger:

RubyIO#write():1408 has str set correctly. It then calls getByteList() and receives the correct UTF-8 bytes.
The bytes travel unharmed through ChannelStream, ChannelDescriptor, Channels$WritableByteChannelImpl, arriving at a PrintStream.
This PrintStream contains a WriterOutputStream with the encoding set to US-ASCII.

So somewhere in JRuby, a WriterOutputStream is being created with the wrong encoding, thus mangling my bytes on the way back to characters.
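To illustrate the kind of mangling described above, here is a minimal sketch using only the JDK (no JRuby involved). The class and method names are made up for illustration. It encodes the string as UTF-8 and then decodes those same bytes once as UTF-8 and once as US-ASCII, which is roughly what happens when the stream-to-writer bridge uses the wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeMismatchDemo {

    // Encodes the text as UTF-8, then decodes the resulting bytes
    // with the given charset (hypothetical helper for this sketch).
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String original = "Copyright \u00A9";

        String viaUtf8 = roundTrip(original, StandardCharsets.UTF_8);
        String viaAscii = roundTrip(original, StandardCharsets.US_ASCII);

        // Decoding with the matching charset restores the text.
        System.out.println("UTF-8 round trip intact:    " + original.equals(viaUtf8));
        // Decoding the two UTF-8 bytes of U+00A9 as US-ASCII yields
        // replacement characters instead of the copyright sign.
        System.out.println("US-ASCII round trip intact: " + original.equals(viaAscii));
    }
}
```

The US-ASCII result contains one replacement character per non-ASCII byte, which matches the two garbled characters seen on the writer.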

Edit:

The issue seems to be that Utils.getRubyIO is creating the WriterOutputStream without specifying the encoding. So even though the correct bytes are written to the OutputStream, this writer then corrupts them on the way back to the Writer I passed in.
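As a rough sketch of what specifying the encoding on such a bridge could look like, the following uses only JDK classes. DecodingOutputStream and decodeChunks are hypothetical stand-ins written for this example; they are not JRuby's WriterOutputStream or its actual fix. The point is only that the OutputStream-to-Writer adapter takes the charset explicitly instead of falling back to a default.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetBridge {

    /** Hypothetical adapter: decodes written bytes with an explicit charset. */
    static final class DecodingOutputStream extends OutputStream {
        private final Writer target;
        private final CharsetDecoder decoder;
        // Bytes received but not yet decoded; kept in "write mode" between calls.
        private ByteBuffer pending = ByteBuffer.allocate(16);

        DecodingOutputStream(Writer target, Charset charset) {
            this.target = target;
            this.decoder = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
        }

        @Override
        public void write(int b) throws IOException {
            write(new byte[] {(byte) b}, 0, 1);
        }

        @Override
        public void write(byte[] buf, int off, int len) throws IOException {
            // Grow the pending buffer if the new bytes do not fit.
            if (pending.remaining() < len) {
                ByteBuffer bigger = ByteBuffer.allocate(pending.position() + len);
                pending.flip();
                bigger.put(pending);
                pending = bigger;
            }
            pending.put(buf, off, len);
            pending.flip();

            int maxChars = (int) Math.ceil(pending.remaining() * decoder.maxCharsPerByte()) + 1;
            CharBuffer out = CharBuffer.allocate(maxChars);
            // endOfInput=false: an incomplete trailing sequence stays in 'pending'.
            decoder.decode(pending, out, false);
            pending.compact();

            out.flip();
            target.append(out);
        }

        @Override
        public void flush() throws IOException {
            target.flush();
        }
        // Note: this sketch silently drops an incomplete sequence left at close().
    }

    // Writes each chunk separately through the adapter and returns the decoded text.
    static String decodeChunks(Charset charset, byte[]... chunks) {
        StringWriter writer = new StringWriter();
        try (OutputStream out = new DecodingOutputStream(writer, charset)) {
            for (byte[] chunk : chunks) {
                out.write(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return writer.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00A9, deliberately split across two writes.
        byte[] first = {(byte) 0xC2};
        byte[] second = {(byte) 0xA9};
        System.out.println("\u00A9".equals(decodeChunks(StandardCharsets.UTF_8, first, second)));
        System.out.println("\u00A9".equals(decodeChunks(StandardCharsets.US_ASCII, first, second)));
    }
}
```

A streaming CharsetDecoder is used here, rather than decoding each write on its own, so that a multi-byte character split across two writes still decodes correctly.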

Test cases were simpler than expected.

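// Assumed imports (JUnit 4 and Hamcrest)
import static org.hamcrest.CoreMatchers.equalTo;
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.junit.Assume.assumeTrue;

import java.io.StringWriter;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.SimpleScriptContext;
import org.junit.Test;

@Test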
public void test_Utf8ViaXEscapes() throws Exception {
    assumeTrue(!defaultCharsetCanEncode("\u00A9"));
    ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
    ScriptContext context = new SimpleScriptContext();
    StringWriter writer = new StringWriter();
    context.setWriter(writer);
    engine.eval("#encoding: utf-8\n puts \"\\xC2\\xA9\"", context);
    assertThat(writer.toString().trim(), is(equalTo("\u00A9")));
}

@Test
public void test_UnicodeViaUEscape() throws Exception {
    assumeTrue(!defaultCharsetCanEncode("\u00A9"));
    ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
    ScriptContext context = new SimpleScriptContext();
    StringWriter writer = new StringWriter();
    context.setWriter(writer);
    engine.eval("#encoding: utf-8\n puts \"\\u00A9\"", context);
    assertThat(writer.toString().trim(), is(equalTo("\u00A9")));
}

private static boolean defaultCharsetCanEncode(String str) {
    Charset charset = Charset.defaultCharset();
    ByteBuffer encoded = charset.encode(str);
    CharBuffer decoded = charset.decode(encoded);
    return str.equals(decoded.toString());
}


hakanai · 2014-04-10T13:46:25Z

Now that I have looked at our existing tests carefully, I see that the passing test we had was really an expected-failing test. So there might be a report about this, perhaps even from me, on one of the trackers already.

kares · 2018-06-12T16:13:08Z

closing since it's reported against 1.7 which is EOL ... if you feel like this is an issue in latest 9K let us know

hakanai · 2018-06-13T00:47:17Z

This eventually got re-reported (I had forgotten about this one somehow!) as #2403 and fixed.

Unicode is still broken for ScriptContainer, but we don't use it.

kares added this to the Won't Fix milestone Jun 12, 2018

kares added the JRuby 1.7.x label Jun 12, 2018

kares closed this as completed Jun 12, 2018
