Unicode characters are lost when embedding JRuby even if the calling code performs no conversion to byte[] #2403

hakanai · 2015-01-01T22:20:13Z

The following tests fail with -Dfile.encoding=windows-1252 but pass with -Dfile.encoding=UTF-8 :

import java.io.StringWriter;
import java.io.Writer;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jruby.embed.ScriptingContainer;
import org.junit.Test;

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;

public class TestUnicodeCharacters {
    String orig = "\u6625\u304C\u6765\u305F\u3002";
    String scriptlet = "#encoding: UTF-8\n" +
                       "str = \"" + orig + "\"\n" +
                       "puts str\n" +
                       "str\n";
    Writer writer = new StringWriter();

    @Test
    public void testCharacterEncodingViaScriptEngine() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = engine.getContext();
        context.setWriter(writer);
        String result = (String) engine.eval(scriptlet, context);
        checkValues(result);
    }

    @Test
    public void testCharacterEncodingViaScriptContainer() throws Exception {
        ScriptingContainer container = new ScriptingContainer();
        container.setWriter(writer);
        String result = (String) container.runScriptlet(scriptlet);
        checkValues(result);
    }

    private void checkValues(String returnedResult) {
        assertThat(returnedResult, is(equalTo(orig)));
        assertThat(writer.toString().trim(), is(equalTo(orig)));
    }
}

Most likely, the failure output you get will be confusing as well:

java.lang.AssertionError: 
Expected: is "?????"
     but: was "?????"

The "Expected" line is "?????" because Java is encoding the output as windows-1252.

The "but" line is "?????" because JRuby has encoded the strings to windows-1252 internally and then written and returned the question marks. I find it particularly odd that it would do this, both because the script is passed as a string directly from Java in the first place, but also because the script itself clearly says the strings are UTF-8.

This was JRUBY-4890 on the old tracker. The script had to be updated a bit to have a #encoding: UTF-8 directive because JRuby now complains if you omit it.

The text was updated successfully, but these errors were encountered:

headius · 2015-02-24T21:25:27Z

I added this test on JRuby 1.7 HEAD and master HEAD and it passed in both places. I also confirmed that file.encoding defaults to Cp1252. Am I doing something wrong?

hakanai · 2015-02-25T08:59:26Z

Issue still occurs in 1.7.13 (what I had on my home computer when I got home), 1.7.19 (downloaded now) and 9.0.0.0.pre1 (downloaded now). I verified all three on OSX and specified the -Dfile.encoding system property on the command-line.

The full command-lines I used:

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-1.7.13.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-1.7.19.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-9.0.0.0.pre1.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Of course in all cases, removing the file.encoding parameter makes the test pass.

Admittedly I had to fix compilation of the test itself as well, so here it is for completeness:

import java.io.Writer;
import java.io.StringWriter;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jruby.embed.ScriptingContainer;
import org.junit.Test;

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;

public class TestUnicodeCharacters {
    String orig = "\u6625\u304C\u6765\u305F\u3002";
    String scriptlet = "#encoding: UTF-8\n" +
                       "str = \"" + orig + "\"\n" +
                       "puts str\n" +
                       "str\n";
    Writer writer = new StringWriter();

    @Test
    public void testCharacterEncodingViaScriptEngine() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = engine.getContext();
        context.setWriter(writer);
        String result = (String) engine.eval(scriptlet, context);
        checkValues(result);
    }

    @Test
    public void testCharacterEncodingViaScriptContainer() throws Exception {
        ScriptingContainer container = new ScriptingContainer();
        container.setWriter(writer);
        String result = (String) container.runScriptlet(scriptlet);
        checkValues(result);
    }

    private void checkValues(String returnedResult) {
        assertThat(returnedResult, is(equalTo(orig)));
        assertThat(writer.toString().trim(), is(equalTo(orig)));
    }
}

hakanai · 2015-02-25T11:28:42Z

Managed to build trunk, same deal:

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:junit-4.12.jar:jruby-complete-9.0.0.0-SNAPSHOT.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters
JUnit version 4.12
.E.E
Time: 2.434
There were 2 failures:
1) testCharacterEncodingViaScriptContainer(TestUnicodeCharacters)
java.lang.AssertionError: 
Expected: is "?????"
     but: was "?????"
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
...

hakanai · 2016-03-14T03:44:05Z

Verifying that this not only still occurs on 9.0.5.0, but now it occurs for me on OSX under IDEA 15, so one of those or something else updated and now it occurs in a new situation. Maybe this means others can now confirm the issue, so perhaps the chance of getting a fix will increase?

Current content of our test case:

@Test
public void testUnicodeStringOutputDirectlyViaScriptEngine() throws Exception
{
    String orig = "\u30C6\u30B9\u30C8";

    ScriptEngine engine = new JRubyEngineFactory().getScriptEngine();

    ScriptContext context = new SimpleScriptContext();
    StringWriter writer = new StringWriter();
    context.setWriter(writer);

    Object returnedResult = engine.eval("#encoding: utf-8\n" +
                                        "str = \"" + orig + "\" ; puts str ; str\n", context);

    assertEquals("Wrong result returned", orig, returnedResult);
    assertEquals("Wrong result printed", orig, writer.toString().trim());
}

The failure looks like this:

org.junit.ComparisonFailure: Wrong result returned 
Expected :???
Actual   :???

The test is apparently running with -Dfile.encoding=US-ASCII so IDEA refuses to show the text. If I change it to UTF-8 then it passes, so I can't get sensible output which shows the expected text. But you can look at the test case and see that I'm not expecting question marks, at least.

If you comment out the "Wrong result returned" check, it fails on "Wrong result printed".

hakanai · 2016-03-14T03:58:59Z

In the debugger, in EmbedRubyRuntimeAdapterImpl#runParser, I have:

* input = "#encoding: utf-8\nstr = "テスト" ; puts str ; str\n"
* ...
* node = {org.jruby.ast.RootNode@6656} "(RootNode 1, (BlockNode 1, (DAsgnNode:str 1, (StrNode 1)), (FCallNode:puts 1, (ArrayNode 1, (DVarNode:str 1)), null), (DVarNode:str 1)))"
    * scope = {org.jruby.runtime.scope.ManyVarsDynamicScope@6653} "Static Type[819314039]: block [str=null]\n  Static Type[650867604]: local []"
    * staticScope = {org.jruby.parser.StaticScope@6665} "StaticScope(EVAL):[str]"
    * bodyNode = {org.jruby.ast.BlockNode@6666} "(BlockNode 1, (DAsgnNode:str 1, (StrNode 1)), (FCallNode:puts 1, (ArrayNode 1, (DVarNode:str 1)), null), (DVarNode:str 1))"
        * list = {org.jruby.ast.Node[3]@6672} 
            * 0 = {org.jruby.ast.DAsgnNode@6675} "(DAsgnNode:str 1, (StrNode 1))"
                * name = "str"
                * location = 0
                * valueNode = {org.jruby.ast.StrNode@6692} "(StrNode 1)"
                    * value = {org.jruby.util.ByteList@6696} "???"
                        * bytes = {byte[15]@6699} 
                            * 0 = 63
                            * 1 = 63
                            * 2 = 63
                            * 3 = 0
                            * 4 = 0
                            * 5 = 0
                            * 6 = 0
                            * 7 = 0
                            * 8 = 0
                            * 9 = 0
                            * 10 = 0
                            * 11 = 0
                            * 12 = 0
                            * 13 = 0
                            * 14 = 0
                        * begin = 0
                        * realSize = 3
                        * encoding = {org.jcodings.specific.UTF8Encoding@6700} "UTF-8"
                        * hash = 0
                        * stringValue = "???"

So it has already been destroyed at parse-time, explaining why both the result and the output are wrong.

hakanai · 2016-03-14T04:02:45Z

Ruby.java line 2750 appears to be the culprit:

public Node parseEval(String content, String file, DynamicScope scope, int lineNumber) {
    addEvalParseToStats();
    return parser.parse(file, content.getBytes(), scope, new ParserConfiguration(this, lineNumber, false, false, config));
}

Call to String#getBytes() with no Charset parameter. Have you ever considered running the forbidden-apis tool over JRuby? It would have found this one. :)

Well, I can't figure out how this was supposed to work, but basically by the time getBytes() is called, the string is now "#encoding: utf-8\nstr = "???" ; puts str ; str\n", so its Unicode content was destroyed before the parser could even get to the line telling it not to do that.

headius · 2016-03-14T14:25:13Z

@trejkaz You may be right about the culprit here. Sorry this one slipped through the cracks.

We do use file.encoding implicitly to encode Java strings into Ruby strings, assuming that's what the user would want. But I think perhaps that's incorrect logic; the user will probably want Strings coming from Java to get encoded the same way multibyte strings read in via IO get encoded, and that generally means UTF-8.

headius · 2016-03-14T15:07:30Z

I have a fix for the bytes going in, but the output being returned is still getting mangled.

headius · 2016-03-14T15:26:17Z

Ok, two fixes for this one:

The getBytes @trejkaz pointed out was not encoding incoming Strings properly, and
The construction of a WriterOutputStream in org.jruby.embed.jsr223.Utils was not decoding output properly.

Fix coming. @trejkaz Can we incorporate your test case into our suite? If you have any more you'd like to add, we'd really appreciate it!

* Incoming scripts should be decoded from String to byte[] using default internal encoding, rather than trusting JDK's file.encoding to be appropriate. * Outgoing streams wrapping writers should decode strings based on current default internal encoding. Fixes #2403

hakanai · 2016-03-14T22:40:48Z

Feel free to include that test in the JRuby suite. It will be good for us as it stops a regression creeping in.

As far as others... I just had a look through all our tests and there is nothing else about encoding in our suite, other than a second nearly identical test which I'll probably remove now that I have found it.

Thanks for the quick fix! :D

headius added JRuby 1.7.x windows encoding labels Feb 24, 2015

headius added this to the JRuby 9.1.0.0 milestone Mar 14, 2016

headius closed this as completed in af965d1 Mar 14, 2016

headius self-assigned this Mar 14, 2016

hakanai mentioned this issue Jun 13, 2018

ScriptEngine mangling the string encoding when writing to stdout #1617

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

Unicode characters are lost when embedding JRuby even if the calling code performs no conversion to byte[] #2403

Unicode characters are lost when embedding JRuby even if the calling code performs no conversion to byte[] #2403

hakanai commented Jan 1, 2015

headius commented Feb 24, 2015

hakanai commented Feb 25, 2015

hakanai commented Feb 25, 2015

hakanai commented Mar 14, 2016

hakanai commented Mar 14, 2016

hakanai commented Mar 14, 2016

headius commented Mar 14, 2016

headius commented Mar 14, 2016

headius commented Mar 14, 2016

hakanai commented Mar 14, 2016

Unicode characters are lost when embedding JRuby even if the calling code performs no conversion to byte[] #2403

Unicode characters are lost when embedding JRuby even if the calling code performs no conversion to byte[] #2403

Comments

hakanai commented Jan 1, 2015

headius commented Feb 24, 2015

hakanai commented Feb 25, 2015

hakanai commented Feb 25, 2015

hakanai commented Mar 14, 2016

hakanai commented Mar 14, 2016

hakanai commented Mar 14, 2016

headius commented Mar 14, 2016

headius commented Mar 14, 2016

headius commented Mar 14, 2016

hakanai commented Mar 14, 2016