Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode characters are lost when embedding JRuby even if the calling code performs no conversion to byte[] #2403

Closed
hakanai opened this issue Jan 1, 2015 · 10 comments

Comments

@hakanai
Copy link

hakanai commented Jan 1, 2015

The following tests fail with -Dfile.encoding=windows-1252 but pass with -Dfile.encoding=UTF-8 :

import java.io.StringWriter;
import java.io.Writer;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jruby.embed.ScriptingContainer;
import org.junit.Test;

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;

public class TestUnicodeCharacters {
    String orig = "\u6625\u304C\u6765\u305F\u3002";
    String scriptlet = "#encoding: UTF-8\n" +
                       "str = \"" + orig + "\"\n" +
                       "puts str\n" +
                       "str\n";
    Writer writer = new StringWriter();

    @Test
    public void testCharacterEncodingViaScriptEngine() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = engine.getContext();
        context.setWriter(writer);
        String result = (String) engine.eval(scriptlet, context);
        checkValues(result);
    }

    @Test
    public void testCharacterEncodingViaScriptContainer() throws Exception {
        ScriptingContainer container = new ScriptingContainer();
        container.setWriter(writer);
        String result = (String) container.runScriptlet(scriptlet);
        checkValues(result);
    }

    private void checkValues(String returnedResult) {
        assertThat(returnedResult, is(equalTo(orig)));
        assertThat(writer.toString().trim(), is(equalTo(orig)));
    }
}

Most likely, the failure output you get will be confusing as well:

java.lang.AssertionError: 
Expected: is "?????"
     but: was "?????"

The "Expected" line is "?????" because Java is encoding the output as windows-1252.

The "but" line is "?????" because JRuby has encoded the strings to windows-1252 internally and then written and returned the question marks. I find it particularly odd that it would do this, both because the script is passed as a string directly from Java in the first place, but also because the script itself clearly says the strings are UTF-8.

This was JRUBY-4890 on the old tracker. The script had to be updated a bit to have a #encoding: UTF-8 directive because JRuby now complains if you omit it.

@headius
Copy link
Member

headius commented Feb 24, 2015

I added this test on JRuby 1.7 HEAD and master HEAD and it passed in both places. I also confirmed that file.encoding defaults to Cp1252. Am I doing something wrong?

@hakanai
Copy link
Author

hakanai commented Feb 25, 2015

Issue still occurs in 1.7.13 (what I had on my home computer when I got home), 1.7.19 (downloaded now) and 9.0.0.0.pre1 (downloaded now). I verified all three on OSX and specified the -Dfile.encoding system property on the command-line.

The full command-lines I used:

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-1.7.13.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-1.7.19.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:jruby-complete-9.0.0.0.pre1.jar:junit-4.12.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters

Of course in all cases, removing the file.encoding parameter makes the test pass.

Admittedly I had to fix compilation of the test itself as well, so here it is for completeness:

import java.io.Writer;
import java.io.StringWriter;

import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jruby.embed.ScriptingContainer;
import org.junit.Test;

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.*;

public class TestUnicodeCharacters {
    String orig = "\u6625\u304C\u6765\u305F\u3002";
    String scriptlet = "#encoding: UTF-8\n" +
                       "str = \"" + orig + "\"\n" +
                       "puts str\n" +
                       "str\n";
    Writer writer = new StringWriter();

    @Test
    public void testCharacterEncodingViaScriptEngine() throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByExtension("rb");
        ScriptContext context = engine.getContext();
        context.setWriter(writer);
        String result = (String) engine.eval(scriptlet, context);
        checkValues(result);
    }

    @Test
    public void testCharacterEncodingViaScriptContainer() throws Exception {
        ScriptingContainer container = new ScriptingContainer();
        container.setWriter(writer);
        String result = (String) container.runScriptlet(scriptlet);
        checkValues(result);
    }

    private void checkValues(String returnedResult) {
        assertThat(returnedResult, is(equalTo(orig)));
        assertThat(writer.toString().trim(), is(equalTo(orig)));
    }
}

@hakanai
Copy link
Author

hakanai commented Feb 25, 2015

Managed to build trunk, same deal:

Rika:encoding trejkaz$ java -Dfile.encoding=windows-1252 -classpath hamcrest-all-1.3.jar:junit-4.12.jar:jruby-complete-9.0.0.0-SNAPSHOT.jar:. org.junit.runner.JUnitCore TestUnicodeCharacters
JUnit version 4.12
.E.E
Time: 2.434
There were 2 failures:
1) testCharacterEncodingViaScriptContainer(TestUnicodeCharacters)
java.lang.AssertionError: 
Expected: is "?????"
     but: was "?????"
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
...

@hakanai
Copy link
Author

hakanai commented Mar 14, 2016

Verifying that this not only still occurs on 9.0.5.0, but now it occurs for me on OSX under IDEA 15, so one of those or something else updated and now it occurs in a new situation. Maybe this means others can now confirm the issue, so perhaps the chance of getting a fix will increase?

Current content of our test case:

@Test
public void testUnicodeStringOutputDirectlyViaScriptEngine() throws Exception
{
    String orig = "\u30C6\u30B9\u30C8";

    ScriptEngine engine = new JRubyEngineFactory().getScriptEngine();

    ScriptContext context = new SimpleScriptContext();
    StringWriter writer = new StringWriter();
    context.setWriter(writer);

    Object returnedResult = engine.eval("#encoding: utf-8\n" +
                                        "str = \"" + orig + "\" ; puts str ; str\n", context);

    assertEquals("Wrong result returned", orig, returnedResult);
    assertEquals("Wrong result printed", orig, writer.toString().trim());
}

The failure looks like this:

org.junit.ComparisonFailure: Wrong result returned 
Expected :???
Actual   :???

The test is apparently running with -Dfile.encoding=US-ASCII so IDEA refuses to show the text. If I change it to UTF-8 then it passes, so I can't get sensible output which shows the expected text. But you can look at the test case and see that I'm not expecting question marks, at least.

If you comment out the "Wrong result returned" check, it fails on "Wrong result printed".

@hakanai
Copy link
Author

hakanai commented Mar 14, 2016

In the debugger, in EmbedRubyRuntimeAdapterImpl#runParser, I have:

* input = "#encoding: utf-8\nstr = "テスト" ; puts str ; str\n"
* ...
* node = {org.jruby.ast.RootNode@6656} "(RootNode 1, (BlockNode 1, (DAsgnNode:str 1, (StrNode 1)), (FCallNode:puts 1, (ArrayNode 1, (DVarNode:str 1)), null), (DVarNode:str 1)))"
    * scope = {org.jruby.runtime.scope.ManyVarsDynamicScope@6653} "Static Type[819314039]: block [str=null]\n  Static Type[650867604]: local []"
    * staticScope = {org.jruby.parser.StaticScope@6665} "StaticScope(EVAL):[str]"
    * bodyNode = {org.jruby.ast.BlockNode@6666} "(BlockNode 1, (DAsgnNode:str 1, (StrNode 1)), (FCallNode:puts 1, (ArrayNode 1, (DVarNode:str 1)), null), (DVarNode:str 1))"
        * list = {org.jruby.ast.Node[3]@6672} 
            * 0 = {org.jruby.ast.DAsgnNode@6675} "(DAsgnNode:str 1, (StrNode 1))"
                * name = "str"
                * location = 0
                * valueNode = {org.jruby.ast.StrNode@6692} "(StrNode 1)"
                    * value = {org.jruby.util.ByteList@6696} "???"
                        * bytes = {byte[15]@6699} 
                            * 0 = 63
                            * 1 = 63
                            * 2 = 63
                            * 3 = 0
                            * 4 = 0
                            * 5 = 0
                            * 6 = 0
                            * 7 = 0
                            * 8 = 0
                            * 9 = 0
                            * 10 = 0
                            * 11 = 0
                            * 12 = 0
                            * 13 = 0
                            * 14 = 0
                        * begin = 0
                        * realSize = 3
                        * encoding = {org.jcodings.specific.UTF8Encoding@6700} "UTF-8"
                        * hash = 0
                        * stringValue = "???"

So it has already been destroyed at parse-time, explaining why both the result and the output are wrong.

@hakanai
Copy link
Author

hakanai commented Mar 14, 2016

Ruby.java line 2750 appears to be the culprit:

public Node parseEval(String content, String file, DynamicScope scope, int lineNumber) {
    addEvalParseToStats();
    return parser.parse(file, content.getBytes(), scope, new ParserConfiguration(this, lineNumber, false, false, config));
}

Call to String#getBytes() with no Charset parameter. Have you ever considered running the forbidden-apis tool over JRuby? It would have found this one. :)

Well, I can't figure out how this was supposed to work, but basically by the time getBytes() is called, the string is now "#encoding: utf-8\nstr = "???" ; puts str ; str\n", so its Unicode content was destroyed before the parser could even get to the line telling it not to do that.

@headius
Copy link
Member

headius commented Mar 14, 2016

@trejkaz You may be right about the culprit here. Sorry this one slipped through the cracks.

We do use file.encoding implicitly to encode Java strings into Ruby strings, assuming that's what the user would want. But I think perhaps that's incorrect logic; the user will probably want Strings coming from Java to get encoded the same way multibyte strings read in via IO get encoded, and that generally means UTF-8.

@headius headius added this to the JRuby 9.1.0.0 milestone Mar 14, 2016
@headius
Copy link
Member

headius commented Mar 14, 2016

I have a fix for the bytes going in, but the output being returned is still getting mangled.

@headius
Copy link
Member

headius commented Mar 14, 2016

Ok, two fixes for this one:

  • The getBytes @trejkaz pointed out was not encoding incoming Strings properly, and
  • The construction of a WriterOutputStream in org.jruby.embed.jsr223.Utils was not decoding output properly.

Fix coming. @trejkaz Can we incorporate your test case into our suite? If you have any more you'd like to add, we'd really appreciate it!

headius added a commit that referenced this issue Mar 14, 2016
* Incoming scripts should be decoded from String to byte[] using
  default internal encoding, rather than trusting JDK's
  file.encoding to be appropriate.
* Outgoing streams wrapping writers should decode strings based
  on current default internal encoding.

Fixes #2403
@headius headius self-assigned this Mar 14, 2016
@hakanai
Copy link
Author

hakanai commented Mar 14, 2016

Feel free to include that test in the JRuby suite. It will be good for us as it stops a regression creeping in.

As far as others... I just had a look through all our tests and there is nothing else about encoding in our suite, other than a second nearly identical test which I'll probably remove now that I have found it.

Thanks for the quick fix! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants