Encoding of symbol literals does not respect the encoding of the source file #1328

DavidEGrayson · 2013-12-12T04:11:13Z

Hello. JRuby does not seem to respect encoding marking at the top of source files when deciding the encoding of symbol literals. In a file marked as UTF-8, all symbols get marked as US-ASCII even if they contain special characters. Here is the code that reproduces the issue:

# coding: UTF-8
puts :µ.encoding  # jruby says US-ASCII, MRI says UTF-8
puts :a.encoding  # jruby and MRI both say US-ASCII

It is dangerous to have an symbol like µ with its encoding set to US-ASCII because there are many common things you might do, like calling inspect, that result in "ArgumentError: invalid byte sequence in US-ASCII".

One workaround is to use a string literal followed by .to_sym.

Here is the output of my jruby -v:

jruby 1.7.9 (1.9.3p392) 2013-12-06 87b108a on Java HotSpot(TM) 64-Bit Server VM
1.7.0_07-b10 [Windows 8-amd64]

The MRI I compared this to is: "ruby 2.0.0p0 (2013-02-24) [x64-mingw32]".

Sorry if this is a duplicate; I did check every closed or open github issue tagged with "encoding" trying to avoid that.

The text was updated successfully, but these errors were encountered:

k77ch7 · 2014-12-21T13:33:17Z

This works on master and jruby-1_7. #2172 and #1685 are duplicate of this one.

DavidEGrayson · 2014-12-21T20:11:41Z

Those bug reports got it wrong: the problem only affected symbol literals, not all symbols.

enebo · 2014-12-22T14:50:23Z

Are you saying we still have this wrong or that just the reports were wrong?

-Tom

On Sun, Dec 21, 2014 at 2:11 PM, David Grayson notifications@github.com
wrote:

Those bug reports got it wrong: the problem only affected symbol literals,
not all symbols.

—
Reply to this email directly or view it on GitHub
#1328 (comment).

blog: http://blog.enebo.com twitter: tom_enebo
mail: tom.enebo@gmail.com

DavidEGrayson · 2014-12-22T16:36:32Z

I was just talking about the other bug reports. I haven't tested jruby recently regarding this.

enebo · 2014-12-22T16:37:55Z

Cool. Thanks for the clarification. This last round fixed all remaining
rubyspec failures, so if we are not passing something I don't think neither
MRI nor rubyspec has coverage for it.

-Tom

On Mon, Dec 22, 2014 at 10:36 AM, David Grayson notifications@github.com
wrote:

I was just talking about the other bug reports. I haven't tested jruby
recently regarding this.

—
Reply to this email directly or view it on GitHub
#1328 (comment).

blog: http://blog.enebo.com twitter: tom_enebo
mail: tom.enebo@gmail.com

DavidEGrayson · 2015-01-22T17:59:25Z

I tested the script from my original post on JRuby 9.0.0.0-pre1 and I got a more serious error:

$ jruby -v sym_enc.rb
jruby 9.0.0.0.pre1 (2.2.0p0) 2015-01-20 d537cab Java HotSpot(TM) 64-Bit Server VM 24.45-b08 on 1.7.0
_45-b18 +jit [Windows 8-amd64]
io/console not supported; tty will not be manipulated
UTF8Encoding.java:35:in `length': java.lang.ArrayIndexOutOfBoundsException: -1
        from MultiByteEncoding.java:209:in `strLength'
        from ByteList.java:593:in `lengthEnc'
        from SymbolNode.java:71:in `<init>'
        from RubyParser.java:4224:in `execute'
        from RubyParser.java:1647:in `yyparse'
        from RubyParser.java:1538:in `yyparse'
        from RubyParser.java:5248:in `parse'
        from Parser.java:108:in `parse'
        from Parser.java:89:in `parse'
        from Ruby.java:2691:in `parseFileAndGetAST'
        from Ruby.java:2684:in `parseFileFromMainAndGetAST'
        from Ruby.java:2672:in `parseFileFromMain'
        from Ruby.java:602:in `parseFromMain'
        from Ruby.java:548:in `runFromMain'
        from Main.java:405:in `doRunFromMain'
        from Main.java:300:in `internalRun'
        from Main.java:227:in `run'
        from Main.java:199:in `main'

To help people reproduce the issue, here is what my hex editor (WinHex) shows for sym_enc.rb:

Offset      0  1  2  3  4  5  6  7   8  9  A  B  C  D  E  F

00000000   23 20 63 6F 64 69 6E 67  3A 20 55 54 46 2D 38 0D   # coding: UTF-8 
00000010   0A 70 75 74 73 20 3A C2  B5 2E 65 6E 63 6F 64 69    puts :Âµ.encodi
00000020   6E 67 0D 0A 70 75 74 73  20 3A 61 2E 65 6E 63 6F   ng  puts :a.enco
00000030   64 69 6E 67 0D 0A                                  ding

DavidEGrayson · 2015-01-22T20:42:23Z

I put the script into a gist because headius asked me to on IRC: https://gist.github.com/DavidEGrayson/5a03462b705f6d96e2fe

cheald · 2015-01-22T21:25:16Z

I couldn't reproduce this on Fedora, and @headius couldn't on OS X, but I did successfully reproduce it on Windows 8.1 with pre1. Yay for platform bugs!

DavidEGrayson · 2015-01-22T21:45:44Z

If I run this slightly modified script, JRuby seems to get stuck in an infinite loop:

# coding: UTF-8
puts :aµ.encoding
puts :a.encoding

If you run the issue described in jruby#1328 with -J-Dfile.encoding=CP1252 then the string length comparison fails. This smoothes over the issue, but it is still possible that there are problems with bytes being read with the JVM's file.encoding rather than the encoding specified in the magic comment. Closes jruby#1328

DavidEGrayson · 2015-07-24T22:24:03Z

I just tested this in JRuby 9.0.0.0. Unfortunately, JRuby still gets stuck in an infinite loop when I run the script posted in my last comment.

enebo · 2015-07-27T15:03:30Z

@DavidEGrayson Thanks for the update. This bug involves Windows having a file.encoding=cp1252 (or potentially another one). I can get an infinite loop as well with your script on MacOS if I invoke it as:

jruby -J-Dfile.encoding=CP1252 ../snippets/parser19.rb

Clearly, we are doing something wrong. If you want to work around this until we fix this you can pass -J-Dfile.encoding=UTF-8.

enebo · 2015-08-21T19:50:04Z

For posterity I will write a note on this especially since I think we can make a better fix later (which was too invasive to consider now). All identifiers are created as Java intern'd Strings. This is also how symbols are made (token ':' + tIDENT -- more cases but you get the idea). When we make SymbolNodes we called getBytes() on that Java String which would return bytes as presented via Java file.encoding property. This led to bogus byte sequences which did not map to the encoding Ruby thought the symbol was.

The better fix long-term would be to puts bytelist for all identifiers and then for values actually destined to be things like method names or lvars as intern'd strings. The problem with this long term fix is constructing a ByteList per real identifier and it also means more complicated logic in the parser to convert as appropriate.

In thinking about this I also thought about an extra field on RubyLexer. The problem with this is for productions like label where we have tLABEL expr_value where expr_value could have an embedded identifier.

DavidEGrayson · 2015-08-22T04:13:10Z

Thanks @enebo! I assume your commit fixes the infinite loop and the original problem with symbol literal encoding.

enebo · 2015-08-22T15:46:35Z

@DavidEGrayson yeah I did check that case as an extra sanity check but the reason for the infinite loop was we were taking CP1252 bytes and trying to calculate character length as if they were a UTF-8 set of bytes. I am not sure why jcodings can get stuck in an infinite loop in that case but that case should never happen.

DavidEGrayson mentioned this issue Dec 12, 2013

Unmarshaled symbol has the wrong encoding #1329

Closed

DavidEGrayson closed this as completed Jan 22, 2015

DavidEGrayson reopened this Jan 22, 2015

cheald mentioned this issue Jan 23, 2015

Fix ascii-only detection of symbol names. #2504

Closed

enebo added this to the JRuby 9.0.1.0 milestone Jul 27, 2015

enebo added parser JRuby 9000 labels Jul 27, 2015

enebo closed this as completed in e590e5d Aug 21, 2015

enebo mentioned this issue Aug 24, 2015

Encoding of symbol literals does not respect the encoding of the source file (1.9 edition) #3280

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

Encoding of symbol literals does not respect the encoding of the source file #1328

Encoding of symbol literals does not respect the encoding of the source file #1328

DavidEGrayson commented Dec 12, 2013

k77ch7 commented Dec 21, 2014

DavidEGrayson commented Dec 21, 2014

enebo commented Dec 22, 2014

DavidEGrayson commented Dec 22, 2014

enebo commented Dec 22, 2014

DavidEGrayson commented Jan 22, 2015

DavidEGrayson commented Jan 22, 2015

cheald commented Jan 22, 2015

DavidEGrayson commented Jan 22, 2015

DavidEGrayson commented Jul 24, 2015

enebo commented Jul 27, 2015

enebo commented Aug 21, 2015

DavidEGrayson commented Aug 22, 2015

enebo commented Aug 22, 2015

Encoding of symbol literals does not respect the encoding of the source file #1328

Encoding of symbol literals does not respect the encoding of the source file #1328

Comments

DavidEGrayson commented Dec 12, 2013

k77ch7 commented Dec 21, 2014

DavidEGrayson commented Dec 21, 2014

enebo commented Dec 22, 2014

DavidEGrayson commented Dec 22, 2014

enebo commented Dec 22, 2014

DavidEGrayson commented Jan 22, 2015

DavidEGrayson commented Jan 22, 2015

cheald commented Jan 22, 2015

DavidEGrayson commented Jan 22, 2015

DavidEGrayson commented Jul 24, 2015

enebo commented Jul 27, 2015

enebo commented Aug 21, 2015

DavidEGrayson commented Aug 22, 2015

enebo commented Aug 22, 2015