Encoding of symbol literals does not respect the encoding of the source file #1328
Comments
Those bug reports got it wrong: the problem only affected symbol literals, not all symbols. |
Are you saying we still have this wrong or that just the reports were wrong? -Tom (in reply to David Grayson, Sun, Dec 21, 2014 at 2:11 PM) |
I was just talking about the other bug reports. I haven't tested jruby recently regarding this. |
Cool. Thanks for the clarification. This last round fixed all remaining -Tom (in reply to David Grayson, Mon, Dec 22, 2014 at 10:36 AM) |
I tested the script from my original post on JRuby 9.0.0.0-pre1 and I got a more serious error:
To help people reproduce the issue, here is what my hex editor (WinHex) shows for sym_enc.rb:
|
I put the script into a gist because headius asked me to on IRC: https://gist.github.com/DavidEGrayson/5a03462b705f6d96e2fe |
I couldn't reproduce this on Fedora, and @headius couldn't on OS X, but I did successfully reproduce it on Windows 8.1 with pre1. Yay for platform bugs! |
If I run this slightly modified script, JRuby seems to get stuck in an infinite loop:
|
If you run the script from the issue described in jruby#1328 with -J-Dfile.encoding=CP1252 then the string length comparison fails. This smooths over the issue, but it is still possible that there are problems with bytes being read with the JVM's file.encoding rather than the encoding specified in the magic comment. Closes jruby#1328
I just tested this in JRuby 9.0.0.0. Unfortunately, JRuby still gets stuck in an infinite loop when I run the script posted in my last comment. |
@DavidEGrayson Thanks for the update. This bug involves Windows having a file.encoding=cp1252 (or potentially another one). I can get an infinite loop as well with your script on MacOS if I invoke it as:
Clearly, we are doing something wrong. If you want to work around this until we fix this you can pass -J-Dfile.encoding=UTF-8. |
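To see which default the JVM actually picked up, a script can read the property directly. This is a minimal sketch, assuming it runs under JRuby, where `Java::JavaLang::System` is available. The helper name `jvm_file_encoding` is hypothetical, and on other Ruby implementations it simply returns nil.

```ruby
# Hypothetical helper: report the JVM's file.encoding property under JRuby.
def jvm_file_encoding
  return nil unless RUBY_PLATFORM == "java"   # not on a JVM, nothing to report
  require "java"
  Java::JavaLang::System.getProperty("file.encoding")
end

enc = jvm_file_encoding
puts(enc ? "file.encoding=#{enc}" : "not running on JRuby")
```

When JRuby is started with -J-Dfile.encoding=UTF-8, this should report UTF-8.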
For posterity I will write a note on this, especially since I think we can make a better fix later (which was too invasive to consider now).
All identifiers are created as Java intern'd Strings. This is also how symbols are made (token ':' + tIDENT -- more cases but you get the idea). When we made SymbolNodes we called getBytes() on that Java String, which returns bytes encoded according to the Java file.encoding property. This led to bogus byte sequences which did not map to the encoding Ruby thought the symbol was.
The better fix long-term would be to use a ByteList for all identifiers and then convert only the values actually destined to be things like method names or lvars into intern'd Strings. The problem with this long-term fix is constructing a ByteList per real identifier, and it also means more complicated logic in the parser to convert as appropriate.
In thinking about this I also thought about an extra field on RubyLexer. The problem with this is productions like label, where we have tLABEL expr_value and expr_value could have an embedded identifier. |
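The mismatch described in this note can be illustrated in plain Ruby, without the parser. This is only a sketch of the effect, not JRuby's actual code path: it encodes µ as CP1252 (standing in for what getBytes() would return when file.encoding is CP1252) and then labels those bytes as UTF-8. The helper name `mislabel_as_utf8` is hypothetical.

```ruby
# Hypothetical helper: encode text in `platform_encoding`, then wrongly
# label the resulting bytes as UTF-8 (mimicking the mismatch).
def mislabel_as_utf8(text, platform_encoding)
  text.encode(platform_encoding).force_encoding(Encoding::UTF_8)
end

micro = "\u00B5"                                  # µ, bytes C2 B5 in UTF-8
bogus = mislabel_as_utf8(micro, "Windows-1252")   # single byte B5

puts micro.bytes.inspect       # bytes of the correctly encoded string
puts bogus.bytes.inspect       # bytes produced by the platform encoding
puts bogus.valid_encoding?     # false: a lone B5 is not valid UTF-8
```

Any code that then walks `bogus` character by character as UTF-8 is working on a byte sequence that is not UTF-8 at all.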
Thanks @enebo! I assume your commit fixes the infinite loop and the original problem with symbol literal encoding. |
@DavidEGrayson yeah I did check that case as an extra sanity check but the reason for the infinite loop was we were taking CP1252 bytes and trying to calculate character length as if they were a UTF-8 set of bytes. I am not sure why jcodings can get stuck in an infinite loop in that case but that case should never happen. |
Hello. JRuby does not seem to respect encoding marking at the top of source files when deciding the encoding of symbol literals. In a file marked as UTF-8, all symbols get marked as US-ASCII even if they contain special characters. Here is the code that reproduces the issue:
It is dangerous to have a symbol like µ with its encoding set to US-ASCII because there are many common things you might do, like calling `inspect`, that result in "ArgumentError: invalid byte sequence in US-ASCII".
One workaround is to use a string literal followed by `.to_sym`.
Here is the output of my `jruby -v`:
The MRI I compared this to is: "ruby 2.0.0p0 (2013-02-24) [x64-mingw32]".
Sorry if this is a duplicate; I did check every closed or open github issue tagged with "encoding" trying to avoid that.
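A minimal sketch of the kind of check involved (not the original script, which is in the gist linked in the thread): it compares the encoding of a symbol literal with that of a symbol built via the `.to_sym` workaround. It assumes the file is saved as UTF-8, and the variable names are illustrative.

```ruby
# coding: UTF-8
literal    = :µ            # symbol literal containing a non-ASCII character
workaround = "µ".to_sym    # string literal converted to a symbol

puts literal.encoding      # expected to be UTF-8 when the magic comment is honored
puts workaround.encoding
puts literal.inspect       # the call that raised ArgumentError when the symbol was mislabeled
```

If the two encodings differ, the symbol literal is not picking up the source file's encoding.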