Non-ASCII Symbol gives ArgumentError when calling inspect on the symbol #4070

Closed
donv opened this issue Aug 12, 2016 · 12 comments

Comments

@donv
Member

donv commented Aug 12, 2016

Environment

$ ruby -v
jruby 9.1.3.0-SNAPSHOT (2.3.0) 2016-08-12 93bd82f Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
$ uname -a
Darwin macbeth-3.local 14.5.0 Darwin Kernel Version 14.5.0: Thu Jun 16 19:58:21 PDT 2016; root:xnu-2782.50.4~1/RELEASE_X86_64 x86_64

Expected Behavior

$ ruby -v
ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin14]
$ irb
irb(main):001:0> :Renè
=> :Renè

Actual Behavior

$ ruby -v
jruby 9.1.3.0-SNAPSHOT (2.3.0) 2016-08-12 93bd82f Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
$ irb
irb(main):002:0> :René
ArgumentError: invalid byte sequence in UTF-8
    from org/jruby/RubySymbol.java:274:in `inspect'
    from org/jruby/RubySymbol.java:259:in `inspect'
    from org/jruby/RubyKernel.java:1295:in `loop'
    from org/jruby/RubyKernel.java:1114:in `catch'
    from org/jruby/RubyKernel.java:1114:in `catch'
    from /Users/uwe/.rubies/jruby-9.1.3.0-snapshot/bin/irb:13:in `<main>'

@enebo
Member

enebo commented Aug 12, 2016

This appears to be specific to irb itself. If I make a symbol in a file or via -e then :Renè parses fine.

@phluid61
Contributor

phluid61 commented Aug 19, 2016

It's an error in Symbol#inspect

$ bin/jruby -v -e 'p :Renè'
jruby 9.1.3.0-SNAPSHOT (2.3.0) 2016-07-07 6600d9f OpenJDK 64-Bit Server VM 25.91-b14 on 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14 +jit [linux-x86_64]
ArgumentError: invalid byte sequence in UTF-8
  inspect at org/jruby/RubySymbol.java:274
  inspect at org/jruby/RubySymbol.java:259
        p at org/jruby/RubyKernel.java:476
   <main> at -e:1

@phluid61
Contributor

Actually, at some point preciseLength() in utils/StringSupport.java is returning -3, so codePoint() is throwing the argument error, which bubbles up via isPrintable() in RubySymbol.java

The root cause seems to be in the org.jcodings.Encoding child's length(), but I don't have time right now to keep digging.

@enebo
Member

enebo commented Aug 19, 2016

Following up on @phluid61: I see that the ByteList for Renè contains only a single byte for è, and the -3 represents -1-missing, i.e. we are missing 2 bytes. So we are definitely storing the wrong byte data for symbols. Thanks for figuring out this is just inspect doing the wrong thing. In retrospect it makes sense that this is why irb was unhappy.
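The -3 arithmetic can be sketched in plain Ruby. This is only an illustration of the "-1-missing" convention described in this comment, not the jcodings implementation: `utf8_sequence_length` and `length_or_missing` are hypothetical helpers, and continuation bytes are not validated.

```ruby
# Hypothetical helper: how many bytes a UTF-8 sequence should have,
# judged only from its lead byte. Returns nil for bytes that cannot
# start a sequence.
def utf8_sequence_length(lead_byte)
  case lead_byte
  when 0x00..0x7F then 1
  when 0xC2..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  end
end

# Hypothetical helper: returns the sequence length when enough bytes are
# available, otherwise -1 - missing (the convention described above).
def length_or_missing(bytes, offset)
  expected = utf8_sequence_length(bytes[offset])
  return nil if expected.nil?

  available = bytes.length - offset
  missing = expected - available
  missing > 0 ? -1 - missing : expected
end

bytes = [0x52, 0x65, 0x6E, 0xE8] # the bytes reported for the bare symbol
puts length_or_missing(bytes, 0) # 1: "R" is a complete one-byte sequence
puts length_or_missing(bytes, 3) # -3: 0xE8 starts a 3-byte sequence, 2 bytes are missing
```

Read as UTF-8, 0xE8 is the lead byte of a three-byte sequence, which is why a lone E8 at the end of the data is reported as two bytes short.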

@donv
Member Author

donv commented Aug 19, 2016

@enebo @phluid61 Maybe I misunderstood what you are saying, but this definitely happens outside of IRB as well. We have this pop up in our applications. Our workaround is to quote the symbols. Quoted symbols work as expected:

$ ruby -v
jruby 9.1.3.0-SNAPSHOT (2.3.1) 2016-08-19 6600d9f Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
$ ruby -e "puts :Renè"
Ren?
$ ruby -e "puts :'Renè'"
Renè

@phluid61
Contributor

@enebo I see the problem, the raw symbol :Renè has bytes 52 65 6E E8, which corresponds with the ISO-8859-1 encoding of "Renè"; however the symbol believes it is encoded in UTF-8.

When the symbol is quoted (:"Renè" or :'Renè') it has the correct UTF-8 bytes 52 65 6E C3 A8, so the call to #inspect works properly.
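The two byte sequences can be reproduced on any Ruby with a short sketch. `hex_bytes` is a hypothetical helper, and the string is written with a `\u` escape so the result does not depend on the source file encoding.

```ruby
# Hypothetical helper: format a string's raw bytes as space-separated hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

name = "Ren\u00E8" # "Renè" as a UTF-8 string

puts hex_bytes(name.encode("ISO-8859-1")) # 52 65 6E E8
puts hex_bytes(name)                      # 52 65 6E C3 A8
```

In ISO-8859-1 the character è is the single byte E8, while in UTF-8 it is the two-byte sequence C3 A8.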

Again, I don't have time to dig further into it right now.

@phluid61
Contributor

Okay, in org.jruby.RubySymbol.newSymbol(Ruby, String, Encoding) the newSymbol's String and ByteList objects don't match.

In org.jruby.RubySymbol:

  • newSymbol(Ruby, String, Encoding) calls...
  • newSymbol(Ruby, String) calls...
  • SymbolTable.getSymbol(String, boolean false) calls...
  • symbolBytesFromString(Ruby, String) calls...
  • new ByteList(ByteList.plain(internedSymbol), USASCIIEncoding.INSTANCE, false);

In that final line, ByteList.plain is essentially just return encode(s, "ISO-8859-1");

After that call stack, newSymbol() calls newSymbol.associateEncoding(encoding), which directly sets the encoding of the ByteList object to UTF-8. So it holds ISO-8859-1 bytes, but it thinks its encoding is UTF-8.
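The mismatch described here can be imitated with ordinary Ruby strings. This is a rough sketch under the assumption that relabeling a string with `force_encoding` is a fair stand-in for what `associateEncoding` does to the ByteList; `mislabel_as_utf8` and `decode_error` are hypothetical helpers, and the last step only shows that the bytes are intact, it is not a proposed JRuby fix.

```ruby
# Hypothetical helper: produce ISO-8859-1 bytes, then label them as UTF-8
# without transcoding, mimicking the state described above.
def mislabel_as_utf8(text)
  text.encode("ISO-8859-1").force_encoding("UTF-8")
end

# Hypothetical helper: returns the exception raised while decoding the
# string into code points, or nil if decoding succeeds.
def decode_error(str)
  str.codepoints
  nil
rescue ArgumentError => e
  e
end

broken = mislabel_as_utf8("Ren\u00E8")
puts broken.valid_encoding?     # false
puts decode_error(broken).class # ArgumentError

# Relabeling the bytes with the encoding they were produced in, then
# transcoding, recovers the original text.
repaired = broken.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts repaired == "Ren\u00E8"    # true
```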

Not sure what the appropriate fix is.

@enebo
Member

enebo commented Aug 20, 2016

@phluid61 I think this is running into what we need to do but haven't. We have made some stuff work but it is inconsistent and the broken way is the way which allows symbols to work in some cases. See: #3880 (comment)

@enebo
Member

enebo commented Aug 20, 2016

I should add to that comment by saying properly encoded values are not just strictly for display purposes but are also needed in cases which will cross a resource gap like to a native extension or to a Java type via our Java Integration.

brocktimus added a commit to brocktimus/jruby that referenced this issue Aug 21, 2016
@enebo enebo changed the title from "Non-ASCII Symbol gives ArgumentError: invalid byte sequence in UTF-8" to "Non-ASCII Symbol gives ArgumentError when calling inspect on the symbol" Aug 22, 2016
@enebo enebo modified the milestones: JRuby 9.1.4.0, JRuby 9.1.3.0 Aug 22, 2016
@enebo
Member

enebo commented Aug 22, 2016

Too risky before 9.1.3.0 but this along with #3880 should be one of the first things we do for 9.1.4.0 so it can bake.

@enebo enebo removed this from the JRuby 9.1.5.0 milestone Sep 7, 2016
@enebo enebo modified the milestones: JRuby 9.1.7.0, JRuby 9.1.6.0 Nov 7, 2016
@enebo enebo modified the milestones: JRuby 9.2.0.0, JRuby 9.1.7.0 Dec 20, 2016
@headius
Member

headius commented Apr 27, 2017

Fixed, likely by changes for #4564.

@headius headius closed this as completed Apr 27, 2017
@headius headius modified the milestones: JRuby 9.1.9.0, JRuby 9.2.0.0 Apr 27, 2017
@donv
Member Author

donv commented May 6, 2017

Confirmed fixed in my application!

kares added a commit that referenced this issue Jan 10, 2018
Extra tests for symbol encoding Re: #4070 + #3719 and possibly #3880