double-quoted UTF8 hash key has the wrong encoding #2591

Closed
k77ch7 opened this issue Feb 12, 2015 · 6 comments

@k77ch7
Contributor

k77ch7 commented Feb 12, 2015

The following test.rb works on MRI Ruby 2.2.0,
but on JRuby master the key is reported with the wrong encoding (US-ASCII) and printing the hash raises ArgumentError.

test.rb

# encoding: utf-8
h = { "Ãa1":  true }
puts h.keys.first.encoding
puts h

MRI Ruby 2.2.0

$ ruby ~/development/test.rb
UTF-8
{:Ãa1=>true}

JRuby master

$ jruby ~/development/test.rb
US-ASCII
ArgumentError: invalid byte sequence in US-ASCII
     inspect at org/jruby/RubySymbol.java:241
     inspect at org/jruby/RubySymbol.java:230
     inspect at org/jruby/RubyHash.java:844
        to_s at org/jruby/RubyHash.java:909
        puts at org/jruby/RubyIO.java:2406
        puts at org/jruby/RubyKernel.java:542
  __script__ at /Users/k77ch7/test.rb:4

My env

jruby 9.0.0.0-SNAPSHOT (2.2.0p0) 2015-02-12 1c1838e Java HotSpot(TM) 64-Bit Server VM 25.5-b02 on 1.8.0_05-b13 +jit [darwin-x86_64]
@k77ch7 k77ch7 changed the title utf-8 double-quoted hash key has the wrong encoding double-quoted UTF8 hash key has the wrong encoding Feb 12, 2015
@Kagetsuki

Rather than open a new issue, I'll add this here: I've just found out that JRuby basically does not meet m17n (multilingualization) compliance the way Ruby 1.9+ does. Note:
http://rosettacode.org/wiki/Unicode_variable_names#Ruby

I have some code that needs to have hash keys that are UTF-8 and would love to run it on JRuby, but it's a no go. Both 9k and 1.7.19 have the same issue, and specifying code pages and language for the Java environment fails to remedy it.

I'm pretty sure the root of the issue is a mix of failing to recognize the nature of the problem (we're NOT talking about UTF-8 strings held in variables or file system paths) and perhaps unfamiliarity with the fact that mainline Ruby supports UTF-8 variable names, method names, hash keys, symbols, etc. However, the reason the problem exists in the first place is likely how Java imposes the system/environment code set on whatever is being run.

@headius
Member

headius commented Mar 12, 2015

This seems to be a parser issue and @enebo will probably have to tackle it. Would be nice to fix in 1.7.

@enebo: It seems like the encodings are not being handled properly for these new style keys :-(

@Kagetsuki

@headius and @enebo : I actually just happened to be playing around with some encoding issues a few days ago, so good timing to post, @headius. I think I understand the issue better now and can say with much higher certainty that the root of it is very likely linked to the Java runtime more than to JRuby itself. That isn't to say that JRuby is meeting m17n compliance for UTF-8 in actual code, because it isn't; but without figuring out how to force the JVM/runtime/whatever to use UTF-8 by default it will be hard to progress further.

From my investigation, that isn't something simple. Apparently, once Java starts running something, telling it to change its code page is non-trivial; and because it's more of a system-level operation with no standard for how it should be done, it differs with every Java implementation. To make matters worse, it looks like different Java implementations actually handle UTF-8 slightly differently (e.g. apparently Dalvik will give different results than other Java implementations for some comparisons of UTF-8 strings...).
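One way to separate the two things being discussed here (the encoding Ruby assigns to source literals versus the runtime's default charset) is to print them side by side. This is a rough diagnostic sketch under the assumption that the file is saved as UTF-8; the `file.encoding` system property is only read when running on JRuby, and the variable name is illustrative.

```ruby
# Encoding the parser assigns to this source file.
puts "source:   #{__ENCODING__}"
# Default encodings used for IO; these are influenced by the environment.
puts "external: #{Encoding.default_external}"
puts "internal: #{Encoding.default_internal.inspect}"

# A non-ASCII string literal is tagged with the source encoding.
literal = "Ãa1"
puts "literal:  #{literal.encoding}"

if RUBY_ENGINE == "jruby"
  require "java"
  # JVM-wide default charset property, as seen from JRuby.
  puts "file.encoding: #{java.lang.System.getProperty('file.encoding')}"
end
```

If the literal is tagged correctly while a symbol key built from the same characters is not, the JVM default is probably not the deciding factor.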

@enebo may have a few headaches waiting for him when he decides to tackle this. To be honest I have a feeling this could require way more work than anyone is expecting. I can only hope I'm wrong.

@Kagetsuki

Also, just in case anyone wants to try and make an argument like "but you shouldn't be using UTF-8 hash keys anyway!": there's a whole lot of JSON out there using UTF-8 strings as keys - try parsing that. YEAH, THAT'S WHAT I THOUGHT!
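To make the JSON case concrete, here is a small sketch using the standard `json` library with `symbolize_names: true`, which turns each object key into a symbol. The document and its key are made up for illustration; the JSON text uses `\u` escapes so the snippet itself is plain ASCII.

```ruby
require "json"

# JSON text whose only key is three non-ASCII characters (escaped in the JSON itself).
doc = '{"\u65e5\u672c\u8a9e": 1}'

parsed = JSON.parse(doc, symbolize_names: true)
key = parsed.keys.first

puts key.class      # Symbol
puts key.encoding   # expected to be UTF-8 on a compliant implementation
puts parsed.inspect # inspect must cope with the non-ASCII symbol
```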

As for UTF-8 method names I can think of two examples where I've seen this that were nifty. My favourite is the math symbols:

⊕⊖⊙⊘Σ

Not "necessary" but if we've got UTF-8 symbols then is there really any added cost to having UTF-8 compatible methods too?
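As an illustration of the math-symbol idea, the following sketch defines a method named `⊕` on a hypothetical two-component vector type. The `Vec` struct is invented for this example, and the file is assumed to be saved as UTF-8.

```ruby
# Hypothetical 2D vector; the name Vec and its fields are made up for this sketch.
Vec = Struct.new(:x, :y) do
  # Ruby accepts non-ASCII characters in identifiers, so this is an ordinary method.
  def ⊕(other)
    Vec.new(x + other.x, y + other.y)
  end
end

sum = Vec.new(1, 2).⊕(Vec.new(3, 4))
puts sum.to_a.inspect  # => [4, 6]

# The method name is stored as a symbol, so it has an encoding like any other symbol.
puts sum.method(:⊕).name.encoding
```

Note that `⊕` here is called with normal method-call syntax (`a.⊕(b)`); it is not a new infix operator.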

@enebo
Member

enebo commented Mar 16, 2015

I unmarked 1.7.x since I think this is a Ruby 2.1+ syntax change.

@enebo enebo closed this as completed in 0825e69 Mar 16, 2015
@k77ch7
Contributor Author

k77ch7 commented Mar 17, 2015

Hash#inspect still differs from MRI Ruby 2.2.0. I will open a new issue.

@enebo enebo added this to the 9.0.0.0.pre2 milestone Apr 28, 2015