double-quoted UTF8 hash key has the wrong encoding #2591

k77ch7 · 2015-02-12T14:27:37Z

The following test.rb works on MRI Ruby 2.2.0,
but reports the wrong encoding and raises an ArgumentError on JRuby master.

test.rb

# encoding: utf-8
h = { "Ãa1":  true }
puts h.keys.first.encoding
puts h

MRI Ruby 2.2.0

$ ruby ~/development/test.rb
UTF-8
{:Ãa1=>true}

JRuby master

$ jruby ~/development/test.rb
US-ASCII
ArgumentError: invalid byte sequence in US-ASCII
     inspect at org/jruby/RubySymbol.java:241
     inspect at org/jruby/RubySymbol.java:230
     inspect at org/jruby/RubyHash.java:844
        to_s at org/jruby/RubyHash.java:909
        puts at org/jruby/RubyIO.java:2406
        puts at org/jruby/RubyKernel.java:542
  __script__ at /Users/k77ch7/test.rb:4

My env

jruby 9.0.0.0-SNAPSHOT (2.2.0p0) 2015-02-12 1c1838e Java HotSpot(TM) 64-Bit Server VM 25.5-b02 on 1.8.0_05-b13 +jit [darwin-x86_64]
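
To narrow down where the encoding is lost, it may help to inspect the key without going through Hash#inspect, which is where the ArgumentError above is raised. The following is a minimal sketch; `key_encoding_report` is a hypothetical helper that is not part of the original report.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the original report): collects the name,
# encoding, and byte-validity of each key without calling Hash#inspect.
def key_encoding_report(hash)
  hash.keys.map do |key|
    name = key.to_s
    { encoding: key.encoding.name, bytes: name.bytes, valid: name.valid_encoding? }
  end
end

h = { "Ãa1": true }  # same literal as in test.rb
report = key_encoding_report(h)
report.each { |entry| puts entry }
```

On MRI Ruby 2.2.0 the key's encoding is UTF-8, as shown in the output above. What this prints on JRuby master has not been checked here.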


Kagetsuki · 2015-03-01T07:03:06Z

To avoid opening a new issue I'll add this here: I've just found out that JRuby basically does not meet m17n compliance the way Ruby 1.9+ does. See:
http://rosettacode.org/wiki/Unicode_variable_names#Ruby

I have some code that needs hash keys that are UTF-8 and would love to run it on JRuby, but it's a no-go. Both 9k and 1.7.19 have the same issue, and specifying code pages and language for the Java environment fails to remedy it.

I'm pretty sure the root of the issue is a mix of failure to recognize the nature of the issue (we're NOT talking about UTF-8 strings held in variables or file system paths) and perhaps unfamiliarity with the fact that mainline Ruby supports UTF-8 variable names, method names, hash keys, symbols, etc. However, the reason the problem exists in the first place is likely how Java imposes the system/environment code set on whatever is being run.

headius · 2015-03-12T22:04:49Z

This seems to be a parser issue and @enebo will probably have to tackle it. Would be nice to fix in 1.7.

@enebo: It seems like the encodings are not being handled properly for these new style keys :-(

Kagetsuki · 2015-03-13T05:23:45Z

@headius and @enebo: I happened to be playing around with some encoding issues a few days ago, so this is good timing, @headius. I think I understand the issue better now, and I can say with much higher certainty that its root is very likely in the Java runtime rather than in JRuby itself. That isn't to say JRuby meets m17n compliance for UTF-8 in actual code, because it doesn't; but without figuring out how to force the JVM/runtime/whatever to use UTF-8 by default it will be hard to make further progress.

From my investigation that isn't simple. Apparently, once Java starts running something, telling it to change its code page is non-trivial, and because it's more of a system-level operation with no standard for how it should work, it differs with every Java implementation. To make matters worse, different Java implementations apparently handle UTF-8 slightly differently (e.g. Dalvik will apparently give different results than other Java implementations for some comparisons of UTF-8 strings...).

@enebo may have a few headaches waiting for him when he decides to tackle this. To be honest I have a feeling this could require way more work than anyone is expecting. I can only hope I'm wrong.

Kagetsuki · 2015-03-13T05:41:01Z

Also, just in case anyone wants to try to make an argument like "but you shouldn't be using UTF-8 hash keys anyway!": there's a whole lot of JSON out there using UTF-8 strings as keys - try parsing that. YEAH, THAT'S WHAT I THOUGHT!
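
As a rough illustration of the JSON case, the sketch below parses a document with non-ASCII keys, once with String keys and once with Symbol keys. The JSON text is made up for this example; only `JSON.parse` and its `symbolize_names` option come from Ruby's standard library.

```ruby
require "json"

# Illustrative JSON document with non-ASCII keys (invented for this sketch).
doc = '{"名前": "jruby", "Ãa1": true}'

parsed     = JSON.parse(doc)                         # keys stay Strings
symbolized = JSON.parse(doc, symbolize_names: true)  # keys become Symbols

# Print each key with its encoding; Symbol keys are the case this issue is about.
parsed.each_key     { |k| puts "#{k} (#{k.class}, #{k.encoding})" }
symbolized.each_key { |k| puts "#{k} (#{k.class}, #{k.encoding})" }
```

Whether the Symbol keys here hit the same code path as the hash literal in test.rb is an assumption, since the keys are created at runtime rather than by the parser.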

As for UTF-8 method names I can think of two examples where I've seen this that were nifty. My favourite is the math symbols:

⊕⊖⊙⊘Σ

Not "necessary" but if we've got UTF-8 symbols then is there really any added cost to having UTF-8 compatible methods too?
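
For anyone who hasn't seen this in practice, here is a minimal sketch of methods named with math symbols. The module name and the meanings given to ⊕ and ⊖ are invented for illustration and do not come from any particular library.

```ruby
# encoding: utf-8

# Hypothetical module: the names and behaviour are illustrative only.
module MathSymbols
  module_function

  # Plain addition under a symbolic name.
  def ⊕(a, b)
    a + b
  end

  # Plain subtraction under a symbolic name.
  def ⊖(a, b)
    a - b
  end
end

sum  = MathSymbols.⊕(2, 3)
diff = MathSymbols.public_send(:⊖, 5, 3)  # the method name is also a UTF-8 symbol
puts sum
puts diff
```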

enebo · 2015-03-16T15:34:41Z

I unmarked 1.7.x since I think this is a Ruby 2.1+ syntax change.
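
For code that cannot use the newer double-quoted key syntax, the same symbol key can be spelled with forms that predate it. This is only a sketch of the equivalent spellings; whether they avoid the encoding problem on any given JRuby version has not been verified here.

```ruby
# encoding: utf-8

new_style = { "Ãa1": true }           # double-quoted key, the form used in test.rb
rocket    = { :"Ãa1" => true }        # quoted symbol literal with a hash rocket
converted = { "Ãa1".to_sym => true }  # symbol created at runtime via String#to_sym

# All three spellings should produce the same hash.
same = [rocket, converted].all? { |h| h == new_style }
puts same
puts [new_style, rocket, converted].map { |h| h.keys.first.encoding.name }.uniq.inspect
```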

k77ch7 · 2015-03-17T13:24:48Z

Hash#inspect differs from MRI Ruby 2.2.0. I will open a new issue.

k77ch7 changed the title ~~utf-8 double-quoted hash key has the wrong encoding~~ double-quoted UTF8 hash key has the wrong encoding Feb 12, 2015

headius added parser JRuby 1.7.x JRuby 9000 labels Mar 12, 2015

k77ch7 added a commit to k77ch7/jruby that referenced this issue Mar 13, 2015

fix jrubyGH-2591 on master. keeps encoding of symbol name.

407337e

k77ch7 mentioned this issue Mar 13, 2015

Fix for issue 2591 on master : double-quoted UTF8 hash key encoding #2692

Merged

enebo removed the JRuby 1.7.x label Mar 16, 2015

enebo closed this as completed in 0825e69 Mar 16, 2015

enebo added this to the 9.0.0.0.pre2 milestone Apr 28, 2015
