Identifier-keyed tables must always use raw or always use encoded identifiers #3697

headius · 2016-02-24T16:51:09Z

With the move to M17N (multilingualization) in Ruby 1.9, it became possible to name variables, constants, etc. in arbitrary encodings.

In JRuby, we have never fully supported this because all our identifier-keyed tables (method table, constant table, etc.) are keyed by a Java String, and have traditionally used a properly decoded string. This works fine when all identifiers share the same encoding, but breaks when different encodings are mixed, since we lose the original bytes and encoding when going to a UTF-16 String.
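To make the loss concrete, here is a minimal sketch in plain Java (not JRuby code; the class name, the `decode` helper, and the map standing in for a method table are all hypothetical). Two identifiers with different bytes and different encodings decode to the same Java String, so a table keyed by the decoded String cannot tell them apart.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class DecodedKeyCollision {
    // Decode identifier bytes the way a decoded-String-keyed table would.
    static String decode(byte[] bytes, Charset encoding) {
        return new String(bytes, encoding);
    }

    public static void main(String[] args) {
        // The same character, e-acute (U+00E9), in two source encodings.
        byte[] latin1Name = {(byte) 0xE9};             // ISO-8859-1: one byte
        byte[] utf8Name = {(byte) 0xC3, (byte) 0xA9};  // UTF-8: two bytes

        // Hypothetical stand-in for an identifier-keyed table.
        Map<String, String> methodTable = new HashMap<>();
        methodTable.put(decode(latin1Name, StandardCharsets.ISO_8859_1), "from ISO-8859-1 source");
        methodTable.put(decode(utf8Name, StandardCharsets.UTF_8), "from UTF-8 source");

        // Both keys decode to the same String, so the second put
        // overwrites the first and only one entry remains.
        System.out.println("entries: " + methodTable.size());
    }
}
```

Once the bytes have been decoded, nothing in the key records which encoding the identifier came from.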

In order to support this better, we attempted to represent our identifiers like MRI represents its IDs: as the raw bytes of whatever parsed identifier came in. This allows uniquely referencing a given symbol given just its raw bytes, provided the symbol is still alive. What we didn't do is propagate raw bytes throughout all identifier-related APIs; only some of them actually use the raw string, while others still use fully-decoded strings as characters.

If we wish to fix this, we can't do it partway: a partial conversion leads to API conflicts that are hard or impossible to resolve.
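The conflict can be sketched in plain Java. This assumes that a "raw" string means one char per byte (an ISO-8859-1 decode of the identifier bytes); that reading, the class name, and the helpers are assumptions for illustration, not JRuby's actual API. A table populated under one convention misses when queried under the other.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RawVersusDecoded {
    // Assumed "raw" form: each byte becomes one char, regardless of encoding.
    static String raw(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Fully decoded form: bytes interpreted in their real encoding (UTF-8 here).
    static String decoded(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute, encoded as UTF-8 (5 bytes, 4 characters).
        byte[] name = "caf\u00E9".getBytes(StandardCharsets.UTF_8);

        // One API stores the identifier under its raw form...
        Map<String, String> methodTable = new HashMap<>();
        methodTable.put(raw(name), "method body");

        // ...and lookups only succeed if the caller uses the same convention.
        System.out.println("raw lookup hit: " + methodTable.containsKey(raw(name)));
        System.out.println("decoded lookup hit: " + methodTable.containsKey(decoded(name)));
        System.out.println("lengths: " + raw(name).length() + " vs " + decoded(name).length());
    }
}
```

Under the same assumption, printing the raw form directly shows the two UTF-8 bytes of the last character as two separate Latin-1 characters, which is the kind of mangling a message built from a raw string would display.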

There are two paths forward, as I see them:

1. Complete the transition to an ID-like system where every identifier can be properly converted back into an original byte[]+encoding tuple. I started this process in f7f5417 in the new_ids branch. This is a very large effort and may need to wait until a "JRuby 10k" given the wide-reaching API breakage that will result. It may never be feasible.
2. Accept that we will only ever be able to represent identifiers as UTF-16 and make that explicit. Use UTF-16 throughout all identifier APIs. In this approach, I'm not sure what would happen to symbols, since people do depend on them preserving their encoding, and then use those encoded symbols as identifiers.

Neither approach is really great.
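For illustration only, the first path could be sketched as a table keyed by a byte[]+encoding tuple. This is not the design from the new_ids branch: `IdentifierKey` is a hypothetical class, and `java.nio.charset.Charset` stands in for Ruby's encoding object.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class IdentifierKeyDemo {
    // Hypothetical key type: the original bytes plus their encoding.
    static final class IdentifierKey {
        private final byte[] bytes;
        private final Charset encoding;

        IdentifierKey(byte[] bytes, Charset encoding) {
            this.bytes = bytes.clone(); // defensive copy keeps the key immutable
            this.encoding = encoding;
        }

        // Decode only when a human-readable form is needed (e.g. error messages).
        String display() {
            return new String(bytes, encoding);
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof IdentifierKey)) return false;
            IdentifierKey that = (IdentifierKey) other;
            return Arrays.equals(bytes, that.bytes) && encoding.equals(that.encoding);
        }

        @Override
        public int hashCode() {
            return 31 * Arrays.hashCode(bytes) + encoding.hashCode();
        }
    }

    public static void main(String[] args) {
        // e-acute (U+00E9) in two encodings: different bytes, same displayed text.
        IdentifierKey latin1 = new IdentifierKey(new byte[]{(byte) 0xE9}, StandardCharsets.ISO_8859_1);
        IdentifierKey utf8 = new IdentifierKey(new byte[]{(byte) 0xC3, (byte) 0xA9}, StandardCharsets.UTF_8);

        Map<IdentifierKey, String> methodTable = new HashMap<>();
        methodTable.put(latin1, "from ISO-8859-1 source");
        methodTable.put(utf8, "from UTF-8 source");

        // The two identifiers stay distinct in the table, yet each still
        // decodes correctly for display.
        System.out.println("entries: " + methodTable.size());
        System.out.println("same display text: " + latin1.display().equals(utf8.display()));
    }
}
```

The cost hinted at in the first option shows up here as well: every API that currently accepts a Java String identifier would have to accept something like this key instead.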

headius · 2017-04-27T17:49:24Z

Likely to be fixed by identifier work by me and @enebo for 9.2.

headius · 2018-05-16T18:29:38Z

Largely fixed by @enebo's symbol work.

headius added a commit that referenced this issue Feb 24, 2016

Fix str2sym behavior in Module#remove_method.

c3dd1f7

This works properly, but because it uses a "raw" string the resulting error message is mangled when MBC are present. See #3697.

headius mentioned this issue May 13, 2016

define_method using symbols string syntax works incorrectly #3880

Closed

headius added this to the JRuby 9.2.0.0 milestone Apr 27, 2017

headius added core encoding JRuby 9000 labels Apr 27, 2017

enebo mentioned this issue Feb 27, 2018

Properly represent all encodings in methods/constants/variables/symbols #4965

Closed

headius closed this as completed May 16, 2018
