With the move to M17N in Ruby 1.9, it became possible to name variables, constants, etc. using arbitrary encodings.
In JRuby, we have never fully supported this because all our identifier-keyed tables (method table, constant table, etc.) are keyed by a Java String, which has traditionally been the properly decoded identifier. This works fine when all identifiers share the same encoding, but breaks if different encodings are used, since we lose the original bytes and encoding when decoding to a UTF-16 String.
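To make the loss concrete, here is a minimal sketch in plain Java, not JRuby's actual table code. The class name `DecodedKeyCollision` and the helper `decodedKey` are hypothetical. It shows two identifiers whose raw bytes and encodings differ decoding to the same Java String, so a String-keyed table cannot tell them apart.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DecodedKeyCollision {
    // Hypothetical helper: decode raw identifier bytes into the Java String
    // that a String-keyed table would use as its key.
    static String decodedKey(byte[] rawName, Charset encoding) {
        return new String(rawName, encoding);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};             // "é" encoded as ISO-8859-1
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};  // "é" encoded as UTF-8

        String a = decodedKey(latin1, StandardCharsets.ISO_8859_1);
        String b = decodedKey(utf8, StandardCharsets.UTF_8);

        // The raw bytes differ, but the decoded keys are identical, so the
        // original byte[] and encoding cannot be recovered from the key.
        System.out.println(Arrays.equals(latin1, utf8)); // false
        System.out.println(a.equals(b));                 // true
    }
}
```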
In order to support this better, we attempted to represent our identifiers like MRI represents its IDs: as the raw bytes of whatever parsed identifier came in. This allows uniquely referencing a given symbol given just its raw bytes, provided the symbol is still alive. What we didn't do is propagate raw bytes throughout all identifier-related APIs; only some of them actually use the raw string, while others still use fully-decoded strings as characters.
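As an illustration of what keying on raw bytes could look like, the sketch below uses a hypothetical `RawIdentifierKey` class that pairs the bytes with their encoding. It is not the representation JRuby or MRI actually uses, and it makes a simplifying assumption: two keys are equal only when both bytes and encoding match. A real implementation would likely need extra rules, for example for 7-bit ASCII names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class RawIdentifierKeyDemo {
    // Hypothetical key type: the raw identifier bytes plus their encoding.
    static final class RawIdentifierKey {
        private final byte[] bytes;
        private final Charset encoding;

        RawIdentifierKey(byte[] bytes, Charset encoding) {
            this.bytes = bytes.clone(); // defensive copy keeps the key immutable
            this.encoding = encoding;
        }

        // The original bytes+encoding tuple stays recoverable from the key.
        byte[] bytes() { return bytes.clone(); }
        Charset encoding() { return encoding; }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof RawIdentifierKey)) return false;
            RawIdentifierKey other = (RawIdentifierKey) o;
            return Arrays.equals(bytes, other.bytes) && encoding.equals(other.encoding);
        }

        @Override
        public int hashCode() {
            return 31 * Arrays.hashCode(bytes) + encoding.hashCode();
        }
    }

    // Store the same visible name under two encodings and count the entries.
    static int distinctEntries() {
        Map<RawIdentifierKey, String> table = new HashMap<>();
        table.put(new RawIdentifierKey(new byte[] {(byte) 0xE9},
                StandardCharsets.ISO_8859_1), "latin1 definition");
        table.put(new RawIdentifierKey(new byte[] {(byte) 0xC3, (byte) 0xA9},
                StandardCharsets.UTF_8), "utf8 definition");
        return table.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctEntries()); // 2
    }
}
```

The catch described above still applies: a key like this only helps if every identifier-related API accepts it, rather than some APIs taking the key and others taking a decoded String.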
If we wish to fix this, we can't do it part way. A partial transition leads to API conflicts that are hard or impossible to resolve.
There are two paths forward, as I see them:
1. Complete the transition to an ID-like system where every identifier can be properly converted back into an original byte[]+encoding tuple. I started this process in f7f5417 in the new_ids branch. This is a very large effort and may need to wait until a "JRuby 10k" given the wide-reaching API breakage that will result. It may never be feasible.
2. Accept that we will only ever be able to represent identifiers as UTF-16 and make that explicit. Use UTF-16 throughout all identifier APIs. In this approach, I'm not sure what would happen to symbols, since people do depend on them preserving their encoding, and then use those encoded symbols as identifiers.
Neither approach is really great.