Hash symbols created from ByteList by their encoding. Fixes #1348. #4576

headius · 2017-04-27T19:16:42Z

This change appears to work and so far it's passing tests.

After digging into MRI again, I realized a couple things:

When looking up a symbol from C, using a char*, MRI has the same issue we do: they can't know what encoding you intended, so they always use US-ASCII. This is similar to our use of java.lang.String as the key in all structural tables (methods, constants, etc). Note however that MRI's structural tables are keyed off an ID, which is a symbol index that can vary depending on encoding for the same bytes.
The hashing logic to look up a Symbol from a String considers encoding, but only for strings that either have a non-ASCII-compatible encoding or that are not all 7-bit ASCII. Two strings of the same bytes but different ASCII-compatible encodings will not produce two symbols. This reduces the impact on our structural tables since almost all Ruby code uses ASCII-compatible encodings (e.g. not UTF-16), and still strings that are multibyte will almost always have differing bytes. See below:

 $ rvm ruby-2.4.0 do ruby -e 'sym1 = "cd".intern; sym2 = "cd".force_encoding("UTF-8").intern; p sym1.object_id, sym2.object_id'
14474580
14474580

Although we have a path to create symbols from java.lang.String, we also have a path to create them from ByteList. This second path is used primarily when turning a Ruby String into a Symbol. My patch fixes this path to consider encoding, which fixes the original issue.

There are some risks with this change.

The "raw" strings we use in structural tables are not currently always "raw". We are working to fix this now (@enebo's work on properly-encoded identifiers, for example).
We are not sure how good the coverage is for identifiers in mixed-encoding projects.

My gut feeling is that this can be made safe and mostly proper for a few reasons:

The case of having non-ASCII-compatible identifiers is very, very rare.
This does fix string interning to produce proper encoded symbols.
For all multibyte strings there's nearly zero chance the bytes will collide.

We'll see how this does on the rest of the test suite and evaluate how to proceed.

The problem here was that when using ObjectSpace (or object_id, or FFI) we need to insert some hidden state into objects. This state goes into the variable table, but does not get marshaled. However, if we had both reserved space for instance variables and for one of these hidden variables, we always would proceed to marshal the object as though it might have instance variables. If none of the non-hidden variables had been set, we marshaled the object as having instance variables, but of zero length. This broke several marshal specs, since objects that would otherwise have no instance variable were marshaled as such. This fix uses a second check against the variable list to ensure that there are real variables to marshal (or an encoding variable) before proceeding to emit the instance variable sigil and list.

This is not directly used other than to provide a more accurate guess as to whether there are any non-hidden variable in use by a given type. 0b6e3c0 fixes the original issue (marshal dumping zero-length instance var lists). The accessors of the lazy fields are now deprecated. Use the read or write calls to get an accessor for the hidden fields.

headius · 2017-05-03T22:59:38Z

@enebo That last commit should make this set of changes green, but I wanted to point out that Struct is another place using String/RubyString/RubySymbol/ByteList incorrectly for identifiers. I started to fix it but realized many of the methods I would have to add probably already exist in your branch (e.g. isConstant, defineConstantAt, etc).

headius · 2017-05-03T23:28:06Z

@enebo This branch is indeed green now. But I'm on the fence about including it in 9.1.9.0, even though it is limited to encoding of literal symbols and does not touch (most) identifier logic.

headius · 2017-06-13T19:01:38Z

I will push this and @enebo's bytelist_through_parser branch after testing.

headius · 2018-10-11T18:01:14Z

I have updated this for current master.

kares · 2018-10-31T07:57:23Z

nice work - esp. since all those marshal specs now pass.
... there seems to be some CI failures after the merge.

enebo

Looks like this broke some stuff in a few tests.

I only made one other suggestion otherwise.

enebo · 2018-10-31T13:53:18Z

core/src/main/java/org/jruby/RubyStruct.java

@@ -201,7 +201,16 @@ public static RubyClass newInstance(IRubyObject recv, IRubyObject[] args, Block

        for (int i = (name == null && !nilName) ? 0 : 1; i < args.length; i++) {
            if (i == args.length - 1 && args[i] instanceof RubyHash) break;
-            member.append(runtime.newSymbol(args[i].asJavaString()));
+
+            IRubyObject arg = args[i];


This is an improvement over what is there but if we can accept any type at this point then RubyString should be checked so we can retrieve bytelist vs javastring and we will maintain encoding in mbc. I half thought about making an interface for RubySymbol/RubyString like ByteListContent so we don't special check these both as separate branches (although they don't have same method name either).

Unrelated to this PR but I saw this code before and wondered about whether it is possible when we do asJavaString we convert a fixnum to "1" or stuff like that (and don't raise some error).

We would like to have the table consider encoding, but until we only use ByteList for identifiers this is not possible. We still use Java's String in many places, in which case we only have access to the raw bytes and not the encoding they should be associated with, so the String-based lookup would never be able to add this additional mask. Unfortunately removing this additional hash dimension breaks one Symbol marshal spec, but it was broken before this work as well. This was the cause of Kernel#eval regressions in jruby#4576.

Revisit symbol hash and marshal fixes from #4576

Hash symbols created from ByteList by their encoding. Fixes jruby#1348.

77be8e0

headius added core encoding JRuby 9000 labels Apr 27, 2017

headius added this to the JRuby 9.2.0.0 milestone Apr 27, 2017

headius added 8 commits April 27, 2017 18:05

Properly handle Symbol encoding in marshal streams.

c03f49a

Untag passing Marshal specs.

9027e0e

Unbreak symbol linking and fix ascii encodings.

aac3d9d

Remove more passing Marshal spec tags.

93e958b

Same fix as 0b6e3c0 but for user-marshaled objects.

ae1ef36

Properly handle multibyte symbols in Struct creation.

1d1e65a

enebo modified the milestones: JRuby 9.2.0.0, JRuby 9.2.1.0 May 24, 2018

Merge branch 'master' into sym_enc

5db2686

Merge branch 'master' into sym_enc

1a9667f

headius requested review from kares, enebo and lopex October 30, 2018 20:28

lopex approved these changes Oct 30, 2018

View reviewed changes

enebo requested changes Oct 31, 2018

View reviewed changes

headius modified the milestones: JRuby 9.2.1.0, JRuby 9.2.2.0 Nov 1, 2018

enebo removed this from the JRuby 9.2.2.0 milestone Nov 8, 2018

enebo modified the milestones: JRuby 9.2.3.0, JRuby 9.2.4.0, JRuby 9.2.5.0 Nov 8, 2018

headius mentioned this pull request Nov 20, 2018

Revisit symbol hash and marshal fixes from #4576 #5460

Merged

headius added a commit that referenced this pull request Nov 21, 2018

Merge pull request #5460 from headius/symbol_marshal_work

275fc86

Revisit symbol hash and marshal fixes from #4576

headius closed this Nov 21, 2018

headius modified the milestones: JRuby 9.2.5.0, Invalid or Duplicate Nov 21, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

Hash symbols created from ByteList by their encoding. Fixes #1348. #4576

Hash symbols created from ByteList by their encoding. Fixes #1348. #4576

headius commented Apr 27, 2017

headius commented May 3, 2017

headius commented May 3, 2017

headius commented Jun 13, 2017

headius commented Oct 11, 2018

kares commented Oct 31, 2018

enebo left a comment

enebo Oct 31, 2018

Hash symbols created from ByteList by their encoding. Fixes #1348. #4576

Hash symbols created from ByteList by their encoding. Fixes #1348. #4576

Conversation

headius commented Apr 27, 2017

headius commented May 3, 2017

headius commented May 3, 2017

headius commented Jun 13, 2017

headius commented Oct 11, 2018

kares commented Oct 31, 2018

enebo left a comment

Choose a reason for hiding this comment

enebo Oct 31, 2018

Choose a reason for hiding this comment