Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash symbols created from ByteList by their encoding. Fixes #1348. #4576

Closed
wants to merge 11 commits into from

Conversation

headius
Copy link
Member

@headius headius commented Apr 27, 2017

This change appears to work and so far it's passing tests.

After digging into MRI again, I realized a couple things:

  • When looking up a symbol from C, using a char*, MRI has the same issue we do: they can't know what encoding you intended, so they always use US-ASCII. This is similar to our use of java.lang.String as the key in all structural tables (methods, constants, etc). Note however that MRI's structural tables are keyed off an ID, which is a symbol index that can vary depending on encoding for the same bytes.

  • The hashing logic to look up a Symbol from a String considers encoding, but only for strings that either have a non-ASCII-compatible encoding or that are not all 7-bit ASCII. Two strings of the same bytes but different ASCII-compatible encodings will not produce two symbols. This reduces the impact on our structural tables since almost all Ruby code uses ASCII-compatible encodings (e.g. not UTF-16), and still strings that are multibyte will almost always have differing bytes. See below:

 $ rvm ruby-2.4.0 do ruby -e 'sym1 = "cd".intern; sym2 = "cd".force_encoding("UTF-8").intern; p sym1.object_id, sym2.object_id'
14474580
14474580
  • Although we have a path to create symbols from java.lang.String, we also have a path to create them from ByteList. This second path is used primarily when turning a Ruby String into a Symbol. My patch fixes this path to consider encoding, which fixes the original issue.

There are some risks with this change.

  • The "raw" strings we use in structural tables are not currently always "raw". We are working to fix this now (@enebo's work on properly-encoded identifiers, for example).
  • We are not sure how good the coverage is for identifiers in mixed-encoding projects.

My gut feeling is that this can be made safe and mostly proper for a few reasons:

  • The case of having non-ASCII-compatible identifiers is very, very rare.
  • This does fix string interning to produce proper encoded symbols.
  • For all multibyte strings there's nearly zero chance the bytes will collide.

We'll see how this does on the rest of the test suite and evaluate how to proceed.

headius added 8 commits April 27, 2017 18:05

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
The problem here was that when using ObjectSpace (or object_id,
or FFI) we need to insert some hidden state into objects. This
state goes into the variable table, but does not get marshaled.
However, if we had both reserved space for instance variables and
for one of these hidden variables, we always would proceed to
marshal the object as though it might have instance variables. If
none of the non-hidden variables had been set, we marshaled the
object as having instance variables, but of zero length. This
broke several marshal specs, since objects that would otherwise
have no instance variable were marshaled as such.

This fix uses a second check against the variable list to ensure
that there are real variables to marshal (or an encoding variable)
before proceeding to emit the instance variable sigil and list.
This is not directly used other than to provide a more accurate
guess as to whether there are any non-hidden variable in use by
a given type. 0b6e3c0 fixes the original issue (marshal dumping
zero-length instance var lists).

The accessors of the lazy fields are now deprecated. Use the read
or write calls to get an accessor for the hidden fields.
@headius
Copy link
Member Author

headius commented May 3, 2017

@enebo That last commit should make this set of changes green, but I wanted to point out that Struct is another place using String/RubyString/RubySymbol/ByteList incorrectly for identifiers. I started to fix it but realized many of the methods I would have to add probably already exist in your branch (e.g. isConstant, defineConstantAt, etc).

@headius
Copy link
Member Author

headius commented May 3, 2017

@enebo This branch is indeed green now. But I'm on the fence about including it in 9.1.9.0, even though it is limited to encoding of literal symbols and does not touch (most) identifier logic.

@headius
Copy link
Member Author

headius commented Jun 13, 2017

I will push this and @enebo's bytelist_through_parser branch after testing.

@enebo enebo modified the milestones: JRuby 9.2.0.0, JRuby 9.2.1.0 May 24, 2018
@headius
Copy link
Member Author

headius commented Oct 11, 2018

I have updated this for current master.

@headius headius requested review from kares, enebo and lopex October 30, 2018 20:28
@kares
Copy link
Member

kares commented Oct 31, 2018

nice work - esp. since all those marshal specs now pass.
... there seems to be some CI failures after the merge.

Copy link
Member

@enebo enebo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this broke some stuff in a few tests.

I only made one other suggestion otherwise.

@@ -201,7 +201,16 @@ public static RubyClass newInstance(IRubyObject recv, IRubyObject[] args, Block

for (int i = (name == null && !nilName) ? 0 : 1; i < args.length; i++) {
if (i == args.length - 1 && args[i] instanceof RubyHash) break;
member.append(runtime.newSymbol(args[i].asJavaString()));

IRubyObject arg = args[i];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an improvement over what is there but if we can accept any type at this point then RubyString should be checked so we can retrieve bytelist vs javastring and we will maintain encoding in mbc. I half thought about making an interface for RubySymbol/RubyString like ByteListContent so we don't special check these both as separate branches (although they don't have same method name either).

Unrelated to this PR but I saw this code before and wondered about whether it is possible when we do asJavaString we convert a fixnum to "1" or stuff like that (and don't raise some error).

@headius headius modified the milestones: JRuby 9.2.1.0, JRuby 9.2.2.0 Nov 1, 2018
@enebo enebo removed this from the JRuby 9.2.2.0 milestone Nov 8, 2018
headius added a commit to headius/jruby that referenced this pull request Nov 21, 2018
We would like to have the table consider encoding, but until we
only use ByteList for identifiers this is not possible. We still
use Java's String in many places, in which case we only have
access to the raw bytes and not the encoding they should be
associated with, so the String-based lookup would never be able
to add this additional mask.

Unfortunately removing this additional hash dimension breaks one
Symbol marshal spec, but it was broken before this work as well.

This was the cause of Kernel#eval regressions in jruby#4576.
headius added a commit that referenced this pull request Nov 21, 2018
Revisit symbol hash and marshal fixes from #4576
@headius headius closed this Nov 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants