-
-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hash symbols created from ByteList by their encoding. Fixes #1348. #4576
Conversation
The problem here was that when using ObjectSpace (or object_id, or FFI) we need to insert some hidden state into objects. This state goes into the variable table, but does not get marshaled. However, if we had both reserved space for instance variables and for one of these hidden variables, we always would proceed to marshal the object as though it might have instance variables. If none of the non-hidden variables had been set, we marshaled the object as having instance variables, but of zero length. This broke several marshal specs, since objects that would otherwise have no instance variable were marshaled as such. This fix uses a second check against the variable list to ensure that there are real variables to marshal (or an encoding variable) before proceeding to emit the instance variable sigil and list.
This is not directly used other than to provide a more accurate guess as to whether there are any non-hidden variable in use by a given type. 0b6e3c0 fixes the original issue (marshal dumping zero-length instance var lists). The accessors of the lazy fields are now deprecated. Use the read or write calls to get an accessor for the hidden fields.
@enebo That last commit should make this set of changes green, but I wanted to point out that Struct is another place using String/RubyString/RubySymbol/ByteList incorrectly for identifiers. I started to fix it but realized many of the methods I would have to add probably already exist in your branch (e.g. isConstant, defineConstantAt, etc). |
@enebo This branch is indeed green now. But I'm on the fence about including it in 9.1.9.0, even though it is limited to encoding of literal symbols and does not touch (most) identifier logic. |
I will push this and @enebo's bytelist_through_parser branch after testing. |
I have updated this for current master. |
nice work - esp. since all those marshal specs now pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this broke some stuff in a few tests.
I only made one other suggestion otherwise.
@@ -201,7 +201,16 @@ public static RubyClass newInstance(IRubyObject recv, IRubyObject[] args, Block | |||
|
|||
for (int i = (name == null && !nilName) ? 0 : 1; i < args.length; i++) { | |||
if (i == args.length - 1 && args[i] instanceof RubyHash) break; | |||
member.append(runtime.newSymbol(args[i].asJavaString())); | |||
|
|||
IRubyObject arg = args[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an improvement over what is there but if we can accept any type at this point then RubyString should be checked so we can retrieve bytelist vs javastring and we will maintain encoding in mbc. I half thought about making an interface for RubySymbol/RubyString like ByteListContent so we don't special check these both as separate branches (although they don't have same method name either).
Unrelated to this PR but I saw this code before and wondered about whether it is possible when we do asJavaString we convert a fixnum to "1" or stuff like that (and don't raise some error).
We would like to have the table consider encoding, but until we only use ByteList for identifiers this is not possible. We still use Java's String in many places, in which case we only have access to the raw bytes and not the encoding they should be associated with, so the String-based lookup would never be able to add this additional mask. Unfortunately removing this additional hash dimension breaks one Symbol marshal spec, but it was broken before this work as well. This was the cause of Kernel#eval regressions in jruby#4576.
Revisit symbol hash and marshal fixes from #4576
This change appears to work and so far it's passing tests.
After digging into MRI again, I realized a couple things:
When looking up a symbol from C, using a char*, MRI has the same issue we do: they can't know what encoding you intended, so they always use US-ASCII. This is similar to our use of java.lang.String as the key in all structural tables (methods, constants, etc). Note however that MRI's structural tables are keyed off an ID, which is a symbol index that can vary depending on encoding for the same bytes.
The hashing logic to look up a Symbol from a String considers encoding, but only for strings that either have a non-ASCII-compatible encoding or that are not all 7-bit ASCII. Two strings of the same bytes but different ASCII-compatible encodings will not produce two symbols. This reduces the impact on our structural tables since almost all Ruby code uses ASCII-compatible encodings (e.g. not UTF-16), and still strings that are multibyte will almost always have differing bytes. See below:
There are some risks with this change.
My gut feeling is that this can be made safe and mostly proper for a few reasons:
We'll see how this does on the rest of the test suite and evaluate how to proceed.