-
-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JRuby 9.2.0.0 string deduplication via String#-@ does not work the same as CRuby #5190
Comments
Do you have any other environment that might be affecting this? I can't reproduce on MacOS at least:
|
Our logic does appear to match MRI...if the string is already frozen it is just returned, otherwise it is frozen and deduplicated. static VALUE
str_uminus(VALUE str)
{
if (OBJ_FROZEN(str)) {
return str;
}
else {
return rb_fstring(str);
}
} |
I also do not see this on fedora:
|
I don't think I have an environment that affects this, since I tested on both OpenBSD and on Windows and got the same results on both. Weird that it would work on Mac OS and Linux but not on Windows or OpenBSD. Can you test on Windows? |
@jeremyevans WOW...It does not work on Windows (and Java 9...although I checked 8 and 10 on linux and they both work). WTF |
@jeremyevans ok I take that back a little bit...It does work in a script but fails with -e. So perhaps this is some icky CMD parsing thing in that case. That would not explain why FreeBSD fails though. Also it seems like it should work in CMD command line too. This is weird. |
I don't think it is limited to I think in Sequel's specs, the code is a little different, something akin to: a = :a.to_s.dup
b = :a.to_s.dup
p ((-a).object_id == (-b).object_id) That returns true in ruby 2.5.1 and false in jruby 9.2.0.0 (both in Windows and OpenBSD):
|
There are two launchers for JRuby on non-unix platforms: the bash script and the native executable. On Windows there's only the native executable, so that makes me suspect a problem in there. There's another bug about improperly quoting (I believe) where I mention a Win32 call that lets you get at raw untweaked command line arguments. Can you try both launchers on OpenBSD at least? |
OpenBSD uses the jruby-launcher gem to build a native executable, as otherwise you can't use jruby in a shebang line (unlike Linux, on OpenBSD, the shebang line must refer to a native executable). But I can probably test the bash script manually tomorrow. Note that as my previous comment shows, this is not (only) a command line parsing issue. |
Same results with the bash script:
I've found the command line issue appears to be encoding related. If I force a UTF-8 encoding, it returns
However, forcing a UTF-8 encoding does not fix the
|
@jeremyevans maybe try -X-Dfile.coding=UTF-8 and see if this changes anything. I still don't get it but on windows perhaps file.coding is window31j or something like that...I would still think these would dedup properly though as that encoding. |
With that option I get |
haha @jeremyevans sorry I made two mistakes: -J-Dfile.encoding=UTF-8. |
@enebo That appears to make the same changes as
|
I suspect the symbol example (fstring.rb) fails because the string associated with the symbol is getting interned with another encoding. Even if the string contents are the same, if the encodings differ we don't deup. When I put some logging into our dedup logic, I see the following:
And if I disable booting our Ruby kernel altogether with -Xdebug.parser, it works properly. So somewhere in the libraries that load at boot this particular string is getting deduped as UTF-8. |
Confirmed the script version works if you use something other than 'a'. All of the strings that report encoding mismatches above are 7-bit ASCII (and several are blank) so I'm not sure why they're getting in here as ASCII-8BIT and UTF-8. I did notice that RubyString.hashCode does not consider encoding if the strings are 7-bit and the encoding is ASCII-compatible, which allows a US-ASCII 'a' to retrieve the UTF-8 'a' from the cache. We could modify the hashCode logic to always consider encoding, and then we'd have no conflicts; but we might have multiple copies of some 7-bit strings in there. |
MRI does appear to allow multiple entries for the same bytes but different encoding to live in the fstring cache. a1 = "a".force_encoding("ASCII-8BIT")
a2 = "a".force_encoding("UTF-8")
a1p = (-a1)
a2p = (-a2)
p a1p.object_id == (-a1.dup).object_id # true
p a2p.object_id == (-a2.dup).object_id # true
p a1p.encoding # ASCII-8BIT
p a2p.encoding # UTF-8 |
MRI does not consider the encoding for a 7-bit string's hash code, but for the fstring cache it does specify a comparison function that rejects mismatched encodings. So they'll go in the same bucket but both encodings are preserved. I'll see about making the same change in JRuby. |
For reference, here's the comparison function MRI uses for fstring cache: static int
fstring_cmp(VALUE a, VALUE b)
{
long alen, blen;
const char *aptr, *bptr;
RSTRING_GETMEM(a, aptr, alen);
RSTRING_GETMEM(b, bptr, blen);
return (alen != blen ||
ENCODING_GET(a) != ENCODING_GET(b) ||
memcmp(aptr, bptr, alen) != 0);
} |
Hmm, this may be a bit trickier than I'd hoped. All the JDK collections, including the ConcurrentHashMap we're using, default to calling So there's a bit more gymnastics required here. |
CRuby's fstring cache, used for frozen string deduplication, uses slightly different equalitye logic than the default equality for strings. Specifically, if two strings have the same 7bit ascii bytes, but two different ascii-compatible strings, the strings are still considered to be equal. But for the fstring cache, you can register the same 7-bit string with different ascii-compatible encodings and they both live in the cache. In JRuby, we use a standard JDK collection, ConcurrentHashMap, that always uses the standard equals() method that works like normal String equality as described above. We are forced to use a wrapper, both for storage and for lookup. This patch introduces that wrapper, and also introduces a thread-local caching mechanism to reduce the cost of looking up deduplicated strings in the cache. The additional overhead of the cache is: * The wrapper object and indirecting through it. * Constructing a wrapper object (only when the previous lookup added a new wrapper or this is the first lookup). * Accessing a previously cached wrapper via a thread-local (inverse of the above conditions) In the typical case, where the requested string has already been deduplicated, the system should eventually get to a point where there's no new entries being added, and the cached wrapper is used every time. There may be more overhead at startup to create the wrappers. There may be few calls to lookup a string that do not trigger a new entry, since most language constructs (FrozenString in IR for example) save the result, making the cache perhaps unnecessary.
This spec passes fine on JRuby since jruby/jruby#5190.
This spec passes fine on JRuby since jruby/jruby#5190.
Environment
Also verified on:
Unlikely to be operating system or java version specific.
Expected Behavior
String deduplication via
String#-@
does not work the same as CRuby 2.5 for non-literal strings:Actual Behavior
The text was updated successfully, but these errors were encountered: