JRuby 9.2.0.0 string deduplication via String#-@ does not work the same as CRuby #5190

jeremyevans · 2018-05-24T21:23:45Z

Environment

jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]

Also verified on:

jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 Java HotSpot(TM) 64-Bit Server VM 25.131-b11 on 1.8.0_131-b11 +jit [mswin32-x86_64]

Unlikely to be operating system or java version specific.

Expected Behavior

String deduplication via String#-@ does not work the same as CRuby 2.5 for non-literal strings:

$ ruby -e "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-openbsd]
true

Actual Behavior

$ jruby -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false

The text was updated successfully, but these errors were encountered:

headius · 2018-05-24T21:37:31Z

Do you have any other environment that might be affecting this? I can't reproduce on MacOS at least:

$ jruby -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.1.0-SNAPSHOT (2.5.0) 2018-05-24 98abdb4 Java HotSpot(TM) 64-Bit Server VM 25.161-b12 on 1.8.0_161-b12 +jit [darwin-x86_64]
true

headius · 2018-05-24T21:40:05Z

Our logic does appear to match MRI...if the string is already frozen it is just returned, otherwise it is frozen and deduplicated.

static VALUE
str_uminus(VALUE str)
{
    if (OBJ_FROZEN(str)) {
        return str;
    }
    else {
        return rb_fstring(str);
    }
}

enebo · 2018-05-24T21:42:57Z

I also do not see this on fedora:

./bin/jruby -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 Java HotSpot(TM) 64-Bit Server VM 25.172-b11 on 1.8.0_172-b11 +jit [linux-x86_64]
true

jeremyevans · 2018-05-24T21:44:15Z

I don't think I have an environment that affects this, since I tested on both OpenBSD and on Windows and got the same results on both. Weird that it would work on Mac OS and Linux but not on Windows or OpenBSD. Can you test on Windows?

enebo · 2018-05-24T21:48:26Z

@jeremyevans WOW...It does not work on Windows (and Java 9...although I checked 8 and 10 on linux and they both work). WTF

enebo · 2018-05-24T21:53:58Z

@jeremyevans ok I take that back a little bit...It does work in a script but fails with -e. So perhaps this is some icky CMD parsing thing in that case. That would not explain why FreeBSD fails though. Also it seems like it should work in CMD command line too. This is weird.

jeremyevans · 2018-05-24T22:12:42Z

I don't think it is limited to -e. I noticed the issue because Sequel's specs failed with JRuby 9.2.0.0, and -e was definitely not involved there.

I think in Sequel's specs, the code is a little different, something akin to:

a = :a.to_s.dup
b = :a.to_s.dup
p ((-a).object_id == (-b).object_id)

That returns true in ruby 2.5.1 and false in jruby 9.2.0.0 (both in Windows and OpenBSD):

$ ruby -v fstring_test.rb
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-openbsd]
true
$ jruby -v fstring_test.rb
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false

headius · 2018-05-25T02:30:00Z

There are two launchers for JRuby on non-unix platforms: the bash script and the native executable. On Windows there's only the native executable, so that makes me suspect a problem in there. There's another bug about improperly quoting (I believe) where I mention a Win32 call that lets you get at raw untweaked command line arguments.

Can you try both launchers on OpenBSD at least?

jeremyevans · 2018-05-25T04:05:40Z

OpenBSD uses the jruby-launcher gem to build a native executable, as otherwise you can't use jruby in a shebang line (unlike Linux, on OpenBSD, the shebang line must refer to a native executable). But I can probably test the bash script manually tomorrow.

Note that as my previous comment shows, this is not (only) a command line parsing issue.

jeremyevans · 2018-05-25T15:11:12Z

Same results with the bash script:

$ PATH=/usr/local/jdk-1.8.0/bin/:$PATH bin/jruby.bash -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false

I've found the command line issue appears to be encoding related. If I force a UTF-8 encoding, it returns true:

$ LC_CTYPE=en_US.UTF-8 jruby -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
true

However, forcing a UTF-8 encoding does not fix the fstring_test.rb example above:

$ LC_CTYPE=en_US.UTF-8 jruby -v fstring_test.rb
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false

enebo · 2018-05-25T16:29:39Z

@jeremyevans maybe try -X-Dfile.coding=UTF-8 and see if this changes anything. I still don't get it but on windows perhaps file.coding is window31j or something like that...I would still think these would dedup properly though as that encoding.

jeremyevans · 2018-05-25T16:53:48Z

With that option I get jruby: invalid extended option -Dfile.coding=UTF-8 (-X will list valid options) followed by the usage output (both on OpenBSD and on Windows).

enebo · 2018-05-25T17:00:42Z

haha @jeremyevans sorry I made two mistakes: -J-Dfile.encoding=UTF-8.

jeremyevans · 2018-05-25T17:04:31Z

@enebo That appears to make the same changes as LC_CTYPE=en_US.UTF-8, fixing the command line example, but still not having the fstring_test.rb example work:

$ jruby -J-Dfile.encoding=UTF-8 -ve "a = String.new('a'); p ((-a).object_id == (-a).object_id)"
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
true
$ jruby -J-Dfile.encoding=UTF-8 -v ../fstring_test.rb
jruby 9.2.0.0 (2.5.0) 2018-05-24 81156a8 OpenJDK 64-Bit Server VM 25.144-b01 on 1.8.0_144-b01 +jit [OpenBSD-x86_64]
false

headius · 2018-05-25T19:33:48Z

I suspect the symbol example (fstring.rb) fails because the string associated with the symbol is getting interned with another encoding. Even if the string contents are the same, if the encodings differ we don't deup.

When I put some logging into our dedup logic, I see the following:

+ is UTF-8 and + is US-ASCII
- is UTF-8 and - is US-ASCII
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
 is ASCII-8BIT and  is UTF-8
method is ASCII-8BIT and method is UTF-8
 is ASCII-8BIT and  is UTF-8
a is UTF-8 and a is US-ASCII
a is UTF-8 and a is US-ASCII
false

And if I disable booting our Ruby kernel altogether with -Xdebug.parser, it works properly. So somewhere in the libraries that load at boot this particular string is getting deduped as UTF-8.

headius · 2018-05-25T19:40:00Z

Confirmed the script version works if you use something other than 'a'.

All of the strings that report encoding mismatches above are 7-bit ASCII (and several are blank) so I'm not sure why they're getting in here as ASCII-8BIT and UTF-8. I did notice that RubyString.hashCode does not consider encoding if the strings are 7-bit and the encoding is ASCII-compatible, which allows a US-ASCII 'a' to retrieve the UTF-8 'a' from the cache.

We could modify the hashCode logic to always consider encoding, and then we'd have no conflicts; but we might have multiple copies of some 7-bit strings in there.

headius · 2018-05-25T20:00:08Z

MRI does appear to allow multiple entries for the same bytes but different encoding to live in the fstring cache.

a1 = "a".force_encoding("ASCII-8BIT")
a2 = "a".force_encoding("UTF-8")
a1p = (-a1)
a2p = (-a2)
p a1p.object_id == (-a1.dup).object_id # true
p a2p.object_id == (-a2.dup).object_id # true
p a1p.encoding # ASCII-8BIT
p a2p.encoding # UTF-8

headius · 2018-05-25T20:10:09Z

MRI does not consider the encoding for a 7-bit string's hash code, but for the fstring cache it does specify a comparison function that rejects mismatched encodings. So they'll go in the same bucket but both encodings are preserved. I'll see about making the same change in JRuby.

headius · 2018-05-25T20:31:22Z

For reference, here's the comparison function MRI uses for fstring cache:

static int
fstring_cmp(VALUE a, VALUE b)
{
    long alen, blen;
    const char *aptr, *bptr;
    RSTRING_GETMEM(a, aptr, alen);
    RSTRING_GETMEM(b, bptr, blen);
    return (alen != blen ||
            ENCODING_GET(a) != ENCODING_GET(b) ||
            memcmp(aptr, bptr, alen) != 0);
}

headius · 2018-05-25T20:35:10Z

Hmm, this may be a bit trickier than I'd hoped.

All the JDK collections, including the ConcurrentHashMap we're using, default to calling equals and there's no way to replace that without rewriting the collection. RubyString#equals does not consider encoding (in MRI, only if 7bit; in JRuby it seems to always be ignored if the bytes are the same).

So there's a bit more gymnastics required here.

CRuby's fstring cache, used for frozen string deduplication, uses slightly different equalitye logic than the default equality for strings. Specifically, if two strings have the same 7bit ascii bytes, but two different ascii-compatible strings, the strings are still considered to be equal. But for the fstring cache, you can register the same 7-bit string with different ascii-compatible encodings and they both live in the cache. In JRuby, we use a standard JDK collection, ConcurrentHashMap, that always uses the standard equals() method that works like normal String equality as described above. We are forced to use a wrapper, both for storage and for lookup. This patch introduces that wrapper, and also introduces a thread-local caching mechanism to reduce the cost of looking up deduplicated strings in the cache. The additional overhead of the cache is: * The wrapper object and indirecting through it. * Constructing a wrapper object (only when the previous lookup added a new wrapper or this is the first lookup). * Accessing a previously cached wrapper via a thread-local (inverse of the above conditions) In the typical case, where the requested string has already been deduplicated, the system should eventually get to a point where there's no new entries being added, and the cached wrapper is used every time. There may be more overhead at startup to create the wrappers. There may be few calls to lookup a string that do not trigger a new entry, since most language constructs (FrozenString in IR for example) save the result, making the cache perhaps unnecessary.

This spec passes fine on JRuby since jruby/jruby#5190.

headius mentioned this issue May 26, 2018

Consider encoding when deduplicating strings. #5198

Merged

enebo closed this as completed in #5198 Sep 19, 2018

enebo added this to the JRuby 9.2.1.0 milestone Nov 1, 2018

headius added a commit to headius/sequel that referenced this issue Mar 17, 2019

Remove JRuby check from excluded dedup spec.

4545d06

This spec passes fine on JRuby since jruby/jruby#5190.

headius mentioned this issue Mar 17, 2019

Remove JRuby check from excluded dedup spec. jeremyevans/sequel#1602

Closed

jeremyevans pushed a commit to jeremyevans/sequel that referenced this issue Mar 17, 2019

Remove JRuby check from excluded dedup spec.

8fe03b3

This spec passes fine on JRuby since jruby/jruby#5190.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

JRuby 9.2.0.0 string deduplication via String#-@ does not work the same as CRuby #5190

JRuby 9.2.0.0 string deduplication via String#-@ does not work the same as CRuby #5190

jeremyevans commented May 24, 2018

headius commented May 24, 2018

headius commented May 24, 2018

enebo commented May 24, 2018

jeremyevans commented May 24, 2018

enebo commented May 24, 2018

enebo commented May 24, 2018

jeremyevans commented May 24, 2018

headius commented May 25, 2018

jeremyevans commented May 25, 2018

jeremyevans commented May 25, 2018

enebo commented May 25, 2018

jeremyevans commented May 25, 2018

enebo commented May 25, 2018

jeremyevans commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018

JRuby 9.2.0.0 string deduplication via String#-@ does not work the same as CRuby #5190

JRuby 9.2.0.0 string deduplication via String#-@ does not work the same as CRuby #5190

Comments

jeremyevans commented May 24, 2018

Environment

Expected Behavior

Actual Behavior

headius commented May 24, 2018

headius commented May 24, 2018

enebo commented May 24, 2018

jeremyevans commented May 24, 2018

enebo commented May 24, 2018

enebo commented May 24, 2018

jeremyevans commented May 24, 2018

headius commented May 25, 2018

jeremyevans commented May 25, 2018

jeremyevans commented May 25, 2018

enebo commented May 25, 2018

jeremyevans commented May 25, 2018

enebo commented May 25, 2018

jeremyevans commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018

headius commented May 25, 2018