Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider encoding when deduplicating strings. #5198

Merged
merged 1 commit into from Sep 19, 2018

Conversation

headius
Copy link
Member

@headius headius commented May 26, 2018

Fixes #5190.

CRuby's fstring cache, used for frozen string deduplication, uses
slightly different equalitye logic than the default equality for
strings. Specifically, if two strings have the same 7bit ascii
bytes, but two different ascii-compatible strings, the strings are
still considered to be equal. But for the fstring cache, you can
register the same 7-bit string with different ascii-compatible
encodings and they both live in the cache.

In JRuby, we use a standard JDK collection, ConcurrentHashMap,
that always uses the standard equals() method that works like
normal String equality as described above. We are forced to use
a wrapper, both for storage and for lookup. This patch introduces
that wrapper, and also introduces a thread-local caching mechanism
to reduce the cost of looking up deduplicated strings in the
cache.

The additional overhead of the cache is:

  • The wrapper object and indirecting through it.
  • Constructing a wrapper object (only when the previous lookup
    added a new wrapper or this is the first lookup).
  • Accessing a previously cached wrapper via a thread-local
    (inverse of the above conditions)

In the typical case, where the requested string has already been
deduplicated, the system should eventually get to a point where
there's no new entries being added, and the cached wrapper is used
every time. There may be more overhead at startup to create the
wrappers. There may be few calls to lookup a string that do not
trigger a new entry, since most language constructs (FrozenString
in IR for example) save the result, making the cache perhaps
unnecessary.

CRuby's fstring cache, used for frozen string deduplication, uses
slightly different equalitye logic than the default equality for
strings. Specifically, if two strings have the same 7bit ascii
bytes, but two different ascii-compatible strings, the strings are
still considered to be equal. But for the fstring cache, you can
register the same 7-bit string with different ascii-compatible
encodings and they both live in the cache.

In JRuby, we use a standard JDK collection, ConcurrentHashMap,
that always uses the standard equals() method that works like
normal String equality as described above. We are forced to use
a wrapper, both for storage and for lookup. This patch introduces
that wrapper, and also introduces a thread-local caching mechanism
to reduce the cost of looking up deduplicated strings in the
cache.

The additional overhead of the cache is:

* The wrapper object and indirecting through it.
* Constructing a wrapper object (only when the previous lookup
  added a new wrapper or this is the first lookup).
* Accessing a previously cached wrapper via a thread-local
  (inverse of the above conditions)

In the typical case, where the requested string has already been
deduplicated, the system should eventually get to a point where
there's no new entries being added, and the cached wrapper is used
every time. There may be more overhead at startup to create the
wrappers. There may be few calls to lookup a string that do not
trigger a new entry, since most language constructs (FrozenString
in IR for example) save the result, making the cache perhaps
unnecessary.
@headius headius added this to the JRuby 9.2.1.0 milestone May 26, 2018
@headius headius requested review from enebo and kares May 26, 2018 01:21
@headius headius changed the title Consider encoding when deduplicating strings. Fixes #5190. Consider encoding when deduplicating strings. May 26, 2018
@kares
Copy link
Member

kares commented May 26, 2018

LGTM

@kares
Copy link
Member

kares commented May 26, 2018

But for the fstring cache, you can register the same 7-bit string with different ascii-compatible encodings and they both live in the cache.

just a random thought - how many single byte ascii-compatible encodings are there to manage? (a few)
so what about separate stores for encodings ... guess it wouldn't be as performant due double lookup?

@enebo enebo merged commit e7ab40f into jruby:master Sep 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants