Consider encoding when deduplicating strings. #5198
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #5190.
CRuby's fstring cache, used for frozen string deduplication, uses
slightly different equalitye logic than the default equality for
strings. Specifically, if two strings have the same 7bit ascii
bytes, but two different ascii-compatible strings, the strings are
still considered to be equal. But for the fstring cache, you can
register the same 7-bit string with different ascii-compatible
encodings and they both live in the cache.
In JRuby, we use a standard JDK collection, ConcurrentHashMap,
that always uses the standard equals() method that works like
normal String equality as described above. We are forced to use
a wrapper, both for storage and for lookup. This patch introduces
that wrapper, and also introduces a thread-local caching mechanism
to reduce the cost of looking up deduplicated strings in the
cache.
The additional overhead of the cache is:
added a new wrapper or this is the first lookup).
(inverse of the above conditions)
In the typical case, where the requested string has already been
deduplicated, the system should eventually get to a point where
there's no new entries being added, and the cached wrapper is used
every time. There may be more overhead at startup to create the
wrappers. There may be few calls to lookup a string that do not
trigger a new entry, since most language constructs (FrozenString
in IR for example) save the result, making the cache perhaps
unnecessary.