Open addressing algorithm for RubyHash #5215
private final int[] internalCopyBins() {
    int[] newBins = new int[bins.length];
    for (int i = 0; i < bins.length; i++) {
These can now be a simple System.arraycopy(bins, 0, newBins, 0, bins.length).
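For illustration, the simplified copy might look like the sketch below. This is hypothetical code, assuming a class with an int[] bins field as in the PR; CopyBinsDemo and its field values are made up for the example.

```java
public class CopyBinsDemo {
    // Hypothetical stand-in for RubyHash's bins field
    static int[] bins = {3, -1, 0, -2};

    // The suggested simplification: one arraycopy instead of a manual loop
    static int[] internalCopyBins() {
        int[] newBins = new int[bins.length];
        System.arraycopy(bins, 0, newBins, 0, bins.length);
        return newBins;
    }

    public static void main(String[] args) {
        int[] copy = internalCopyBins();
        // The copy is equal in content but independent of the original
        System.out.println(java.util.Arrays.equals(copy, bins)); // prints "true"
    }
}
```

(`bins.clone()` would be equally idiomatic here; arraycopy matches the reviewer's suggestion.)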
}
if (entry != null && entry.isLive()) {
if (entry == null) break;
So there are no longer null buckets for old (removed) entries, is that correct?
Currently we mark them with isLive() = false and iterate to the first null value (the current end of the entries array).
But the better approach is probably to set them to null when deleting and iterate up to size. I will implement this in the next round; this was just quick & dirty to get ruby booting again ...
In the bins array we have three different values now:
- -1: empty, stop searching
- -2: deleted, jump over it
- >= 0: index of the hash / key
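As a rough sketch of that sentinel scheme (hypothetical code, not the PR's actual implementation: the class name BinsDemo is made up, and resizing, duplicate keys, and negative hashCode handling are ignored):

```java
import java.util.Arrays;

// bins holds -1 (empty, stop searching), -2 (deleted, jump over it),
// or an index >= 0 into a parallel entries/keys array.
public class BinsDemo {
    static final int EMPTY = -1;
    static final int DELETED = -2;

    int[] bins;
    Object[] keys;
    int size;

    BinsDemo(int capacity) {
        bins = new int[capacity];
        Arrays.fill(bins, EMPTY);
        keys = new Object[capacity];
    }

    // Returns the bin slot holding this key, or -1 if absent
    int bucketOf(Object key) {
        int i = Math.abs(key.hashCode()) % bins.length;
        while (true) {
            int idx = bins[i];
            if (idx == EMPTY) return -1;                 // stop searching
            if (idx != DELETED && keys[idx].equals(key)) return i;
            i = (i + 1) % bins.length;                   // probe past DELETED
        }
    }

    // Insert-only sketch: assumes key is not already present
    void put(Object key) {
        int i = Math.abs(key.hashCode()) % bins.length;
        while (bins[i] >= 0) i = (i + 1) % bins.length;
        keys[size] = key;
        bins[i] = size++;
    }

    void delete(Object key) {
        int i = bucketOf(key);
        if (i >= 0) bins[i] = DELETED;                   // mark, don't null out
    }

    public static void main(String[] args) {
        BinsDemo d = new BinsDemo(8);
        d.put("a");
        d.put("b");
        d.delete("a");
        // Probing correctly skips the DELETED slot and still finds "b"
        System.out.println(d.bucketOf("a") == -1 && d.bucketOf("b") >= 0); // prints "true"
    }
}
```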
I modified your benchmark to use benchmark-ips and simple loops, since plain benchmark tends not to exercise methods well and block-based loops have much more overhead than simple loops.

require 'benchmark/ips'

N = 50_000
small = 1000
tmp = {}
Benchmark.ips do |x|
  x.report("Add:") { i = 0; while i < N; i+=1; tmp[small] = true; end }
  x.report("Fetch:") { i = 0; while i < N; i+=1; tmp[small]; end }
end

Here are the numbers with standard options, invokedynamic, and invokedynamic on GraalVM. Before your work:
And after your work:
Looks like a good win so far!
I did a size comparison. For a Hash with 100_000 entries mapping 0-99_000 to 1, the old impl is 11.847MB in memory and the new version is 11.382MB. If we subtract some of the common objects:
That gives us old and new sizes of 2.547MB and 2.082MB, a savings of 0.465MB or over 18% on the base RubyHash structure. That's pretty good! So based on that I thought I'd try some small hashes. Subtracting the size of the symbol and fixnum, this hash is 166 bytes on the new impl and 52 bytes on the old impl, so this may be justification for packing simple hashes down a bit tighter.
@headius thanks for the benchmark 👍
Ok! I already had some code for a LinearSearch approach without the bins array. I will try to merge these two implementations tonight. I will also look into getting rid of the RubyHashEntry elements completely (discussed yesterday with @lopex) and extracting it into a Util class. Stay tuned 🎸
If we can get rid of the RubyHashEntry objects that would be a tremendous memory savings.
entries = new RubyHashEntry[MRI_INITIAL_CAPACITY];
bins = new int[MRI_INITIAL_CAPACITY * 2];
entries = new IRubyObject[MRI_INITIAL_CAPACITY * 2];
bins = new int[MRI_INITIAL_CAPACITY * 4];
This might not make sense. The entries array is now twice as big but also fills up twice as fast (because we store key and value in separate rows now). So I think we could still go with MRI_INITIAL_CAPACITY * 2 for the bins array.
I pushed an initial version of getting rid of the RubyHashEntries. For now, I don't store them anymore but use them as temporary objects. We now store the keys and values in separate rows in the entries array. Keys are stored in the even rows and values in the odd rows. This adds quite some complexity from my POV 😢 I haven't benchmarked it yet as I think there is a bug in the script you posted (you never increase the small variable, do you?). Would be great if you could have a look at it again and run your benchmarks @headius. I also tried to move this to a Util class but I have some open questions:
However, if we want to implement the LinearSearchHash for small hashes, I think it would be very beneficial if we could extract these two classes...
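To illustrate the even/odd row layout, here is a hypothetical linear-scan sketch (not the actual JRuby code; InterleavedDemo is a made-up name and resizing is omitted). Key i lives at index 2*i and its value at 2*i+1:

```java
// Keys at even indices, values at odd indices, in one Object[].
public class InterleavedDemo {
    Object[] entries = new Object[16];  // room for 8 key/value pairs
    int size;                           // number of pairs

    void put(Object key, Object value) {
        for (int i = 0; i < size; i++) {
            if (entries[2 * i].equals(key)) {   // even row: key
                entries[2 * i + 1] = value;     // odd row: value
                return;
            }
        }
        entries[2 * size] = key;
        entries[2 * size + 1] = value;
        size++;
    }

    Object get(Object key) {
        for (int i = 0; i < size; i++) {
            if (entries[2 * i].equals(key)) return entries[2 * i + 1];
        }
        return null;
    }

    public static void main(String[] args) {
        InterleavedDemo h = new InterleavedDemo();
        h.put("a", 1);
        h.put("b", 2);
        h.put("a", 3);  // overwrites the odd row, not a new pair
        System.out.println(h.get("a") + " " + h.get("b")); // prints "3 2"
    }
}
```

The complexity Chris mentions comes from exactly this index arithmetic: every loop over entries has to step by two and keep the even/odd convention straight.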
@headius I now added a linear search approach for small hashes (< 16 elements).
@headius any update regarding your benchmarking script or how I should proceed?
@ChrisBr he is on vacation this week. You could fix the bench and generate some numbers to see if removing the extra types helped (which of course means running the fixed benchmark with the original RubyHashEntry, without linear search). Memory we can probably calculate manually. We would largely remove 3.7M - field size.
@enebo okay! Here you go: This is the script I used:
Benchmark of this PR
Benchmark open addressing with HashEntry elements, no linear search
Benchmark master
Summary: Looks like the linear search approach for 15 elements (the max size for using linear search in MRI) is quite a bit slower than hashing (open/closed addressing). We could try to use a smaller array (e.g. only using it for 8 elements) at the cost of adding one more resize. Also 1_000 / 10_000 elements seem to be slower without HashEntry objects. Maybe because the array is generally bigger? I don't have a version of linear search with HashEntry objects; @enebo do you think it makes sense to implement such a version for comparison? Thanks
@ChrisBr @enebo just looking at https://github.com/ruby/ruby/blob/trunk/st.c#L73 is enough to get rid of that hash cache value for me. If we get rid of it, we can store the pairs as values in the bin array.
We already got rid of the hash cache value, as we only store the keys and values in the entries array. The bin array contains the index of the location in the entries array. So you would suggest storing key & value in both the bins and entries arrays? Or getting rid of one array? Getting rid of one is probably not possible, as one array makes it possible to access the keys/values in insertion order (entries) and the other by hash index (bins). cc @lopex
So I changed the boundary for linear search to 8 elements as 16 seemed clearly too high. These are the results for 7 elements in the hash. Open Addressing without RubyHashEntry
Linear search without RubyHashEntry (this PR)
master
Adding seems to be around the same for master and linear search, but fetching is still faster with current master.
I also tried the fast skip bucket approach with calculating the hash on the fly (because we don't store it in the RubyHashEntry anymore), but it seems to be a lot slower: For linear search with 7 elements (here we end up calculating the hash twice, since linear search doesn't otherwise need the hash to find the bucket; we just iterate)
For open addressing with 16 & 100 elements (we need to calculate the hash only once, as we already have it calculated to find the bucket)
See #5215 (comment) for reference
Ok I'm back :-) Thoughts on what else we might try to speed up fetches? I think overall I want to go forward with this PR, since the memory sizing and locality will probably make up for a minor degradation in straight-line perf.
@headius great you're back, hope you had an amazing vacation 🌴 🌞 I just saw your conversation on IRC. For linear search, we currently store everything in the entries array (without a bin array) like this:
So we do not cache the hash anymore, which means we always do a full object comparison (eql) for every entry when we iterate over it (fetch, add, delete ...). If we cache the hash, we could do a quick skip and only check the hash. The idea I discussed with @lopex was now:
However, as we need to store objects, we need to store the hash as a boxed int. The other idea you mentioned could also be to introduce another hash array, something like this:
This would introduce some indirection. So far I'm about to implement the first approach (not done yet) to see the impact. We also still have the RubyHashEntry as intermediate objects, which we should get rid of. In general, do we want to merge everything altogether, or first only the linear search, or what is our preferred way to go forward?
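A minimal sketch of the first idea, hedged accordingly: hypothetical code (PackedTripleDemo is a made-up name, resizing is omitted) storing (hash, key, value) triples in a single Object[], with the cached hash boxed as an Integer and used as a fast skip before the full equals comparison:

```java
// Triples laid out flat: [hash0, key0, value0, hash1, key1, value1, ...]
public class PackedTripleDemo {
    Object[] entries = new Object[3 * 8];  // room for 8 triples
    int size;                              // number of triples

    void put(Object key, Object value) {
        int h = key.hashCode();
        for (int i = 0; i < size; i++) {
            int base = 3 * i;
            if ((int) entries[base] == h && entries[base + 1].equals(key)) {
                entries[base + 2] = value;
                return;
            }
        }
        int base = 3 * size++;
        entries[base] = h;           // autoboxed hash cache (the boxed int)
        entries[base + 1] = key;
        entries[base + 2] = value;
    }

    Object get(Object key) {
        int h = key.hashCode();
        for (int i = 0; i < size; i++) {
            int base = 3 * i;
            if ((int) entries[base] != h) continue;  // quick skip on hash only
            if (entries[base + 1].equals(key)) return entries[base + 2];
        }
        return null;
    }
}
```

For comparison, this is also the layout TruffleRuby uses for small hashes, as mentioned later in this thread.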
So I tried the hash caching from #5215 (comment). Result storing the hash in the entries array (boxing of the hash int value):
For comparison, this is the result without any hash caching:
edit: updated to make the result more clear
Just for information, TruffleRuby uses the same layout for small Hashes, an Object[] with (hash, key, value) per entry: https://github.com/oracle/truffleruby/blob/41797d45bd6dc95af17c8829a7dd0e9c4ef3a207/src/main/java/org/truffleruby/core/hash/PackedArrayStrategy.java#L30-L32
@eregon thanks, very interesting read!
@ChrisBr Today @enebo pointed out to me that these benchmarks are probably a best case scenario for hash, which might make the real cost of eliminating the cache a bit harder to see. If these were user-defined objects, for example, they'd have a nontrivial hash calculation cost plus dynamic dispatch plus a Ruby method, which would be much more overhead than, say, Float#hash. This is just more justification for doing the tagalong long[] for caching hash values, I guess.
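A sketch of that tagalong idea (hypothetical code, TagalongDemo is a made-up name, resizing omitted): a parallel long[] carries the cached hash for each entry unboxed, so the skip check is a cheap primitive comparison, and the potentially expensive hash/eql of a user-defined key only runs once per insert and on actual hash matches:

```java
// Parallel arrays: hashes[i] caches the hash of keys[i], unboxed.
public class TagalongDemo {
    Object[] keys = new Object[8];
    Object[] values = new Object[8];
    long[] hashes = new long[8];   // the "tagalong" hash cache
    int size;

    void put(Object key, Object value) {
        long h = key.hashCode();   // computed once, then cached
        for (int i = 0; i < size; i++) {
            if (hashes[i] == h && keys[i].equals(key)) {
                values[i] = value;
                return;
            }
        }
        hashes[size] = h;
        keys[size] = key;
        values[size++] = value;
    }

    Object get(Object key) {
        long h = key.hashCode();
        for (int i = 0; i < size; i++) {
            if (hashes[i] != h) continue;        // primitive skip, no equals()
            if (keys[i].equals(key)) return values[i];
        }
        return null;
    }
}
```

Compared with boxing the hash into the Object[] entries array, this avoids one allocation per entry at the cost of keeping a second array in sync.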
@headius @enebo I updated the PR and here are the latest numbers: 7 Entries
50
100
1_000
10_000
Summary
a0afbdc
to
09965f3
Compare
This is the script I used for benchmarking. Compared to the previous script, I switched to strings as keys, since the hash of a Fixnum is basically its value, which made for a really bad distribution in the hash.
Here are the results with Fixnum as keys: 7 entries
100 entries
1_000 entries
10_000 entries
As requested, here are some numbers with a Struct as key:
7 entries
100 entries
1_000 entries
10_000 entries
Ok I have some great news! I ran a little keyword argument benchmark against 9.2, 9.2.1, and 9.2.1 with this PR. The results are extremely promising:
So while these changes show a small perf drop on Hotspot C2 versus non-PR 9.2.1 (potentially within margin of error), they are in fact MUCH faster on Graal: around 3x, which is in turn 7x faster than MRI and 14x faster than 9.2 on Hotspot C2. If this is passing tests, I propose we land it now so it can bake for a while on master. I've pushed the PR contents to my improve_kwargs branch, which includes some tweaks for kwarg processing. @ChrisBr This is very exciting!
which basically just created a linked list. This approach is no longer possible with open addressing. As open addressing hashes are more compact anyway, we don't need it anymore.
Ok, @enebo and I see no reason not to go ahead and merge!
just adding a note to myself here: the work here never shipped and this PR got reverted before 9.2.1.0 was released. P.S. it would be nice to reconsider this for 9.4
@kares I would very much like to reintroduce open addressing. The main problem here is that it made an already concurrency-sensitive implementation (chained buckets) even more sensitive, since the number of state changes for any update increased and there's more opportunity for threads to step on each other. I believe that in order to safely support open addressing in Hash, we need to finally make Hash concurrency-safe. That means either locking (heavy-handed solution) or clever use of atomic updates (tricky to do without allocating a state object for every change, which would impact performance).
This is a very early stage of implementing open addressing for JRuby hashes suggested in #4708. The MRI issue is located here: https://bugs.ruby-lang.org/issues/12142.
After discussing with @headius, he suggested opening a PR to continue the discussion here. Please note that this is far from a mergeable state.
I tried a very simple benchmark and noticed a small performance improvement:
Gives me with this PR:
Current master:
(Intel Core i7-7820HQ, 32 GB RAM)
However, I would need some guidance on how to gather meaningful results.
Currently also missing is the approach of using only one bucket for small hashes. I already have this working in another branch; however, I did not see a (speed) improvement. It could be worth investigating from a memory standpoint, though.
Thanks for any feedback!