-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Hash should not default to 11 if it is specified #4033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hash should not default to 11 if it is specified #4033
Conversation
My guess is that it was set up this way since the difference in size between 1-10 is negligible, and creating a hash as small as under 10 elements can yield terrible performance. @asterite do you want to chime in? |
The capacity must always match values in |
However, it's OK to raise an ArgumentError if the capacity is negative. And I guess we should cap the initialize capacity to one of the prime values (right now that's not the case), though I'm not sure that's really needed. |
Cap the initial capacity to one of the primes, or actually try to match it to the closest largest one? For instance, if a user creates a hash with initial capacity 12, should we actually create a hash with size 19? |
I don't know... A hash literal will set the initial capacity to the number of elements it has. If the initial capacity is less than what is passed there will be a rehash of all the values, which is not good. So I think using a larger value is better. |
Ok, then we should:
Agree? |
Happy to do this, but want to add: I think we should at least warn the user about this rule with a message. Or, I can add documentation for it. |
I'd avoid a warning on runtime for that, but I agree on adding it to the documentation; however, I'd wait to define whether we want to just keep the current behaviour, or go for always using the next largest prime. |
We went through the Ruby implementation, and it seems that when a capacity is passed in, Ruby looks for the largest power of 2 smaller than the requested value, and uses the prime generated from that power of 2 as the actual capacity. See here for the code. I'd suggest we copy that implementation, and add a notification about it on the docs as @dylandrop suggested; as well as raise the exception if the capacity is negative. @dylandrop this is actually exceeding the original scope of this PR. Would you like to add this implementation here, or would you rather close this PR and we open a new issue for you or someone else to work on it? |
@spalladino No I can take care of it -- seems reasonable. |
Awesome, thanks a lot!!
…On Feb 16, 2017 4:38 PM, "Dylan Drop" ***@***.***> wrote:
@spalladino <https://github.com/spalladino> No I can take care of it --
seems reasonable.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4033 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAaOJJTiF1QUDEdrqYUGiSWhXVm26F44ks5rdKXLgaJpZM4L__y->
.
|
@spalladino @asterite Alright, we're good for another look. |
spec/std/hash_spec.cr
Outdated
it "creates with initial capacity" do | ||
it "sets the initial capacity when not passed" do | ||
hash = Hash(Int32, Int32).new | ||
hash.@buckets_size.should eq(PRIMES[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PRIMES
-> Hash::PRIMES
src/hash.cr
Outdated
@@ -834,6 +864,50 @@ class Hash(K, V) | |||
find_entry_in_bucket entry, key | |||
end | |||
|
|||
# Table of prime numbers 2^n+a, 2<=n<=30. | |||
PRIMES = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can safely be a Tuple
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or a StaticArray; all elements are the same type (Int32).
Int32.static_array(8 + 3, 16 + 3, ...)
c23201d
to
49ecf43
Compare
49ecf43
to
cd48b2b
Compare
Interesting, it seems from the Travis build that using this approach leads to some very large memory allocation, causing the process to crash. I'm wondering if I should just use the size the caller passed in if the passed size exceeds a certain value (say Or, perhaps there's some bug in my code I'm not seeing. Output at the bottom of the Travis job:
|
src/hash.cr
Outdated
initial_capacity = initial_capacity.to_i | ||
raise ArgumentError.new("Hash capacity must be positive") if initial_capacity <= 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it should raise if initial_capacity == 0
The user might want to create a Hash with initial capacity of 0. Even if it seems pointless, there might be some cases where it is usefull!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
For now I thinking about another implementation of Ruby 2.4 increases its Hash performance for 10-200% with this change (see https://bugs.ruby-lang.org/issues/12142#note-142 for details). |
@akzhan I'm sure a PR using open addressing would be accepted, it's been discussed a few times for sure. However that's pretty irrelevant to this current issue. |
Just asked @funny-falcon for some details (its implementation looks cleaner for me). |
I'd be happy to make PR with open-addressing implementation. |
@funny-falcon Feel free to publish clean idea implemented in C++14, it's easy to convert good C++ code to Crystal. |
See this blog post when implementing a hash table: http://www.idryman.org/blog/2017/05/03/writing-a-damn-fast-hash-table-with-tiny-memory-footprints/ |
Also make sure to read the HN comments and the references at the end of the article. There's a load of good resources. |
@RX14 , I will try to reimplement same approach, used for Ruby. Perhaps it is not the best. I think, there could be concurrent PR with different implementations, so you can implement other one. @akzhan , there is interest to use C++ :-) I don't know Crystal well yet, so it will be good reason to learn more. |
@funny-falcon Please mention #4557 as base issue for pull request, thanks. |
This has been fixed in #8017 |
Hash#new
has some strange behavior of setting the initial capacity, ignoring what is specified, if the specified value is less than 11. Instead, it should let the user set the capacity to whatever they want, but complain if a non-positive value is specified.