Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash should not default to 11 if it is specified #4033

Closed

Conversation

dylandrop
Copy link
Contributor

Hash#new has some strange behavior of setting the initial capacity, ignoring what is specified, if the specified value is less than 11. Instead, it should let the user set the capacity to whatever they want, but complain if a non-positive value is specified.

@spalladino
Copy link
Contributor

My guess is that it was set up this way since the difference in size between 1-10 is negligible, and creating a hash as small as under 10 elements can yield terrible performance. @asterite do you want to chime in?

@asterite
Copy link
Member

The capacity must always match values in HASH_PRIMES to get less hash collitions, if I remember correctly

@asterite
Copy link
Member

However, it's OK to raise an ArgumentError if the capacity is negative. And I guess we should cap the initialize capacity to one of the prime values (right now that's not the case), though I'm not sure that's really needed.

@spalladino
Copy link
Contributor

Cap the initial capacity to one of the primes, or actually try to match it to the closest largest one? For instance, if a user creates a hash with initial capacity 12, should we actually create a hash with size 19?

@asterite
Copy link
Member

I don't know... A hash literal will set the initial capacity to the number of elements it has. If the initial capacity is less than what is passed there will be a rehash of all the values, which is not good. So I think using a larger value is better.

@spalladino
Copy link
Contributor

Ok, then we should:

  • Create an issue to use the next largest prime instead of the actual inicial capacity set by the user
  • Kindly ask @dylandrop to keep the original if < 11 then 11, and submit this PR with the exception on negative capacity only

Agree?

@dylandrop
Copy link
Contributor Author

Kindly ask @dylandrop to keep the original if < 11 then 11, and submit this PR with the exception on negative capacity only

Happy to do this, but want to add: I think we should at least warn the user about this rule with a message. Or, I can add documentation for it.

@spalladino
Copy link
Contributor

I'd avoid a warning on runtime for that, but I agree on adding it to the documentation; however, I'd wait to define whether we want to just keep the current behaviour, or go for always using the next largest prime.

@spalladino
Copy link
Contributor

We went through the Ruby implementation, and it seems that when a capacity is passed in, Ruby looks for the largest power of 2 smaller than the requested value, and uses the prime generated from that power of 2 as the actual capacity. See here for the code.

I'd suggest we copy that implementation, and add a notification about it on the docs as @dylandrop suggested; as well as raise the exception if the capacity is negative.

@dylandrop this is actually exceeding the original scope of this PR. Would you like to add this implementation here, or would you rather close this PR and we open a new issue for you or someone else to work on it?

@dylandrop
Copy link
Contributor Author

@spalladino No I can take care of it -- seems reasonable.

@spalladino
Copy link
Contributor

spalladino commented Feb 16, 2017 via email

@dylandrop
Copy link
Contributor Author

@spalladino @asterite Alright, we're good for another look.

it "creates with initial capacity" do
it "sets the initial capacity when not passed" do
hash = Hash(Int32, Int32).new
hash.@buckets_size.should eq(PRIMES[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PRIMES -> Hash::PRIMES

src/hash.cr Outdated
@@ -834,6 +864,50 @@ class Hash(K, V)
find_entry_in_bucket entry, key
end

# Table of prime numbers 2^n+a, 2<=n<=30.
PRIMES = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can safely be a Tuple.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a StaticArray; all elements are the same type (Int32).

Int32.static_array(8 + 3, 16 + 3, ...)

@dylandrop dylandrop force-pushed the hash-should-not-default-to-11 branch from c23201d to 49ecf43 Compare March 4, 2017 18:12
@dylandrop dylandrop force-pushed the hash-should-not-default-to-11 branch from 49ecf43 to cd48b2b Compare March 4, 2017 21:21
@dylandrop
Copy link
Contributor Author

Interesting, it seems from the Travis build that using this approach leads to some very large memory allocation, causing the process to crash. I'm wondering if I should just use the size the caller passed in if the passed size exceeds a certain value (say 8192 + 27).

Or, perhaps there's some bug in my code I'm not seeing.

Output at the bottom of the Travis job:

GC Warning: Repeated allocation of very large block (appr. size 268439552):
	May lead to memory leak and poor performance
GC Warning: Repeated allocation of very large block (appr. size 268439552):
	May lead to memory leak and poor performance
GC Warning: Repeated allocation of very large block (appr. size 1073745920):
	May lead to memory leak and poor performance
GC Warning: Repeated allocation of very large block (appr. size 2147487744):
	May lead to memory leak and poor performance
GC Warning: Repeated allocation of very large block (appr. size 33558528):
	May lead to memory leak and poor performance
GC Warning: Repeated allocation of very large block (appr. size 67112960):
	May lead to memory leak and poor performance
GC Warning: Repeated allocation of very large block (appr. size 134221824):
	May lead to memory leak and poor performance
GC Warning: Repeated allocation of very large block (appr. size 268439552):
	May lead to memory leak and poor performance

src/hash.cr Outdated
initial_capacity = initial_capacity.to_i
raise ArgumentError.new("Hash capacity must be positive") if initial_capacity <= 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should raise if initial_capacity == 0

The user might want to create a Hash with initial capacity of 0. Even if it seems pointless, there might be some cases where it is usefull!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

@akzhan
Copy link
Contributor

akzhan commented Jun 12, 2017

For now I thinking about another implementation of Hash itself, using open addressing and powers of two.

Ruby 2.4 increases its Hash performance for 10-200% with this change (see https://bugs.ruby-lang.org/issues/12142#note-142 for details).

@RX14
Copy link
Contributor

RX14 commented Jun 12, 2017

@akzhan I'm sure a PR using open addressing would be accepted, it's been discussed a few times for sure. However that's pretty irrelevant to this current issue.

@akzhan
Copy link
Contributor

akzhan commented Jun 12, 2017

Just asked @funny-falcon for some details (its implementation looks cleaner for me).

@funny-falcon
Copy link
Contributor

I'd be happy to make PR with open-addressing implementation.
Though I'm not very organized man. So if i will not create PR within 3 weeks, better do it yourself.

@akzhan
Copy link
Contributor

akzhan commented Jun 12, 2017

@funny-falcon Feel free to publish clean idea implemented in C++14, it's easy to convert good C++ code to Crystal.

@RX14
Copy link
Contributor

RX14 commented Jun 12, 2017

See this blog post when implementing a hash table: http://www.idryman.org/blog/2017/05/03/writing-a-damn-fast-hash-table-with-tiny-memory-footprints/

@RX14
Copy link
Contributor

RX14 commented Jun 12, 2017

Also make sure to read the HN comments and the references at the end of the article. There's a load of good resources.

@funny-falcon
Copy link
Contributor

funny-falcon commented Jun 12, 2017

@RX14 , I will try to reimplement same approach, used for Ruby. Perhaps it is not the best.

I think, there could be concurrent PR with different implementations, so you can implement other one.

@akzhan , there is interest to use C++ :-) I don't know Crystal well yet, so it will be good reason to learn more.
Edit: there is no interest to use C++.

@akzhan
Copy link
Contributor

akzhan commented Jun 13, 2017

@funny-falcon Please mention #4557 as base issue for pull request, thanks.

@straight-shoota
Copy link
Member

This has been fixed in #8017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet