Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use Crystal::Hasher and openaddressing in StringPool #5000

Merged
merged 1 commit into from Sep 21, 2017

Conversation

akzhan
Copy link
Contributor

@akzhan akzhan commented Sep 18, 2017

cause StringPool is used in json decoding, it is important to have it safe.

I suppose that Crystal::Hasher should be safe somedays.

Openaddressing also may increase performance a bit (#4675 (comment)).

Extracted from #4675.

/cc @RX14

@asterite
Copy link
Member

Why not just change StringPool#hash(str, len) method?

I probably missed the original discussion where everything else had to change.

@akzhan
Copy link
Contributor Author

akzhan commented Sep 18, 2017

@asterite modern hashes prefer open-addressing due to it's locality (CPU cache friendly hashes).

So here is first application of open-addressing in Crystal. Later Hash will be reworked too (I hope just like Ruby 2.4 hashes). Please see #4557 for details.

@values = Pointer(String).malloc(@capacity, "")
@size = 0

0.upto(old_capacity - 1) do |i|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old_capacity.times do |i|

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follows you.

@akzhan
Copy link
Contributor Author

akzhan commented Sep 18, 2017

@asterite Just simple bench:

require "benchmark"
require "./new_string_pool"
require "./old_string_pool"

class Pools
  class_property new_pool = NewStringPool.new
  class_property old_pool = OldStringPool.new
end

Benchmark.ips do |x|
  x.report("insert new string pool") { test_with(Pools.new_pool) }
  x.report("insert old string pool") { test_with(Pools.old_pool) }
end

Benchmark.ips do |x|
  x.report("get new string pool") { test_with(Pools.new_pool) }
  x.report("get old string pool") { test_with(Pools.old_pool) }
end

def test_with(pool)
  1_000_000.times do |num|
    pool.get("hello #{num}")
  end
end
➜  bench_string_pool.cr git:(master) ✗ ../crystal/bin/crystal src/bench.cr --release
Using compiled compiler at `.build/crystal'
insert new string pool   4.96  (201.65ms) (± 4.36%)       fastest
insert old string pool   2.98  (335.58ms) (± 2.34%)  1.66× slower
get new string pool   4.93  (202.81ms) (± 4.25%)       fastest
get old string pool   2.97  (336.19ms) (± 2.50%)  1.66× slower

Take a note that OldStringPool uses Crystal::Hasher too:

  private def hash(str, len)
    hasher = Crystal::Hasher.new
    hasher = str.to_slice(len).hash(hasher)
    hasher.result
  end

Test updated because get never be benchmarked, fixed.

if entry
return entry
mask = (@capacity - 1).to_u32
index, d = hash & mask, 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you split this into two assignments? Otherwise it looks a bit cryptic.

Also, could you explain here why & mask is needed? (I know, because later we see the comment about this, but maybe the algorithm can be explained somewhere?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will split.

Mask is just one of methods to get index to probe between 0 and capacity-1. It's fast and OK.

@asterite
Copy link
Member

I like this but I don't quite understand the algorithm. Is it using Quadratic probing?

We can merge this, but a detailed explanation of the algorithm must be attached somewhere in the file.

I'm pretty impressed by the performance!

@akzhan
Copy link
Contributor Author

akzhan commented Sep 19, 2017

@asterite it's just linear probing afair (was renamed some variables to show it). @funny-falcon is author, can describe it better, but it's simple.

I don't know what to describe. It just choose index and again and again :)

@asterite
Copy link
Member

:-D

But why incrementing d by one and probing over and over guarantees that the key will be found? This is what I don't understand. I'm pretty sure there's some math theorem about this, but without a comment explaining that, a link or something, it will become impossible to maintain if nobody knows how and why it works.

@akzhan
Copy link
Contributor Author

akzhan commented Sep 19, 2017

I understood you and will describe this (later, i'm going asleep) :)

In short - specified part of hash table always empty, see rehash etc.

So linear probing will always take empty element (it is not +1 everytime (but +N), but it is guaranteed).

@akzhan
Copy link
Contributor Author

akzhan commented Sep 19, 2017

For now, will replace with trivial linear probing :)

@funny-falcon
Copy link
Contributor

It is called "quadratic probing"
https://en.wikipedia.org/wiki/Quadratic_probing

For m = 2^n, a good choice for the constants are c1 = c2 = 1/2, as the values of h(k,i) for i in [0,m-1] are all distinct. This leads to a probe sequence of h(k),h(k)+1,h(k)+3,h(k)+6,... where the values increase by 1, 2, 3, ...

It suffers less from secondary clustering (compared to linear probing). And still provides good cache locality, because first several probes are nearby. Google Dense Hash uses same quadratic probing.

@akzhan why did you change to linear probing? Do you doubt my sanity?

@akzhan
Copy link
Contributor Author

akzhan commented Sep 20, 2017

I was wrong (will return this)

@akzhan akzhan force-pushed the stringpool-openaddressing branch 2 times, most recently from 817a42a to be20e47 Compare September 20, 2017 08:06
@akzhan
Copy link
Contributor Author

akzhan commented Sep 20, 2017

Was rolled back and described quadratic probing implementation details.

@akzhan akzhan force-pushed the stringpool-openaddressing branch 3 times, most recently from 30e7bb6 to 2f26d2e Compare September 20, 2017 12:20
@funny-falcon
Copy link
Contributor

Good catch on size recalculation on rehash!

I wanted to argue about 64bit hash, but decided not to.

@akzhan
Copy link
Contributor Author

akzhan commented Sep 20, 2017

Crystal::Hasher returns UInt64 hash values. Pointer.[] also operates with 64bit now.

@akzhan
Copy link
Contributor Author

akzhan commented Sep 20, 2017

@asterite ready to review :)

@asterite
Copy link
Member

I'm still looking for a proof that quadratic probing where the capacity if a power of two hits all buckets. Anyone can find one? Maybe a proof can be done by induction, though I haven't done that in a long time now...

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Sep 20, 2017

@funny-falcon
Copy link
Contributor

http://goog-sparsehash.sourceforge.net/doc/implementation.html

Quadratic probing, using the triangular numbers, avoids the clumping while keeping cache coherency in the common case. As long as the table size is a power of 2, the quadratic-probing method described above will explore every table element if necessary, to find a good place to insert.

https://fgiesen.wordpress.com/2015/02/22/triangular-numbers-mod-2n/

You can find lots of places on the web mentioning that the resulting probe sequence will visit every element of a power-of-2 sized hash table exactly once; more precisely, the function f(k) = T_k \mod 2^m is a permutation on { 0, 1, \dots, 2^m-1 }. But it’s pretty hard to find a proof; everybody seems to refer back to Knuth, and in The Art of Compute Programming, Volume 3, Chapter 6.4, the proof appears as an exercise (number 20 in the Second Edition).

@akzhan
Copy link
Contributor Author

akzhan commented Sep 20, 2017

Explanation related to SparseHash has been added.

@asterite
Copy link
Member

I wouldn't link to SparseHash because this is not SparseHash :-)

Maybe link to https://fgiesen.wordpress.com/2015/02/22/triangular-numbers-mod-2n/

But all these comments are private to the implementation, not public documentation. Nobody needs to know how StringPool is implemented internally, as long as its efficient. Ruby's Hash implementation doesn't have a comment explaining all of the internals; it's documented in the C code, but not publicly visible. Please do the same here (just put it below the class StringPool line).

…ingPool

cause StringPool is used in json decoding, it is important to have it safe.
@akzhan
Copy link
Contributor Author

akzhan commented Sep 20, 2017

@asterite Done

@asterite asterite merged commit 9ec8327 into crystal-lang:master Sep 21, 2017
@akzhan
Copy link
Contributor Author

akzhan commented Sep 21, 2017

Glad to see this part merged, and will step further!

Thanks to @funny-falcon for its great work!

@akzhan akzhan deleted the stringpool-openaddressing branch September 21, 2017 20:58
@asterite
Copy link
Member

Indeed, thank you @funny-falcon and @akzhan 🙇

@mverzilli mverzilli added this to the Next milestone Sep 21, 2017
@vladfaust
Copy link
Contributor

Nice 5k get, BTW! 👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants