Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative hash function #5146

Merged
merged 21 commits into from Oct 30, 2017
Merged

Alternative hash function #5146

merged 21 commits into from Oct 30, 2017

Conversation

funny-falcon
Copy link
Contributor

This is multiplicative function resilent to HashDos attack.
(Claim is bounded with restrictions that neither seed nor hash-value is exposed to attacker).

It provides good performance, and at least certainly not slower than current unsafe hasher's hash function.

This PR is alternative to #5104
(though I don't claim this function is alternative to SipHash in all possible SipHash's applications).

it is multiplicative function resilent to HashDos given following
restriction is satisfied:
- neither seed nor hash-value is exposed to atacker.
end

@[NoInline]
def string(value : Bytes)
Copy link
Contributor

@akzhan akzhan Oct 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be named def slice(value : Bytes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for the case, when slice of bytes is a string.
Ok, maybe it could be def bytes(value : Bytes) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, def bytes(value : Bytes) or just def bytes(value) since other methods don't specify a type either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ysbaddaden , should I make Slice(UInt8)#hash to use hasher.bytes(self) as you did in 7733d3e ?

@asterite
Copy link
Member

Can we include a reference or link to somewhere that explains the algorithm?

@funny-falcon
Copy link
Contributor Author

There is derived from my https://github.com/funny-falcon/funny_hash/blob/master/README.md

There is no formal description, though.

It is two multiplicative hash combined that uses different multiplicative constants, and ensures that every block bit affects at least 20 state bits. Guarantee cames from fact both 32bit halves of 64bit block affects are mixed into low 32bits of state halves before multiplication.
Resistance to HashDos came from mixing other half of block into high 32bit of state halves before multiplication, and state halves are rotated between blocks. And from unknown secure random initial state.

Looking at https://en.m.wikipedia.org/wiki/Universal_hashing I see some similarity with hashing vectors.

Lets be clear. I will not spend my time in debates again. I have no formal verification, I never wrote and published any paper on this.
If you are not going to accept the algorithm, please, close this PR now.

@akzhan
Copy link
Contributor

akzhan commented Oct 20, 2017

Looks like https://github.com/funny-falcon/funny_hash -

Simple multiplicative hash function like Murmur3 but a bit safer.

It is safer cause it uses state twice larger than block size, feeds block twice at time and every bit affects at least 10/20 bits of a state. So it is impossible to revert change without introducing other change, and I think it is hardly possible to make seed independent collisions (which is the main defect of Murmur2/Murmur3).

So it is safe for use in a internal hash table implementations when nor seed nor hashsum is known to attacker.

(but it is certainly not cryptographic: I think attack with choosen plaintext and known hashsum will succeed with little effort)

Because it like Murmur3 I provide some implementations that use MurMur3 (but Murmur3 has some known defects while Funny hash - not yet): It has been adopted into a number of open-source projects, most notably libstdc++ (ver 4.6), nginx (ver 1.0.1), Rubinius, libmemcached (the C driver for Memcached), npm (nodejs package manager), maatkit, Hadoop, Kyoto Cabinet, RaptorDB, OlegDB, Cassandra, Solr, vowpal wabbit, Elasticsearch, and Guava.

@akzhan
Copy link
Contributor

akzhan commented Oct 20, 2017

I vote for this hashing function just because it makes us faster than Go, Rust etc. :)

@funny-falcon
Copy link
Contributor Author

it makes us faster than Go, Rust etc. :)

Not faster than Go, actually. Their aes-ni function is super fast on long strings, and probably doesn't suck on short.
Their non-aes-ni based function has quite close performance to mine for short strings, and faster on long strings (cause they hashes 4 blocks at once).

https://github.com/golang/go/blob/master/src/runtime/hash64.go

But their function is longer to code, uses larger state for long strings, and I don't think non-aes variant is safer than my function.

Note, they have specialization for hashing single Int32 and single Int64 (memhash32, memhash64).

They are more free in choosing functions, and their specializations, cause Go doesn't give any ability to provide custom hash functions. All hash calculation is controlled by runtime, and could not be overriden.

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Oct 20, 2017

I hashed a dictionary of 216 555 English words:

  • crystal 0.23.1 (UInt64): 5 collisions (1:1)
  • crystal master (UInt64): 0 collisions (!!!)
  • siphash-1-3 (UInt64): 0 collisions
  • halfsiphash-1-3 (UInt64): 0 collisions
  • halfsiphash-1-3 (UInt32): 4 collisions (1:1)
  • this branch (Int32): 9216 collisions (1:455)
  • this branch (Int32): 0 collisions

What!?

EDIT I got the list of english words from this outstanding answer: https://softwareengineering.stackexchange.com/a/145633

EDIT 2: that's better :)

@funny-falcon
Copy link
Contributor Author

@ysbaddaden , thank you. I've pushed fix.
Checked with https://gist.github.com/funny-falcon/0f61703123f56e35b8f4c9a66c8f13cc

@funny-falcon
Copy link
Contributor Author

@ysbaddaden , and I've pushed your commit with specialization for Slice(UInt8)

@ysbaddaden
Copy link
Contributor

Down to zero collisions, that's better!

I'm trying to check distribution, now :)

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Oct 20, 2017

Distribution results for English dictionary of 216 555 words (see a previous comment) :

0.23.1 master siphash funny
hasher-0 23 1 hasher-master hasher-siphash hasher-funny

I can't notice differences between siphash-1-3 and funny_hash 👍

@akzhan
Copy link
Contributor

akzhan commented Oct 20, 2017

LGTM and I want it in master branch.

It is prerequisite for fast hashes.

@ysbaddaden
Copy link
Contributor

I miss specs for the funny hash algorithm against static vectors, just to make sure we don't reintroduce a bug later (such as the collisions above).

@asterite
Copy link
Member

Also, a link to https://github.com/funny-falcon/funny_hash inside the implementation (not in public docs)

@asterite
Copy link
Member

@ysbaddaden By the way, really cool graphs! How did you generate them? (and what code did you use?)

# and then combines them with addition. It greatly reduce
# possibility of state deduction.
#
# Note, it provides good protection from HashDos iif:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iif -> if

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or write it without abbreviations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@ysbaddaden
Copy link
Contributor

For fun, I checked distribution a little more, with 200 000 random UUIDs and 42 614 numbers (american zip codes). I got 8 collisions for UUIDs on Crystal 0.23.1, no collisions otherwise.

SipHash and FunnyHash feel similar in their distribution.

UUIDs (overall distribution is good because /dev/urandom has good distribution, I guess):

0.23.1 master siphash funny
uuid-0 23 1 uuid-master uuid-siphash uuid-funny

Numbers (the result is so biased on master hashes end up at two x,y spots):

0.23.1 master siphash funny
zip-0 23 1 zip-master zip-siphash zip-funny

Conclusion: switching to funny hash or siphash will really help avoid concentration in real world scenarios, when keys are poorly distributed, not just mitigating hash flooding attacks.

@asterite I used SDL, see https://gist.github.com/ysbaddaden/fccf87d4ca22082c7d645b9f547a9713

@funny-falcon
Copy link
Contributor Author

Changed tests.

[hasher, hasher1, hasher2, hasher12]
.map(&.result)
.combinations(2)
.each do |(a, b)|
Copy link
Contributor

@Sija Sija Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong indentation here, and in other cases below. Also, I think it'd look better as:

[hasher, hasher1, hasher2, hasher12]
  .map(&.result)
  .combinations(2)
  .each { |(a, b)| a.should_not eq(b) }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You you sure, @Sija?

➜  crystal git:(althasher) ✗ bin/crystal tool format src spec
Using compiled compiler at `.build/crystal'
➜  crystal git:(althasher) ✗ gd
➜  crystal git:(althasher) ✗

Copy link
Contributor

@Sija Sija Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akzhan yes I am, block body SHOULD be indented, yet it's not. Perhaps it's a formatter bug?

Copy link
Contributor

@akzhan akzhan Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But idea with {} blocks on (should) looks better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crystal tool format made this indentation.

Using curly braces here is semantically wrong: assertion is an action, not computation or transformation.

But if you insist, I will change to curly braces. Should I?

Copy link
Contributor

@akzhan akzhan Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@funny-falcon lets stay things as is. they automatically fixed with fixes on the formatter.

/cc @makenowjust

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

block body SHOULD be indented, yet it's not. Perhaps it's a formatter bug?

It is indented relative to statement level. It is not a bug, just current rules. May be rule should be changed, but which way?

May be, method calls should be indented by 4 spaces?

[hasher, hasher1, hasher2, hasher12]
    .map(&.result)
    .combinations(2)
    .each do |(a, b)|
  a.should_not eq(b)
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luislavena IMO there are 2 separate issues here: formatter bug and bad readability of do ... end version of block bodies - first one should be handled in separate PR, yet the second one results in badly readable code and could be fixed, so why not do it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is matter of multi-line vs one-line block bodies, not the "actionable" intent,

There are different points of view on this question. I prefer curly braces only for "one-line non-actionable".

But, ok, I've changed to curly braces.

Copy link
Member

@asterite asterite Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PLEASE, let's not discuss such small syntax details. If the specs pass it's OK. This only makes the contributor feel tired and annoyed about stuff that doesn't really matter.

There's no formatter bug, the expression starts at indent 0 and the block is at indent 2.

@Sija Sija mentioned this pull request Oct 27, 2017
@RX14
Copy link
Contributor

RX14 commented Oct 27, 2017

I'm not sure why we've suddenly abstracted the implementation from Crystal::Hasher. Did anyone ask for this? What's the point when the implementation isn't replacable? It just seems that now Crystal::Hasher is a pointless class which does nothing but forward stuff to the hasher?

@asterite
Copy link
Member

@RX14 You are right, just now I see that change. There should just be Crystal::Hasher, wth :nodoc: so it's hidden from users, but still testable, and that type should have the @a and @b instance vars. The hasher is not meant to be specified or overriden by the user. It's the only hasher that we'll ever use.

@luislavena
Copy link
Contributor

@RX14, @asterite perhaps the wording of this comment was the issue?

@RX14
Copy link
Contributor

RX14 commented Oct 27, 2017

It's fine to leave just Crystal::Hasher with the implementation, and have 2 "parts" to the test suite - ones which test the hash properties, and ones which use test vectors.

@asterite
Copy link
Member

@luislavena Probably.

I think we have an issue where someone sends a PR and gets many comments saying "Please do this, please do that" and the sender just says "OK, OK" and does it, without first having a good discussion about whether that should be done or not. Who decides it? It's pretty hard. I see this happen in many PRs already.

I don't know what the solution is. Maybe not rush doing changes before we finish agreeing on them.

@akzhan
Copy link
Contributor

akzhan commented Oct 27, 2017

Next step is near - FAST hashes :)

@funny-falcon
Copy link
Contributor Author

Lets border line:

  • tests are ok?
  • should I change back to monolit Crystal::Hasher (ie without separate Crystal::Hasher::FunnyHash64) ?
  • is there anything else to do with this PR?

I rush cause I'm not always understand clearly what is asked to do, but I agree with some part of.
So, I try to shorten iteration to get reaction, and fix it later.

Another reason, why I rush, is cause discussion could last forever. I also like to discuss a lot.
With actual changes it is better seen, which version looks better. Therefore, it is easier to come to agreeingment.

@asterite
Copy link
Member

My opinion:

  • tests looks very good now
  • we should change Crystal::Hasher to be a monolith, no need for a separate Crystal::Hasher::FunnyHash64

Once that's done I think we can merge this. If others agree please 👍 on this comment. Once we have enough 👍 here I guess you can do that and then we can merge 😄 🎉

Just to prevent occasional disclosure of state, hide it in `def inspect`.
It will not prevent intentional disclosure, though, because state can be
easily inspected, and `to_s` could be redefined.
@funny-falcon
Copy link
Contributor Author

Ok, returned to monolith.

I've added another one small change, excuse me for that:

  • I've hidden internal state in inspect function to prevent occasional state disclosure.

B
end

alias THasher = Crystal::Hasher
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestHasher plz :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

@Sija Sija Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gr8, thx!


def nil
@a += @b
@b += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier you use LFSR here, it is not wanted anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both state variable should change in some way.

Combination of addition on @A and lsfr on @b gives period almost 2**128. That is great, but unnecessary.

Current version is "quadrating probing", and gives period 2**64. I think, it is just enough.

Even 2**32 could be enough for intended use case, because if someone triggers 2**32 calls to nil.hash(hasher),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if some triggers 2**32 calls to nil.hash(hasher), they are already successful at DoS-ing.

@RX14 RX14 added this to the Next milestone Oct 30, 2017
@RX14
Copy link
Contributor

RX14 commented Oct 30, 2017

🎉 🎉 🎉

@RX14 RX14 merged commit fbab3a3 into crystal-lang:master Oct 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants