Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change computation of hash value. #4621

Closed
wants to merge 5 commits into from

Conversation

funny-falcon
Copy link
Contributor

To protect against Hash DoS, change the way hash value is computed.
Class|Struct should define method def hashme(hasher) and call
hasher << @ivar inside.

As an option, for speed, and for backward compatibility, def hash
still could be implemented. It will be used for Hash of matched type.
Thread#hash and Enum#hash is implemented as unseeded cause they are
used before StdHasher @@seed is initialized.

But it is better to implement def hashme(hasher).

StdHasher is default hasher that uses hashme and it is used as default
seeded hasher. It also implements fasthash used for fast hash of
primitive types, and unseeded for Enums.

Fixes #4578
Prerequisite for #4557

Copy link
Member

@RX14 RX14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a decent idea in general but the implementation is very weak. You should also modify def_hash

src/object.cr Outdated
end

# Protocol method for safe hashing
abstract def hashme(hasher)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be renamed to just hash(hasher) and let method overloading do the work. The precedent is in how to_s and to_json work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RX14 , how could I distinguish def hash(hasher) from def hash with responds_to?(:hash) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@funny-falcon just implement def hash(hasher) in base classes to not use responds_to?.

src/object.cr Outdated
@@ -58,13 +58,19 @@ class Object
nil
end

# Generates an `Int` hash value for this object.
# Generates an `Int` hash value for this object (usually `UInt32`)
#
# This method must have the property that `a == b` implies `a.hash == b.hash`.
#
# The hash value is used along with `==` by the `Hash` class to determine if two objects
# reference the same hash key.
abstract def hash
Copy link
Member

@RX14 RX14 Jun 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line, it's redundant because of the method definition below.

elsif self == to_u64
to_u64
else
(self % 0x7fffffffffffffe7_i64).to_i64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this constant deserves a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is prime number.

@@ -267,8 +267,22 @@ struct BigInt < Int
to_s io
end

private def hash_normalize
if self == to_i64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These first two conditions are weird to me, can you explain what's going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment to Object#hash:

This method must have the property that a == b implies a.hash == b.hash

$ bin/crystal eval 'require "big"; b=BigInt.new(-1000); p b.to_i64, b.to_u64'
Using compiled compiler at `.build/crystal'
-1000
1000

So I can't use just to_u64. Nor I can just use to_i64 cause of numbers between INT64MAX and UINT64_MAX. Although, probably I can use self & 0xffffffffffffffff_u64, but it will be tweaky.

src/enum.cr Outdated
@@ -275,8 +275,16 @@ struct Enum
end

# Returns a hash value. This is the hash of the underlying value.
# Enum hash is exception from random seeding, cause several Hashes are
# initialized on very early stages of bootstrap, and StdHasher seed is not
# filled yet.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to work around this by adjusting the require order in the prelude.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm not too convenient in Crystal.
Issues are cause of:

  • enum Signal used in src/event/signal_handler.cr
  • enum LLVM::Attribute defined in src/llvm/events/cr (don't know, where it is used)
  • class Thread is also used very early in src/thread.cr : @@threads = Set(Thread).new

If you point me how to reorder prelude to fill StdHasher @@seed before this hashes constructed, it will be great.

It could be fixed also with defining special-cased Hash, if def hashkey(key); key.hash; end will be defined, and then overloaded in subclass specially for this use cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is src/prelude.cr, but I doesn't check for.

c = normalize_byte(c)
h = 31 * h + c
v = (v<<8) | normalize_byte(c).to_u32
if v >= 0x1000000_u32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also has no comment or explanation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it just "take four chars in a row". I could use counter here. Just was lasy to introduce another one variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write a comment, please :)

src/stdhasher.cr Outdated
struct StdHasher
@@seed = StaticArray(UInt32, 7).new{|i| 0_u32}
def self.init
raise "#{@@seed[0]} #{@@seed[1]}" unless @@seed[0] == 0_u32 && @@seed[1] == 0_u32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just sanity check remained from debugging. I'll remove it.

src/stdhasher.cr Outdated
# Also it has specialized methods for primitive keys with different seeds.
struct StdHasher
@@seed = StaticArray(UInt32, 7).new{|i| 0_u32}
def self.init
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just make this part of the @@seed initializer and avoid having an init method? Do you see any other init methods called from the prelude?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works with just filling in StdHasher body. init could be removed.

end

def initialize
@h = Pointer(Impl).malloc(1, Impl.new(@@seed[0], @@seed[1]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly what's the point of this it just seems to me you've replicated a class exactly by using 2 structs...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not!!!! See StdHasher.build method: it can allocate Impl on a stack. There is no way to allocate class on a stack. Compiler is NOT smart enough!!!

If you want, I may remove StdHasher#initalize() at all, so there will be no malloc in this file.
I don't use it. I put it just for completeness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1517 : @asterite thought compiler could be smart enough. But compiler is not smart enough yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need pointers (nor allocate HEAP memory). As long as you access @h (not through an accessor) you can mutate @h properties directly:

struct Hasher
  struct Impl
    property v : Int32
    
    def initialize(@v)
    end
  end
  
  def initialize(seed)
    @h = Impl.new(seed)
  end
  
  def <<(v)
    @h.v += v
  end
  
  def digest
    @h.v
  end
end

hasher = Hasher.new(0)
hasher << 1
hasher << 4
p hasher.digest  # => 5

See https://play.crystal-lang.org/#/r/299v

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initializers could then be reduced to:

def self.new
  new(@@seed[0], @@seed[1])
end

protected def initialize(a, b)
  @h = Impl.new(a, b)
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method chaining is nice sugar but not really necessary. Your suggestions for alternate names would all be without chaining as well. I don't see that << would imply chainability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway class/struct creators will have knowledge to return Hasher in hash, for example.

Moreover it's just hasher << whaaaaat pattern.

And compiler will help with type stricture.

Copy link
Contributor

@akzhan akzhan Jun 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, @RX14 already said about def_hash, looks like 80+% authors just will use this macro.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akzhan, cause it is easy to break with chaining:

# good
def hash(hasher)
  hasher << @a << @b
end
# wrong, but compiles without error
def hash(hasher)
  hasher << @a << @b
  hasher << @c
end
# corrected
def hash(hasher)
  hasher = hasher << @a << @b
  hasher << @c
end

So, resolution:

  • default hasher is a struct without pointers (for safety),
  • it causes that def hasher have to return hasher,
  • << is used, but without chaining (to reduce usage errors),
    Is it correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

src/stdhasher.cr Outdated
self
end

def <<(v : Int8|Int16|Int32|UInt8|UInt16|UInt32)
Copy link
Member

@RX14 RX14 Jun 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use crystal tool format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, use bin/crystal tool format, it's changed in master branch.

@kostya
Copy link
Contributor

kostya commented Jun 26, 2017

why hashme? but not def hash(hasher)

@funny-falcon
Copy link
Contributor Author

@kostya , how could I distinguish def hash(hasher) from def hash with responds_to?(:hash) ?

@kostya
Copy link
Contributor

kostya commented Jun 26, 2017

is responds_to? needed?, may be implement default hash it on Object?

@funny-falcon
Copy link
Contributor Author

@RX14, @kostya, @akzhan you are all right: it is possible to go with hash(hasher) instead of hasme(hasher), and responds_to? isn't need.

I'll put corrected version later today or tommorow.

@@ -143,6 +143,10 @@ struct BigRational < Number
to_f64.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok (and for othet same notes).

hash_normalize.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

@@ -39,6 +39,10 @@ struct BigFloat < Float
to_f64.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

src/float.cr Outdated
hash_normalize.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

src/float.cr Outdated
hash_normalize.hash
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

StdHasher.hashit self
end

def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

src/indexable.cr Outdated
def hash
reduce(31 * size) do |memo, elem|
31 * memo + elem.hash
def hashme(h)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher should better be spelled out.

@@ -4,8 +4,8 @@ struct XML::Namespace
def initialize(@document : Node, @ns : LibXML::NS*)
end

def hash
object_id
def hashme(hasher)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this perhaps be generalized? Like if a type has method object_id, use this for calculating the hash, whether it is a reference or not.

src/float.cr Outdated
@@ -148,8 +148,17 @@ struct Float32
Printer.print(self, io)
end

private def hash_normalize
t = to_i32
t == self ? t : unsafe_as(Int32)
Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bad idea, but I know, that current implementation is also weird.

Will be replaced with #4581 I hope.

Copy link
Contributor Author

@funny-falcon funny-falcon Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is neccessary cause of definition of hash function: "if a==b then hash(a)==hash(b)".

Without it Hash(Int32|Float32, V) is broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does #4581 fix hashing of Float32|Float64 ? If so, then it could be better to apply.
Do you want I include it to my patch, or they will be applied in some order?

Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@funny-falcon Float hashes in pyhash manner are identical to integer values if no fraction part exists. Feel free to include this, of course.

p [1, 2, 3, 4, 1.0, 2.0_f64, 3.0_f32, 4.0].map &.hash
[1, 2, 3, 4, 1, 2, 3, 4]

p [1, 2, 3, 4, 1.0, 2.0_f64, 3.0_f32, 4.1].map &.hash
[1, 2, 3, 4, 1, 2, 3, 1932733648]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, there is one exception - -1 value. Will fix now. There is one line fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean more 4.25_f64.hash == 4.25_f32 ? If not, then my variant is simpler. If yes, then it is better to use more correct variant.

(note: I believe 4.1_f64 != 4.1_f32 in any case, but I could be mistaken).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.5_f32.hash == 0.5_f64.hash # => true
0.75_f32.hash == 0.75_f64.hash # => true
4.1_f32.hash == 4.1_f64.hash # => false - because values differ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Ok, I will include that function in next version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be strict:

puts "%.20f" % 4.1
puts "%.20f" % 4.1_f32

4.09999999999999964473
4.09999990463256835938

src/hash.cr Outdated
hash = size
def hashme(hasher)
hasher << size
dgst = hasher.digest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need no shortcuts. digest sounds better than dgst.

@@ -262,11 +262,6 @@ struct JSON::Any
end

# :nodoc:
def hash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beause Struct already provides correct hash(hasher).

src/int.cr Outdated
@@ -317,7 +317,15 @@ struct Int
end

def hash
self
if self.is_a? Primitive
Copy link
Contributor

@ysbaddaden ysbaddaden Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case could Int not be a primitive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BigInt.

@@ -0,0 +1,248 @@
require "secure_random"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SecureRandom is stdlib, not corelib, it shouldn't be injected into prelude (indireclty).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But without "secure_random" no safety for hash table.

What do you suggest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Crystal::System::Random directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ysbaddaden, big thanks!!!! I'll do.

@funny-falcon
Copy link
Contributor Author

btw, it will be ok if I'll push with force this PR? I prefer to keep history clean.
But if it is not acceptable here, I will put "appending" commits.

@akzhan
Copy link
Contributor

akzhan commented Jun 26, 2017

Appending of commits is OK. Result will be squashed and reworded anyway.

src/stdhasher.cr Outdated
end

def <<(v : UInt64)
high = (v>>32).to_u32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, Crystal@master has new value.unsafe_as(UInt32) method. AFAIR it's faster for these cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it? Here is integer to integer conversion. Isn't it compiler primitive already?

Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, No.

Crystal compiler is lower than 1.0, so lot of optimizations doesn't exist yet.

You have i64.i32 conversion here.

Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wrong :) Just found these primitives :)

src/stdhasher.cr Outdated
if high != 0
self << high
end
self << v.to_u32
Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this code isn't compatible with #4581. There is 31/62bit divider.

Copy link
Contributor

@akzhan akzhan Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have commented Hashing::Type but you may introduce it as alias StdHasher::Type = Int32.

So please update it, you need these lines for 64bit integers.

  private HASH_BITS     =     31 # sizeof(StdHasher::Type) >= 8 ? 61 : 31

    exp = exp >= 0 ? exp % HASH_BITS : HASH_BITS - 1 - ((-1 - exp) % HASH_BITS)
    x = ((x << exp) & HASH_MODULUS) | x >> (HASH_BITS - exp)

@akzhan
Copy link
Contributor

akzhan commented Jun 27, 2017

I suppose that StdHasher should have something like StdHasher::ValueType alias to declare hash value type.

To protect against Hash DoS, change the way hash value is computed.
Class|Struct should define method `def hashme(hasher)` and call
`hasher << @ivar` inside.

As an option, for speed, and for backward compatibility, `def hash`
still could be implemented. It will be used for Hash of matched type.
`Thread#hash` and `Enum#hash` is implemented as unseeded cause they are
 used before `StdHasher @@seed` is initialized.

But it is better to implement `def hashme(hasher)`.

StdHasher is default hasher that uses `hashme` and it is used as default
seeded hasher. It also implements `fasthash` used for fast hash of
primitive types, and `unseeded` for `Enums`.

Fixes crystal-lang#4578
Prerequisite for crystal-lang#4557
though Enum#hash could be re-introduced later for performance.
doesn't include crystal-lang#4581 yet (number normalization)
@funny-falcon
Copy link
Contributor Author

I pushed intermediate changes.
Biggest remaining issue is inclusion of number normalization.
Also open issue is dependency on SecureRandom and inclusion into prelude.

bytesize = name.bytesize
slice = name.to_slice
hasher.raw(bytesize)
while c + 4 < bytesize
Copy link
Contributor

@ysbaddaden ysbaddaden Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use idiomatic crystal (no impact on performance):

slice = name.to_slice

0.step(to: slice.size, by: 4) do |c|
  # ...
end

slice.each do |c|
  # ...
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slice looks to allocate Array. Am I mistaken?
Computing hash should not do heap allocation.

Where is step defined? I didn't find it in 'int.cr' (while upto were there) and gave up.
Ah, I see, it is in number.cr. My bad. I'll fix it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Slice is (Pointer+Size) struct, value type. no heap allocation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a slice is basically a pointer with a size (so accesses are bounded). No more, no less. Slices can allocate HEAP memory with Slice(T).new(size) or bounded to view any memory (HEAP or stack) with Slice(T).new(pointer, size), which is the case here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, i were asleep a bit. I thought about each_slice. Btw, can Slice have overloaded each_slice?

a = rotl(a, 13)
b ^= a + s
b *= 9
aa.value, bb.value = a, b
Copy link
Contributor

@ysbaddaden ysbaddaden Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is idiomatic C, not Crystal. Instead of manually assigning to locals then reassign to pointers, pass by value directly:

def permute(v, s, a, b)
  {a, b}
end

a, b = permute(v, s, a, b)

Or make it a macro, and assume s, a and b variables:

private macro permute(v)
end

Anyway, LLVM will inline the permute call anyway.

Copy link
Contributor

@ysbaddaden ysbaddaden Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that, why even pass a and b instead of using @a and @b directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use local variables instead of instance variables to reduce redundant stores to instance variables. I looked to assembly and looks like Crystal/LLVM is not aggressive enought in removing this stores. Especially it is visible in a loop of Bytes permutation.

I use pointers, cause LLVM does bad when to 32bit variables are packed into struct/tuple - it combines them into 64bit argument/result, and function inlining is not able to unsplit them. It is better said, LLVM lacks 'local var struct decomposition' (i don't remember exact name) ie when struct is function local (maybe after inlining), its fields become independent variables. GCC has such optimisation.

Macro is a great advise! Thanks! I'll try it.

current_thread = @@threads.each_key do |thread|
if LibC.pthread_equal(thread.id, current_thread_id) != 0
break thread
end
Copy link
Contributor

@ysbaddaden ysbaddaden Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why use a custom Hash instead of Set?
  2. Why drop idiomatic find for a custom iteration/break?

It's faster isn't an acceptable response: stdlib should be able to use idiomatic Crystal with no impact on performance, (otherwise it means apps should do the same: custom over idiomatic). Here we want a Set and Set#find not a custom value-less Hash and custom loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to overload Hash#hash_key, cause this file runs before StdHasher @@seed initialized. In previous version I overloaded Thread#hash(hasher) to use unseeded permutation.

Which way is better? I'll bind to whatever you prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I don't mind return unseeded Enum#hash. Doubtfully it could be attack vector. Then I could revert Hash subclass in signal_handler either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overloading #hash(hasher) is simpler and avoids additional types. If it achieves the same, let's go for it. I doubt it could be an attack vector for Thread or Signal, and they're only supposed to be used internally.

Note: did you really mean Enum#hash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I mean Enum#hash, but I can go with Signal#hash instead. Whatever you prefer.

end

def digest(seed)
a, b = @a, @b
Copy link
Contributor

@ysbaddaden ysbaddaden Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need local values, use @a and @b

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want digest to not mutate hasher. So, that is not optimisation, but semantic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I misread, and thought they were reassigned later. My bad.

u = buf.to_unsafe
a, b = @a, @b
i = bsz.unsafe_div 4
while i > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, idiomatic Crystal:

i.downto(1) do
  # ...
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll try it.

bsz = buf.size
v = bsz.to_u32 << 24
u = buf.to_unsafe
a, b = @a, @b
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need local values, use @a and @b
@ysbaddaden it's here where a & b are reassigned at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, but I looked into generated assembly, and I'm sure enough local variables are needed in Bytes permutation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's here where a & b are reassigned at the end.

you looked to different function.

@@ -73,12 +73,12 @@ struct Struct
# Returns a hash value based on this struct's instance variables hash values.
#
# See also: `Object#hash`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old documentation of hash is detached from anything, you should remove it if you don't need it.

@@ -18,8 +18,10 @@ struct Symbol
# Generates an `Int32` hash value for this symbol.
#
# See also: `Object#hash`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, old documentation is detached.

def hash
object_id
def hash(hasher)
hasher << object_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you use hasher << object_id where in XML::Node#hash and XML::NodeSet#hash you use hasher.raw object_id, why is it different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ops, missed it when changed there.
it doesn't affect correctness, but for performance it is better to change to raw.
thanks.

@akzhan
Copy link
Contributor

akzhan commented Jun 29, 2017

@funny-falcon Looks like Hasher abstraction is pretty but superfluous. We may go without Hasher. Feel free to change the rules.

@funny-falcon
Copy link
Contributor Author

@akzhan, I don't agree hasher is superfluous. I think it is not pretty but important.
It simplifies more than complicates.
Without hasher it will be hard to achieve "safety", because it encapsulates non-linear dependency on seed.
It makes dependency between individual values in a sequence and result hash less obvious.

Note: current "h = h*33 + next.hash" is also kind of hasher, but very simple and unsafe.

Hasher is also an optimization:

  • without hasher, every value in a sequence have to calculate whole hash with finalizer.
    Hash-function's finalizer usually more expensive than intermediate mixing.
    So hasher saves calls to finalizers for each field of struct, tuple, each element of array.

Most hard thing is handling of Hash#hash, cause it doesn't depend on order of iteration. It complicates hasher protocol :-(

@funny-falcon
Copy link
Contributor Author

Concerning optimization: on the other hand, hasher introduces dependency between elements, and there could be fast function that is "only finalizer", and it could be calculated using instruction parallelism.
It will be certainly less safe, but almost as fast.

But I like hasher idea, and will try to finish this PR in that way.

@akzhan
Copy link
Contributor

akzhan commented Jun 29, 2017

Ok, got it :)

@Sija
Copy link
Contributor

Sija commented Jun 30, 2017

It's gr8 there's so much technical discussion about the topic, but merit aside, it became IMO already way too confusing with comments scattered all around the code, multiple force-pushes, word abbreviations and one letter variable names.

I'd suggest opening a new PR with cleaned up implementation and API based on the comments above, with one letter variable names changed to something readable without digging into the code and abbreviations (StdHasher) replaced with meaningful names.

@asterite
Copy link
Member

asterite commented Jul 1, 2017

If Hasher is a struct, pass it by value to def hash(hasher) and let that method return the hasher. Then you get back the modified struct. Please don't use pointers and obscure tricks.

@funny-falcon
Copy link
Contributor Author

@Sija , ok I'll open new PR and then close this.

@asterite , yes, I'm refactoring code in that way.

@funny-falcon
Copy link
Contributor Author

I opened fresh PR #4675 , and closing it as @Sija suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants