[Python-Dev] Hash collision security issue (now public)

Sat Jan 7 18:57:10 CET 2012

Am 07.01.2012 12:02, schrieb Stefan Behnel:
> Wouldn't Bob Jenkins' "lookup3" hash function fit in here? After all, it's
> portable, known to provide a very good distribution for different string
> values and is generally fast on both 32 and 64 bit architectures.
> 
> http://burtleburtle.net/bob/c/lookup3.c
> 
> The analysis is here:
> 
> http://burtleburtle.net/bob/hash/doobs.html
> 
> It seems that there's also support for generating 64bit hash values
> (actually 2x32bits) efficiently.

This thread as well as the ticket is getting so long that people barely
have a chance to catch up ...

Guido has stated that he doesn't want a completely new hash algorithm
for Python 2.x to 3.2. A new hash algorithm for 3.3 needs a PEP, too.

I've done some experiments with FNV and Murmur3. With Murmur3 128bit
I've seen some minor speed improvements on 64bit platforms. At first I
was surprised but it makes sense. Murmur3 operates on uint32_t blocks
while Python's hash algorithm iterates over 1 byte (bytes, ASCII), 2
bytes (USC2) or 4 bytes (USC4) types. Since most strings are either
ASCII or UCS2, the inner loop of the current algorithm is more tight.

> Admittedly, this may require some adaptation for the PEP393 unicode memory
> layout in order to produce identical hashes for all three representations
> if they represent the same content. So it's not a drop-in replacement.

Is this condition required and implemented at the moment?

Christian