General Hash Functions In Python

Mon Jul 17 18:57:07 EDT 2006

On 17/07/2006 10:13 PM, Arash Partow wrote:
> John Machin wrote:
>> Who is likely to bother? In timbot we trust. Have you read the comments
>> at the start of Objects/dictobject.c?
>>
> No I haven't probably wont be anytime soon,

Perhaps you should, if you profess an interest in hashed lookup -- it 
gives some interesting commentary on the second aspect: collision 
handling. What matters is the *total* time to return an answer.

> as far as time, well
> people interested, as is how started my original port, would be more
> than willing to try/assess the routines for sets of strings that they
> wish to hash etc.

> this site may help explain plus has some code
> snippets that may help you understand what I mean.
> 
> http://www.partow.net/programming/hashfunctions/index.html

That JSHAsh allegedly written by "Justin Sobel": by coincidence, there's 
a Melbourne academic named Justin Zobel who has done something amazingly 
similar:

http://goanna.cs.rmit.edu.au/~hugh/zhw-ipl.html

Searching for "Justin Sobel" did lead me to a Russian website which 
apart from repeating your typo/reado/whatevero did propose (and test) 
some more hash functions.

http://vak.ru/doku.php/proj/hash/efficiency-en

In particular look at the "rot13" function which was right up near the 
front as far as number of collisions goes, and which would appear (my 
guess based on reading the source) to be very fast (with the right 
compiler (e.g. gcc 4)).

> 
> 
>> A few more questions:
>>
>> Have you tested that these functions produce the same output (apart from
>> the 31-bit thing) as the originals that you worked from? The reason for
>> asking is that Python unlike C doesn't lose any bits in the <<
>> operation; if this is followed by a >> you may be shifting some unwanted
>>   1-bits back in again.
>>
> 
> Some of them do others don't (not really important unless you are
> trying to be compatible with other implementations which this is
> not),

I would have thought it important especially in the case of well-known 
functions whose properties have been discussed in the literature that 
you should not publish a version that gives a different answer, without 
noting that fact prominently.

> I am aware of python not truncating/wrapping of values under
> various operations, I believe its a nice little side effect from
> python which gives more bits to play with as long as you don't
> truncate them as I have.
> 
> 
>> Talking about not losing bits: For your 36-byte example input, the
>> SDBMHash (with its << 16) is up to about 566 bits before you truncate it
>> to 31. A little over the top, perhaps. Maybe not the fastest way of
>> doing it.
>>
> Possibly, do you have a better solution I'm very keen to learn...

You can't avoid using Python longs if you want to simulate unsigned 
32-bit arithmetic. However judicious truncation can be used to stop the 
longs becoming longer and slower. General rules:
1. Avoid exceeding 32 bits where possible. E.g. instead of
     hash <<= 8
do
     hash = (hash & 0xFFFFFF) << 8
2. Where unavoidable (e.g. hash *= constant), use hash &= 0xFFFFFFFF to 
chop back to 32 bits, once per iteration.

> 
> 
>> What is the purpose of the calls to long() in PJWHash?
>>
> trying to cast to long, looking at it now its rather superfluous.
> 
> 
>> And the $64K question: What is the quintessential difference between
>> PJWHash and ELFHash?
>>
> Nothing, elf is essentially pjw, its just optimised for 32-bit systems
> in that the calculation for th's etc are static where has pjw
> required sizeof to calc the th's

You've found a C compiler where sizeof(unsigned int) is not static i.e. 
calculated by the compiler at compile time???

> which i couldn't find a way of doing,
> so i fudged it in the hope that maybe sometime in the future a work
> around of sorts could be developed.

Google is a wonderful thing:

http://users.physik.tu-muenchen.de/gammel/matpack/html/LibDoc/Strings/strings.html
"""
Thanks to Josh Bloch (jbloch at eng.sun.com) who also informed me about 
another fault that is found in Aho, Sethi and Ullman's book: The line 
with h ^= (g >> 28) now replaces the original h ^= (g >> 24). According 
to a correspondence of Josh Bloch with Ravi Sethi this correction will 
be made in the next edition of the book. Including these two changes 
this hash function is now comparable to Vo's, Torek's and WAIS's hash 
functions.
"""

(1) Whoops! (2) Vo? Torek? WAIS? Could these be possible additions to 
your website?

http://www.math.columbia.edu/~bayer/annote/root/root.html
"""
Peter J. Weinberger hash function; see e.g. 21st Century Compilers, by 
Alfred V. Aho, Ravi Sethi, Monica Lam, Jeffrey D. Ullman, ISBN 0321131436.

Hash unsigned X into H, using the temporary variable G. G and H are 
unsigned variables; X may be an expression. G is nonzero e.g. 92% of 
time, so a conditional expression would be slower. As noted by Josh 
Bloch, 28 is the correct replacement for the frequent misprint 24.

#define HASH(G,H,X) ( H <<= 4, H += (X), G = H & 0xf0000000, H ^= G >> 
28, H ^= G )
"""

Cheers,
John