General Hash Functions In Python

Arash Partow partow at gmail.com
Mon Jul 17 23:24:13 EDT 2006


John Machin wrote:
> Perhaps you should, if you profess an interest in hashed lookup -- it
> gives some interesting commentary on the second aspect: collision
> handling. What matters is the *total* time to return an answer.
>
Time or algorithm complexity is merely one aspect of a hash function
design. there are many others.

> > as far as time, well
> > people interested, as is how started my original port, would be more
> > than willing to try/assess the routines for sets of strings that they
> > wish to hash etc.
>
> > this site may help explain plus has some code
> > snippets that may help you understand what I mean.
> >
> > http://www.partow.net/programming/hashfunctions/index.html
>
> That JSHAsh allegedly written by "Justin Sobel": by coincidence, there's
> a Melbourne academic named Justin Zobel who has done something amazingly
> similar:
>
> http://goanna.cs.rmit.edu.au/~hugh/zhw-ipl.html
>
Same guy, he was a lecturer during my uni days. As far as his surname
that is another issue altogether.


> Searching for "Justin Sobel" did lead me to a Russian website which
> apart from repeating your typo/reado/whatevero did propose (and test)
> some more hash functions.
>
> http://vak.ru/doku.php/proj/hash/efficiency-en
>
> In particular look at the "rot13" function which was right up near the
> front as far as number of collisions goes, and which would appear (my
> guess based on reading the source) to be very fast (with the right
> compiler (e.g. gcc 4)).
>
I've already spoken to the guy that did those measurements, there
are some flaws in the way he represents data which could lead one
to make inaccurate assumptions about the "quality" of the hash
functions. One of the many issues that i took up with him is that
he only used 1 set of data instead of having multiple sets and
aggregating the results. Whether or not he decides to re-do is
analysis is another issue altogether. TAOCP has a nice section
on how potential analysis can be done.

> I would have thought it important especially in the case of well-known
> functions whose properties have been discussed in the literature that
> you should not publish a version that gives a different answer, without
> noting that fact prominently.
>
True, the c versions work fine, i guess the python versions require
a bit more work. feel free to re-post modified versions :)


> You can't avoid using Python longs if you want to simulate unsigned
> 32-bit arithmetic. However judicious truncation can be used to stop the
> longs becoming longer and slower. General rules:
> 1. Avoid exceeding 32 bits where possible. E.g. instead of
>      hash <<= 8
> do
>      hash = (hash & 0xFFFFFF) << 8
> 2. Where unavoidable (e.g. hash *= constant), use hash &= 0xFFFFFFFF to
> chop back to 32 bits, once per iteration.
>
>

That is something I thought about, but I also considered, as
I mentioned before, the extra bits. The more bits you have to
avalanche with - (in a very general and oversimplified way)
the better your hash function "can" be.

> Thanks to Josh Bloch (jbloch at eng.sun.com) who also informed me about
> another fault that is found in Aho, Sethi and Ullman's book: The line
> with h ^= (g >> 28) now replaces the original h ^= (g >> 24). According
> to a correspondence of Josh Bloch with Ravi Sethi this correction will
> be made in the next edition of the book. Including these two changes
> this hash function is now comparable to Vo's, Torek's and WAIS's hash
> functions.
> """
>
> (1) Whoops! (2) Vo? Torek? WAIS? Could these be possible additions to
> your website?
>
Indeed! I had read about WAIS a long time ago, I'll be putting them up
very soon, thanks for the input.


> http://www.math.columbia.edu/~bayer/annote/root/root.html
> """
> Peter J. Weinberger hash function; see e.g. 21st Century Compilers, by
> Alfred V. Aho, Ravi Sethi, Monica Lam, Jeffrey D. Ullman, ISBN 0321131436.
>
> Hash unsigned X into H, using the temporary variable G. G and H are
> unsigned variables; X may be an expression. G is nonzero e.g. 92% of
> time, so a conditional expression would be slower. As noted by Josh
> Bloch, 28 is the correct replacement for the frequent misprint 24.
>
> #define HASH(G,H,X) ( H <<= 4, H += (X), G = H & 0xf0000000, H ^= G >>
> 28, H ^= G )
> """

I'm planning on adding various integer hash functions as well, just
haven't had the time. the above seems like one so I'll give it a go.





Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net




More information about the Python-list mailing list