General Hash Functions In Python

John Machin sjmachin at lexicon.net
Mon Jul 17 07:00:46 EDT 2006


On 17/07/2006 5:52 PM, Arash Partow wrote:
> Hi Paul,
> 
> For different data types different hash functions work
> better/worse aka fewer or more collisions.
> 
> I believe the more choice people have and also the more
> ways people see how a particular thing can be done, then
> the easier it will be for them to come up with their own
> specific efficient solutions.

Who is likely to bother? In timbot we trust. Have you read the comments 
at the start of Objects/dictobject.c?

> 
> That said, I believe at least one (most likely more) of
> the hash functions in the group above will most always work
> better (ala less collisions) than the standard default hash
> available in the built-in dict for any random set of strings.

[Aside: it's not "in the built-in dict". Any type which wants its 
instances to be hashable defines its own hash method, one that suits the 
type.]

This belief would be based on:
(a) actual testing by you
or (b) a refereed academic paper which did such tests on hash functions 
(including the Python "standard default hash")
or (c) ...???

What is the Python "standard default hash", anyway? It is written in C. 
Wouldn't it have been a good idea to provide a Python function for it, 
so that people who are going to "come up with their own specific 
efficient solutions" had something to compare with? Or perhaps they are 
intended to "port" your functions back into C?

A few more questions:

Why have you truncated the answers to 31 bits?

Have you tested that these functions produce the same output (apart from 
the 31-bit thing) as the originals that you worked from? The reason for 
asking is that Python unlike C doesn't lose any bits in the << 
operation; if this is followed by a >> you may be shifting some unwanted 
  1-bits back in again.

Talking about not losing bits: For your 36-byte example input, the 
SDBMHash (with its << 16) is up to about 566 bits before you truncate it 
to 31. A little over the top, perhaps. Maybe not the fastest way of 
doing it.

What is the purpose of the calls to long() in PJWHash?

And the $64K question: What is the quintessential difference between 
PJWHash and ELFHash?

Cheers,
John



More information about the Python-list mailing list