General Hash Functions In Python

Arash Partow partow at gmail.com
Mon Jul 17 08:13:20 EDT 2006


John Machin wrote:
> Who is likely to bother? In timbot we trust. Have you read the comments
> at the start of Objects/dictobject.c?
>
No I haven't probably wont be anytime soon, as far as time, well
people interested, as is how started my original port, would be more
than willing to try/assess the routines for sets of strings that they
wish to hash etc.. this site may help explain plus has some code
snippets that may help you understand what I mean.

http://www.partow.net/programming/hashfunctions/index.html


> [Aside: it's not "in the built-in dict". Any type which wants its
> instances to be hashable defines its own hash method, one that suits the
> type.]
>
> This belief would be based on:
> (a) actual testing by you
> or (b) a refereed academic paper which did such tests on hash functions
> (including the Python "standard default hash")
> or (c) ...???
>

Experience has shown me that the belief than one "default" way of
hashing is generally not the optimal way for the overwhelming
majority of data, but then again....



> What is the Python "standard default hash", anyway? It is written in C.
> Wouldn't it have been a good idea to provide a Python function for it,
> so that people who are going to "come up with their own specific
> efficient solutions" had something to compare with? Or perhaps they are
> intended to "port" your functions back into C?
>

The above link provides a very simple hash test framework, the
"standard default hash" can be placed in there and tested amongst
the other algorithms.


> A few more questions:
>
> Why have you truncated the answers to 31 bits?
>
algorithms were adapted from unsigned int, i should have removed that
final truncation in python.


> Have you tested that these functions produce the same output (apart from
> the 31-bit thing) as the originals that you worked from? The reason for
> asking is that Python unlike C doesn't lose any bits in the <<
> operation; if this is followed by a >> you may be shifting some unwanted
>   1-bits back in again.
>

Some of them do others don't (not really important unless you are
trying to be compatible with other implementations which this is
not), I am aware of python not truncating/wrapping of values under
various operations, I believe its a nice little side effect from
python which gives more bits to play with as long as you don't
truncate them as I have.


> Talking about not losing bits: For your 36-byte example input, the
> SDBMHash (with its << 16) is up to about 566 bits before you truncate it
> to 31. A little over the top, perhaps. Maybe not the fastest way of
> doing it.
>
Possibly, do you have a better solution I'm very keen to learn...


> What is the purpose of the calls to long() in PJWHash?
>
trying to cast to long, looking at it now its rather superfluous.


> And the $64K question: What is the quintessential difference between
> PJWHash and ELFHash?
>
Nothing, elf is essentially pjw, its just optimised for 32-bit systems
in that the calculation for th's etc are static where has pjw
required sizeof to calc the th's which i couldn't find a way of doing,
so i fudged it in the hope that maybe sometime in the future a work
around of sorts could be developed.

Again this was just a simple posting, I hoped to maybe gets some
comments and may present some ideas to people, I don't expect anyone
to drop everything and start using these, just thought it might be
something interesting for this NG. btw I'm not a very good python
programmer ATM.




Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net




More information about the Python-list mailing list