hash() yields different results for different platforms

Nick Vatamaniuc vatamane at gmail.com
Wed Jul 12 05:17:55 EDT 2006


Using Python's hash() as a column in the table might not be a good
idea, and you just found out why. Instead, you could store just the
base URL and create an index on that, so that next time you quickly
get all URLs from the same base address and then do a linear search
for the specific one. Or, even easier, implement your own hash without
using Python's built-in hash() function at all. For example, transform
each character to an int and multiply them all mod 2^32-1, or
something like that. Even better, I think someone already posted the
algorithm Python uses to hash strings; just re-implement it in Python
so that your version yields the same hash on any platform.
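Here is a minimal sketch of the hand-rolled idea (the multiplier 31
and the modulus are arbitrary choices of mine, not anything Python
uses internally):

    def portable_hash(s):
        # Reducing mod 2**32 - 1 keeps the value inside 32 bits, so
        # the result is identical on 32-bit and 64-bit machines.
        h = 0
        for ch in s:
            h = (h * 31 + ord(ch)) % (2**32 - 1)
        return h

Since the value never depends on the native word size, the same string
hashes to the same number on every box in the cluster.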

Hope this helps,
Nick V.

Qiangning Hong wrote:
> I'm writing a spider. I have millions of URLs in a MySQL table, and
> I need to check whether a URL has already been fetched. To make the
> check fast, I am considering adding a "hash" column to the table,
> making it a unique key, and using the following SQL statement:
>   insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
> to add new URLs.
>
> I believe this will be faster than making the "url" column the unique
> key and doing string comparison.  Right?
>
> However, when I tried Python's builtin hash() function, I found it
> produces different values on my two computers!  On a Pentium 4,
> hash('a') -> -468864544; on an amd64, hash('a') -> 12416037344.  Does
> the hash function depend on the machine's word length?
>
> If it does, I must consider another hash algorithm, because the
> spider will run concurrently on several computers, some 32-bit, some
> 64-bit.  Is md5 a good choice?  Will it be so slow that I get no
> performance gain over using the "url" column directly as the unique
> key?
>
> I will do some benchmarking to find out.  But before getting my hands
> dirty, I would like to hear some advice from the experts here. :)
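On the md5 question above: an md5 digest depends only on the bytes of
the input, never on the machine's word size, so it is the same on
every platform. A quick sketch using the standard library's md5
module (the table and connection details here are made up for
illustration; any DB-API driver would look much the same):

    import md5
    import MySQLdb

    def url_hash(url):
        # The digest is a pure function of the input bytes, so it is
        # identical on 32-bit and 64-bit machines.
        return md5.new(url).hexdigest()  # 32-character hex string

    conn = MySQLdb.connect(db='spider')  # connection details invented
    cur = conn.cursor()
    newurl = 'http://example.com/page'
    cur.execute("insert ignore into urls (url, hash) values (%s, %s)",
                (newurl, url_hash(newurl)))
    conn.commit()

Whether md5 is fast enough compared to indexing the "url" column
directly is exactly what the benchmark should tell you.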
