hash() yields different results for different platforms

Qiangning Hong hongqn at gmail.com
Tue Jul 11 19:33:50 EDT 2006


I'm writing a spider. I have millions of URLs in a MySQL table, which I
use to check whether a URL has already been fetched. To make the check
fast, I am considering adding a "hash" column to the table, making it a
unique key, and using the following SQL statement:
  insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add a new URL.
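
To make the idea concrete, here is a rough sketch of the insert-and-ignore
pattern. It uses the standard-library sqlite3 module as a stand-in for
MySQL, so "INSERT OR IGNORE" replaces MySQL's "INSERT IGNORE". The table
layout, the add_url() helper, and the placeholder hash value "h1" are
assumptions made only for illustration.

```python
import sqlite3

# In-memory database as a stand-in for the real MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (url TEXT NOT NULL, hash TEXT NOT NULL UNIQUE)"
)

def add_url(conn, url, url_hash):
    """Insert the URL unless its hash already exists.

    Returns True if a row was inserted, False if it was ignored.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO urls (url, hash) VALUES (?, ?)",
        (url, url_hash),
    )
    return cur.rowcount == 1

# "h1" is a placeholder; a real spider would pass a digest of the URL.
first = add_url(conn, "http://example.com/", "h1")
second = add_url(conn, "http://example.com/", "h1")
count = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
```

The second call is silently ignored because of the unique key on "hash",
so the table still holds a single row.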

I believe this will be faster than making the "url" column a unique key
and doing string comparisons.  Right?

However, when I tried Python's builtin hash() function, I found that it
produces different values on my two computers!  On a Pentium 4,
hash('a') -> -468864544; on an amd64, hash('a') -> 12416037344.  Does
the hash function depend on the machine's word length?

If it does, I must consider another hash algorithm, because the spider
will run concurrently on several computers, some 32-bit and some
64-bit.  Is md5 a good choice? Will it be so slow that I gain no
performance over using the "url" column directly as the unique
key?
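
For reference, a minimal sketch of what an md5-based hash of a URL could
look like in current Python, using the standard hashlib module. An MD5
digest is computed over bytes, so it does not depend on the machine's
word length. The url_hash() name and the choice of UTF-8 encoding are
assumptions for illustration, and this says nothing about whether md5 is
fast enough for the table.

```python
import hashlib

def url_hash(url):
    """Return the 32-character hex MD5 digest of the URL's UTF-8 bytes."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

# The same input gives the same digest on 32-bit and 64-bit machines.
digest = url_hash("a")
```

The hex digest is always 32 characters, so it would fit a fixed-width
column such as CHAR(32).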

I will do some benchmarking to find out. But while I'm getting my hands
dirty, I would like to hear some advice from the experts here. :)



