hash() yields different results for different platforms

Tue Jul 11 20:07:52 EDT 2006

On 2006-07-11, Qiangning Hong <hongqn at gmail.com> wrote:

> I'm writing a spider. I have millions of urls in a table (mysql) and
> need to check whether a url has already been fetched. To make the check
> fast, I am considering adding a "hash" column to the table, making it a
> unique key, and using the following sql statement:
>   insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
> to add a new url.
>
> I believe this will be faster than making the "url" column a unique key
> and doing string comparison.  Right?

I doubt it will be significantly faster.  Comparing two strings
and hashing a string are both O(N) in the length of the string.

> However, when I tried Python's builtin hash() function, I
> found it produces different values on my two computers!  On a
> pentium4, hash('a') -> -468864544; on an amd64, hash('a') ->
> 12416037344.  Does the hash function depend on the machine's word
> length?

Apparently. :)  hash() returns a C long, which is 32 bits wide on
your pentium4 and 64 bits wide on your amd64.

The low 32 bits match, so perhaps you should just use that
portion of the returned hash?

>>> hex(12416037344)
'0x2E40DB1E0L'
>>> hex(-468864544 & 0xffffffffffffffff)
'0xFFFFFFFFE40DB1E0L'

>>> hex(12416037344 & 0xffffffff)
'0xE40DB1E0L'
>>> hex(-468864544 & 0xffffffff)
'0xE40DB1E0L'
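
Here is a minimal sketch of the masking idea, using the two values
quoted above rather than calling hash() directly. The helper names
low32 and stable_url_hash are made up for illustration. Note that
recent Python versions also randomize hash() of strings per process,
so for a value stored in a database I would lean on an explicitly
defined checksum instead; zlib.crc32 is used below purely as one
possible stand-in, not as what hash() computes.

```python
import zlib


def low32(h):
    """Keep only the low 32 bits of an integer hash value."""
    return h & 0xFFFFFFFF


# The two hash('a') values reported in the original post.
h_pentium4 = -468864544
h_amd64 = 12416037344

print(hex(low32(h_pentium4)))  # 0xe40db1e0
print(hex(low32(h_amd64)))     # 0xe40db1e0


def stable_url_hash(url):
    """Hypothetical helper: CRC-32 of the UTF-8 bytes of url.

    This is an assumption swapped in for hash(); it gives the same
    unsigned 32-bit value on every platform and in every process.
    """
    return zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF
```

A 32-bit value will collide once you have enough urls, so a unique
key on it alone would silently drop some new urls; the hash column is
safer as a plain index with the url itself still compared.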

-- 
Grant Edwards                   grante at visi.com