hash() yields different results for different platforms

Paul Rubin http
Tue Jul 11 20:07:12 EDT 2006


"Qiangning Hong" <hongqn at gmail.com> writes:
> However, when I come to Python's builtin hash() function, I found it
> produces different values in my two computers!  In a pentium4,
> hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344.  Does
> hash function depend on machine's word length?

The hash function is unspecified and can depend on anything the
implementers feel like.  It may(?) even be permitted to differ from
one run of the interpreter to another (I haven't checked the spec for
this).  Don't count on it being consistent from machine to machine.
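
As a rough illustration of the run-to-run question: CPython releases from 3.3 on (much later than this thread) randomize string hashes per process by default, controlled by the PYTHONHASHSEED environment variable. The sketch below assumes such an interpreter; the helper name hash_in_fresh_interpreter is made up for this example.

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a new interpreter process
    started with PYTHONHASHSEED set to the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out)

# Same seed: the two processes agree with each other.
first = hash_in_fresh_interpreter("a", 1)
second = hash_in_fresh_interpreter("a", 1)
print(first == second)

# Different seed: the value will normally differ, which is why hash()
# should not be stored or shared between processes or machines.
other = hash_in_fresh_interpreter("a", 2)
print(first, other)
```

Without a fixed seed, each new interpreter picks its own, so the value of hash('a') is only meaningful inside the process that computed it.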

> If it does, I must consider another hash algorithm because the spider
> will run concurrently in several computers, some are 32-bit, some are
> 64-bit.  Is md5 a good choice? Will it be too slow that I have no
> performance gain than using the "url" column directly as the unique key?

If you're going to accept the overhead of an SQL database, you might as
well use the abstraction it gives you and let the db take care of
indexing, instead of implementing what amounts to your own form of it.
But md5(url) is certainly very fast compared with processing the
outgoing http connection that you presumably plan to open for each url.
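
Both options can be sketched in a few lines. This is only an illustration: sqlite3 stands in for whatever database the spider really uses, and the table name pages and the helpers url_key and add_url are hypothetical.

```python
import hashlib
import sqlite3

def url_key(url):
    """Return the MD5 digest of the URL as 32 hex characters.
    The result depends only on the bytes of the URL, not on the
    machine's word length or the interpreter build."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

# Letting the database do the indexing: the PRIMARY KEY constraint
# gives a unique index on the url column with no hand-rolled hashing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY)")

def add_url(conn, url):
    """Insert the URL; return True if it was new, False if already stored."""
    cur = conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    return cur.rowcount == 1

print(url_key("http://example.com/"))
print(add_url(conn, "http://example.com/"))  # first insert
print(add_url(conn, "http://example.com/"))  # duplicate is ignored
```

If a fixed-width key is still wanted, url_key(url) could be stored in its own indexed column, but the plain unique index on url is the simpler starting point.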

> I will do some benchmarking to find it out. 

That's the right way to answer questions like this.
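
A minimal way to start such a benchmark, assuming the standard timeit module is acceptable; the sample URL and the helper md5_key are made up for the example, and the numbers will vary from machine to machine.

```python
import hashlib
import timeit

url = "http://example.com/some/path?query=1"  # hypothetical sample URL

def md5_key(u):
    """Hex MD5 digest of a URL string."""
    return hashlib.md5(u.encode("utf-8")).hexdigest()

n = 100000
seconds = timeit.timeit(lambda: md5_key(url), number=n)
print("md5: %.2f microseconds per URL" % (seconds / n * 1e6))
```

The per-URL cost measured this way can then be compared against the time of one HTTP fetch or one database insert to see whether the hashing step matters at all.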


