hash() yields different results for different platforms

Carl Banks invalidemail at aerojockey.com
Wed Jul 12 00:37:46 EDT 2006


Grant Edwards wrote:
> On 2006-07-11, Qiangning Hong <hongqn at gmail.com> wrote:
>
> > I'm writing a spider. I have millions of urls in a table (mysql) to
> > check if a url has already been fetched. To check fast, I am
> > considering to add a "hash" column in the table, make it a unique key,
> > and use the following sql statement:
> >   insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
> > to add new url.
> >
> > I believe this will be faster than making the "url" column a unique
> > key and doing string comparison.  Right?
>
> I doubt it will be significantly faster.  Comparing two strings
> and hashing a string are both O(N).

Playing Devil's Advocate: the hash would be a one-time operation at
database insertion, whereas string comparison would happen on every
search.  Conceivably, comparing hashes (which is O(1), since the digest
has a fixed length) could be a big savings over comparing full URL
strings; but I expect most decent SQL implementations already hash data
internally, so rolling your own hash would be useless at best.
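For illustration, computing the digest once at insert time might look
something like this (untested sketch; the md5 module is standard, but
the DB-API call assumes something like MySQLdb and the "urls" table
with a "hash" column from the OP's post):

    import md5   # hashlib.md5 on Python >= 2.5

    def url_digest(url):
        # Computed once, when the row is inserted; later lookups compare
        # the fixed-length digest instead of the full URL string.
        return md5.new(url).hexdigest()   # 32-character hex string

    # cursor.execute("INSERT IGNORE INTO urls (url, hash) VALUES (%s, %s)",
    #                (newurl, url_digest(newurl)))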

If the OP's database is lacking, md5 is probably fine.  Perhaps using a
subset of the md5 (the low 32 bits, say) could speed up comparisons at
the risk of more collisions.  Probably a good trade-off unless the DB
is humongous.
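
Something along these lines (again untested, and the column type is
just a guess) would pull out the low 32 bits as an integer that fits a
MySQL INT UNSIGNED column:

    import md5

    def url_hash32(url):
        # Low 32 bits of the md5 digest as an unsigned integer; cheaper
        # to index and compare than a 32-character hex string, at the
        # cost of more collisions, so keep the full URL around to verify
        # any hit.
        return int(md5.new(url).hexdigest()[-8:], 16)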


Carl Banks



