hash() yields different results for different platforms

Grant Edwards grante at visi.com
Wed Jul 12 10:08:06 EDT 2006


On 2006-07-12, Carl Banks <invalidemail at aerojockey.com> wrote:
> Grant Edwards wrote:
>> On 2006-07-11, Qiangning Hong <hongqn at gmail.com> wrote:
>>
>> > I'm writing a spider. I have millions of urls in a table (mysql) to
>> > check if a url has already been fetched. To check fast, I am
>> > considering to add a "hash" column in the table, make it a unique key,
>> > and use the following sql statement:
>> >   insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
>> > to add new url.
>> >
>> > I believe this will be faster than making the "url" column unique key
>> > and doing string comparation.  Right?
>>
>> I doubt it will be significantly faster.  Comparing two strings
>> and hashing a string are both O(N).
>
> Playing Devil's Advocate: The hash would be a one-time operation during
> database insertion, whereas string comparison would happen every
> search.

Good point.

> Conceivably, comparing hash strings (which is O(1)) could
> result in a big savings compared to comparing regular strings;

Still, I doubt that the URLs are long enough so that there's a
significant difference.

> but I expect most decent sql implementations already hash data
> internally, so rolling your own hash would be useless at best.

Precisely.  DB designers and implementers have been working on
this problem for 30 years.  I doubt the OP is going to be able
to best them with a few minutes work.

> If the OP's database is lacking, md5 is probably fine. Perhaps
> using a subset of the md5 (the low 32 bits, say) could speed
> up comparisons at risk of more collisions.  Probably a good
> trade off unless the DB is humungous.

My advice: do it the simple way first (let the DB handle it).
Don't try to fix a problem until you know it exists.

Premature optimization....

-- 
Grant Edwards                   grante             Yow!  It's strange, but I'm
                                  at               only TRULY ALIVE when I'm
                               visi.com            covered in POLKA DOTS and
                                                   TACO SAUCE...



More information about the Python-list mailing list