hash() yields different results for different platforms
Grant Edwards
grante at visi.com
Wed Jul 12 10:08:06 EDT 2006
On 2006-07-12, Carl Banks <invalidemail at aerojockey.com> wrote:
> Grant Edwards wrote:
>> On 2006-07-11, Qiangning Hong <hongqn at gmail.com> wrote:
>>
>> > I'm writing a spider. I have millions of urls in a table (mysql) to
>> > check if a url has already been fetched. To check fast, I am
>> > considering to add a "hash" column in the table, make it a unique key,
>> > and use the following sql statement:
>> > insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
>> > to add new url.
>> >
>> > I believe this will be faster than making the "url" column unique key
>> > and doing string comparation. Right?
>>
>> I doubt it will be significantly faster. Comparing two strings
>> and hashing a string are both O(N).
>
> Playing Devil's Advocate: The hash would be a one-time operation during
> database insertion, whereas string comparison would happen every
> search.
Good point.
> Conceivably, comparing hash strings (which is O(1)) could
> result in a big savings compared to comparing regular strings;
Still, I doubt that the URLs are long enough so that there's a
significant difference.
> but I expect most decent sql implementations already hash data
> internally, so rolling your own hash would be useless at best.
Precisely. DB designers and implementers have been working on
this problem for 30 years. I doubt the OP is going to be able
to best them with a few minutes work.
> If the OP's database is lacking, md5 is probably fine. Perhaps
> using a subset of the md5 (the low 32 bits, say) could speed
> up comparisons at risk of more collisions. Probably a good
> trade off unless the DB is humungous.
My advice: do it the simple way first (let the DB handle it).
Don't try to fix a problem until you know it exists.
Premature optimization....
--
Grant Edwards grante Yow! It's strange, but I'm
at only TRULY ALIVE when I'm
visi.com covered in POLKA DOTS and
TACO SAUCE...
More information about the Python-list
mailing list