space-efficient top-N algorithm
David Garamond
lists at zara.6.isreserved.com
Sun Feb 9 20:24:37 EST 2003
Rene Pijlman wrote:
>>However, the number of URLs is large and some of the URLs are long
>>(>100 characters). My process grows to more than 100 MB in size. I
>>already cut the URLs to a max of 80 characters before entering them into
>>the dictionary, but it doesn't help much.
>
> You could consider hashing the URL to a digest, using the md5 or
> sha module for example. But then you would need to make a second
> pass over the log file to translate the top-50 digests to their
> URLs.
Yes, that's a great idea. Thanks! And I should have thought of it
myself, since I remember reading the Google whitepaper some years ago
where they used the same technique. (But then, since RAM and disk are so
cheap nowadays, I seldom use my own memory anymore... :-) )
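For the archives, here is a minimal sketch of the two-pass approach Rene
describes: count fixed-size MD5 digests (16 bytes each) instead of the
full URLs, then make a second pass to translate the top-N digests back
to their URLs. The function name and the n=50 default are just
illustrative, not from any particular library:

```python
import hashlib
from collections import Counter
from heapq import nlargest

def digest(url):
    # A 16-byte MD5 digest stands in for the (possibly >100-char) URL.
    return hashlib.md5(url.encode("utf-8")).digest()

def top_n_urls(urls, n=50):
    # First pass: count digests, so memory use per key is fixed.
    counts = Counter(digest(u) for u in urls)
    top = dict(nlargest(n, counts.items(), key=lambda kv: kv[1]))
    # Second pass: recover the URL text for just the top-N digests.
    names = {}
    for u in urls:
        d = digest(u)
        if d in top and d not in names:
            names[d] = u
    return sorted(((names[d], c) for d, c in top.items()),
                  key=lambda pair: -pair[1])
```

With a real log file you would re-open (or seek back in) the file for
the second pass rather than keep the URLs in a list, since holding them
in memory is exactly what this avoids.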
--
dave