space-efficient top-N algorithm
David Garamond
lists at zara.6.isreserved.com
Mon Feb 10 00:29:26 EST 2003
Rene Pijlman wrote:
>>Another point: you haven't mentioned the possibility that multiple
>>distinct URLs may hash to the same digest.
>
> Yes, good point, I forgot to mention that. This is the price to
> pay for saving memory I guess, unless someone can come up with a
> plan that avoids this lossiness and still saves memory.
You could do the checking when inserting URLs into the database. Since
each record in the database stores (digest, url), this is easy. When a
collision happens, several things can be done: report it to a human by
mail, for example, or add some characters to the URL (e.g.
"[2]http://someurl", "[3]http://someurl") to avoid the collision. After
that, record the colliding URL in another dictionary so future lookups
of that URL find the perturbed variant.
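A minimal sketch of that idea, using an in-memory dict as a stand-in for the database (the names `seen`, `collisions`, and `insert` are mine, not from any real codebase):

```python
import hashlib

seen = {}        # digest -> url stored under that digest
collisions = {}  # original url -> perturbed variant actually stored

def insert(url):
    """Insert a URL; on a digest collision with a *different* URL,
    perturb the URL ("[2]http://...", "[3]http://...") and retry."""
    candidate = url
    n = 2
    while True:
        digest = hashlib.md5(candidate.encode()).digest()
        stored = seen.get(digest)
        if stored is None:
            seen[digest] = candidate
            if candidate != url:
                # remember the variant for future lookups of this URL
                collisions[url] = candidate
            return digest
        if stored == candidate:
            # same URL already inserted: not a collision
            return digest
        # true collision: two distinct URLs hashed to the same digest
        candidate = "[%d]%s" % (n, url)
        n += 1
```

Checking `collisions` first on lookup makes the scheme transparent to the rest of the crawler.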
Also note that for a 128-bit digest the probability of a collision is
expected to be very low, even at "Google size" (billions of URLs).
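A rough birthday-bound estimate backs this up (the figure of 5 billion URLs is my assumption for "Google size"):

```python
# With k uniformly random 128-bit digests, the expected number of
# colliding pairs is about k*(k-1)/2 / 2**128 (birthday bound).
k = 5 * 10**9                      # assumed number of URLs
expected_pairs = k * (k - 1) / 2 / 2**128
# On the order of 1e-20: effectively no collisions are expected.
```

So for MD5-sized digests, collisions are a theoretical concern worth handling, not a practical bottleneck.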
--
dave