space-efficient top-N algorithm

David Garamond lists at zara.6.isreserved.com
Mon Feb 10 00:29:26 EST 2003


Rene Pijlman wrote:
>>Another point: you haven't mentioned the possibility that multiple
>>distinct URLs may hash to the same digest.
> 
> Yes, good point, I forgot to mention that. This is the price to
> pay for saving memory I guess, unless someone can come up with a
> plan that avoids this lossiness and still saves memory.

You could do the check when inserting URLs into the database. Since 
each record in the database stores (digest, url), detecting a 
collision is easy. When one happens, several things can be done: 
report it to a human by mail, for example, or prepend a disambiguating 
prefix to the URL (e.g. "[2]http://someurl", "[3]http://someurl") 
before hashing, so it lands on a different digest. You then record the 
colliding URL in a separate dictionary so that future occurrences of 
that URL resolve to the same record (a sketch follows below).
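
Here is a minimal sketch of that scheme in Python. The dict-based 
store and the names (db, collisions, insert_url) are mine, purely for 
illustration, and MD5 stands in for whatever 128-bit digest you use:

import hashlib

db = {}          # digest -> url, i.e. the (digest, url) records
collisions = {}  # original url -> rewritten url, for later inserts

def digest(url):
    # 128-bit digest of the URL (MD5 here, as one example)
    return hashlib.md5(url.encode("utf-8")).digest()

def insert_url(url):
    key = collisions.get(url, url)
    n = 2
    d = digest(key)
    while d in db and db[d] != key:
        # A distinct URL already owns this digest: perturb the URL
        # ("[2]http://someurl", "[3]http://someurl", ...) and retry.
        key = "[%d]%s" % (n, url)
        n += 1
        d = digest(key)
    db[d] = key
    if key != url:
        # Remember the rewrite so future inserts of the same URL
        # resolve to the same record instead of colliding again.
        collisions[url] = key
    return d

The side dictionary only ever holds the (rare) colliding URLs, so it 
stays tiny compared to the digest table itself.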

Also note that with a 128-bit digest, collisions are expected to be 
rare even at "Google scale" (billions of URLs).

-- 
dave
