space-efficient top-N algorithm

Carlos Ribeiro cribeiro at mail.inet.com.br
Mon Feb 10 19:50:06 EST 2003


On Monday 10 February 2003 06:16, Alex Martelli wrote:
> Carlos Ribeiro wrote:
>    ...
>
> > It *is* possible, however, to come up with a clever implementation that
> > *statistically* selects the top N URLs, using the two heaps as auxiliary
> > structures.
>
> I shudder to think of the task of analyzing what this "statistically"
> means here.  The dependency of what gets selected on the "bunching"
> characteristics of the incoming sequence of data is frightening: I
> wouldn't even know how to start characterizing the quantitative
> aspects of the "bunching" and its distorting influence on selection.

Fully agreed. In fact, I only posted it because I thought it would be nice to
throw something other than PEP 308 into the discussion here :-) and also
because, even though the idea seems weird, it may still be valid in some
situations - for example, when the data set is *so* large that a deterministic
algorithm is not viable.
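
Just to make clear what I mean by the deterministic approach, it would be the
obvious one: keep one counter per distinct URL and take the largest N at the
end. Something along these lines (a rough sketch, with the log-parsing details
left out):

    import heapq

    def exact_top_n(urls, n):
        # One counter per distinct URL, so memory grows with the
        # number of distinct URLs seen in the log.
        counts = {}
        for url in urls:
            counts[url] = counts.get(url, 0) + 1
        # Keep only the n biggest counts while scanning the table.
        return heapq.nlargest(n, counts.items(), key=lambda item: item[1])

The problem, of course, is the counts dictionary itself: it has to hold every
distinct URL, which is exactly what we cannot afford for a really huge data
set.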

> It seems to me that, unless one can get professional advice from
> some really top-notch statistician, this course of action is
> fraught with far too much risk of systematic distortion.

Depending on the data set the risk may be very high, or very small indeed. And
while the math for a full analysis seems to be hairy, a simple analysis may
still be possible. I'll take the risk of burning my tongue here - I'm no
statistician, and statistics is known to be a perversely cruel form of
wizardry. But it seems to me that, if we have a sufficiently big collection
of URLs, and a reasonably big window for the short-term cache, then the
filter will do OK in most situations (a kind of 'law of large numbers').
The most complicated scenario seems to be when one has relatively few URLs
in the original log file and a short memory window; then it may happen that
a good candidate for the top N is discarded because the short-term cache was
full, and it has to start counting again from zero when it reappears. But I
think that, for real files with lots of data, this is not the most probable
case anyway; that's why I feel this approach may be workable.
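
To make that failure mode concrete, here is one possible way to code the idea.
I'm filling in details that weren't spelled out in my original post, so take
the structure of the short-term cache and the promotion rule as my own
assumptions: a bounded short-term cache counts recently seen URLs and evicts
the smallest count when it fills up, while a small heap holds the current
top-N candidates. An evicted URL that shows up again restarts from zero, which
is exactly the distortion discussed above.

    import heapq

    def approximate_top_n(urls, n, window_size):
        short_term = {}   # short-term cache: url -> recent count (bounded)
        top = []          # min-heap of (count, url), at most n entries

        for url in urls:
            if url not in short_term and len(short_term) >= window_size:
                # Cache is full: drop the URL with the smallest count.
                # If it reappears later it starts again from zero.
                victim = min(short_term, key=short_term.get)
                del short_term[victim]
            short_term[url] = short_term.get(url, 0) + 1

            count = short_term[url]
            # Let this URL compete for a slot among the top-N candidates,
            # keeping the best count it has reached so far.
            stale = [entry for entry in top if entry[1] == url]
            if stale:
                if count > stale[0][0]:
                    top.remove(stale[0])
                    heapq.heapify(top)
                    heapq.heappush(top, (count, url))
            elif len(top) < n:
                heapq.heappush(top, (count, url))
            elif count > top[0][0]:
                heapq.heapreplace(top, (count, url))

        return sorted(top, reverse=True)

There is no guarantee at all that the result is the true top N - a URL whose
hits are spread thinly over the whole file can be evicted over and over and
never accumulate a big enough count to be promoted.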

> > This algorithm is not deterministic; it cannot guarantee that you get the
> > 'top N' elements. It depends on the size of the heaps, and also on the
> > distribution of data. But it may be worth trying in some situations,
> > especially if a deterministic approach turns out not to be viable.
>
> If deterministic wasn't viable, I would at least try to pick a
> non-deterministic one whose behavior I can analyze and explain.
> It seems to me that the already-proposed idea of using a digest
> to stand for each URL offers a better way out, though.

That's why I had originally left this idea out of my previous reply. I just
brought it up again because it was very interesting and didn't deserve to
die without some discussion. At least we can spend a little time away from
PEP 308 :-)
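
For the record, the digest idea would mean keying the counters by a short
fixed-size hash instead of the full URL string, so the per-key memory stays
bounded no matter how long the URLs are. A rough sketch (the 8-byte MD5
prefix is my own choice, not something from the thread):

    import hashlib

    def count_by_digest(urls):
        counts = {}
        for url in urls:
            # 8 bytes of MD5 instead of the whole URL; collisions are
            # possible in principle but very unlikely for log-sized data.
            key = hashlib.md5(url.encode("utf-8")).digest()[:8]
            counts[key] = counts.get(key, 0) + 1
        return counts

You still need a second pass (or a small side table) to map the winning
digests back to the actual URLs, since the digest alone can't be reversed.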


Carlos Ribeiro
cribeiro at mail.inet.com.br
