space-efficient top-N algorithm
Carlos Ribeiro
cribeiro at mail.inet.com.br
Mon Feb 10 19:50:06 EST 2003
On Monday 10 February 2003 06:16, Alex Martelli wrote:
> Carlos Ribeiro wrote:
> ...
>
> > It *is* possible, however, to come up with a clever implementation that
> > *statistically* selects the top N URLs, using the two heaps as auxiliary
> > structures.
>
> I shudder to think of the task of analyzing what this "statistically"
> means here. The dependency of what gets selected on the "bunching"
> characteristics of the incoming sequence of data is frightening: I
> wouldn't even know how to start characterizing the quantitative
> aspects of the "bunching" and its distorting influence on selection.
Fully agreed. In fact, I only posted it because I thought it would be nice to
throw something other than PEP 308 into the discussion here :-) and also
because, even though the idea seems weird, it may still be valid in some
situations - for example, when the data set is *so* large as to make a
deterministic algorithm unviable.
> It seems to me that, unless one can get professional advice from
> some really top-notch statistician, this course of action is
> fraught with far too much risk of systematic distortion.
Depending on the data set the risk may be very high, or very small indeed. And
while the math for a full analysis seems to be hairy, a simple analysis may
still be possible. I'll take the risk of burning my tongue here - I'm no
statistician, and statistics is known to be a perversely cruel form of
wizardry. But it seems to me that, if we have a sufficiently big collection
of URLs, and if we have a reasonably big window for the short-term cache,
then the filter will do OK in most situations (a kind of 'law of large
numbers'). The most complicated scenario seems to be when one has relatively
few URLs in the original log file and a short memory window; then it may
happen that a good candidate for the top N is discarded because the
short-term cache was full, and has to start again from zero when it next
appears. But I think that, for real files, with lots of data, this is not the
most probable case anyway; that's why I feel this approach may be workable.
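Just to make the discussion concrete, the scheme I have in mind - a bounded
short-term counter feeding a long-term set of candidates - could be sketched
roughly like this. All the names are mine, and evicting the least-seen entry
is only one plausible policy; the point where evicted counts are thrown away
is exactly where the distortion discussed above creeps in:

```python
import heapq
from collections import Counter

def approx_top_n(urls, n, window):
    """Statistical top-N: keep at most `window` short-term counters;
    remember, per URL, the best count it ever reached while cached."""
    short_term = Counter()   # bounded "memory window" of recent URLs
    long_term = {}           # candidate URL -> best count observed
    for url in urls:
        short_term[url] += 1
        if len(short_term) > window:
            # Evict the least-seen entry; its partial count is lost,
            # which is the source of the non-determinism.
            victim, _ = min(short_term.items(), key=lambda kv: kv[1])
            del short_term[victim]
        count = short_term[url]   # 0 if url itself was just evicted
        if count > long_term.get(url, 0):
            long_term[url] = count
    return heapq.nlargest(n, long_term.items(), key=lambda kv: kv[1])
```

With a window large enough that nothing is evicted, this degenerates into an
exact top-N count; the interesting (and risky) regime is when the window is
much smaller than the number of distinct URLs.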
> > This algorithm is not deterministic; it cannot guarantee that you get the
> > 'top N' elements. It depends on the size of the heaps, and also on the
> > distribution of data. But it may be worth trying in some situations,
> > especially if a deterministic approach turns out not to be viable.
>
> If deterministic wasn't viable, I would at least try to pick a
> non-deterministic one whose behavior I can analyze and explain.
> It seems to me that the already-proposed idea of using a digest
> to stand for each URL offers a better way out, though.
That's why I had originally discarded this idea from my previous reply. I just
brought it up again because it was very interesting and it didn't deserve to
die without some discussion. At least we can spend a little time away from
PEP 308 :-)
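As a footnote on the digest idea quoted above - standing a fixed-size hash in
for each URL so that key size no longer depends on URL length - a minimal
sketch might look like this (the function name and the 8-byte digest width
are my own choices, not anything from the earlier thread):

```python
import hashlib
from collections import Counter

def digest_counts(urls, nbytes=8):
    # Count by a fixed-size digest instead of the full URL string.
    # An 8-byte prefix of MD5 makes accidental collisions vanishingly
    # unlikely for log-sized data sets, and every key costs the same
    # amount of memory no matter how long the URL is.
    counts = Counter()
    for url in urls:
        counts[hashlib.md5(url.encode()).digest()[:nbytes]] += 1
    return counts
```

The trade-off is that the mapping is one-way: to report the actual top URLs
at the end you'd need a second pass over the log, looking up only the
winning digests.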
Carlos Ribeiro
cribeiro at mail.inet.com.br