space-efficient top-N algorithm

Alex Martelli aleax at aleax.it
Tue Feb 11 03:08:44 EST 2003


On Tuesday 11 February 2003 01:50 am, Carlos Ribeiro wrote:
   ...
> Fully agreed. In fact, I only posted it because I thought it would be nice
> to throw something other than PEP308 for discussion here :-) and also,

Oh I'm all for that...!

> because even knowing that the idea seems to be weird, it still may be valid
> in some situations - for example, when the data set is *so* large as to
> make a deterministic algorithm not viable.

OK, let's assume it is.  I'd still like to start with an algorithm I know how 
to analyze -- for example, looking at each incoming URL with a
probability of 1/10, if the dataset is 10 times as large as I can handle.
From sampling theory, I _know_ how this will behave...
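
Concretely, the kind of sampling I mean is just (a throwaway sketch,
with made-up names; the keep/drop decision is independent per URL,
NOT "every tenth one"):

import random

def sample_urls(urls, rate=0.1):
    # Keep each incoming URL independently with probability `rate`
    # (Bernoulli sampling), rather than systematically every tenth one.
    for url in urls:
        if random.random() < rate:
            yield url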

> may still be possible. I'll take the risk to burn my tongue here - I'm no
> statistician, and statistics is known to be a perversely cruel form of

I know just enough statistics to have ended up as the target of a
friendly but firm rebuke in an article in "American Statistician", which
explained how I had misanalyzed error estimators in my articles on
"The Bridge World" (ending up with error estimators over 10 times
larger than warranted -- i.e. my results were much more precise than
I thought).  My amateur-statistician handwaving was enough to fool
the editor (a mathematician and an old hand at statistics, but still
NOT a full-time statistician...), but fortunately there ARE professional
statisticians in the world.  Now, I'd rather not give "American
Statistician" material for ANOTHER article pointing out how subtly
one can trip oneself up with this stuff;-).

> wizardry. But it seems to me that, if we have a sufficiently big collection
> of URLs, and if we have a reasonably big window for the short-term cache,
> then the filter will do ok in most situations (kind of 'law of big
> numbers'). The most complicated scenario seems to be when one has
> relatively few URLs in the original log file, with a short memory window;
> then it may happen that a good candidate for the top N is discarded because
> the short-term cache was full, starting from zero when it appears again.
> But I think that, for real files, with lots of data, this is not the most
> probable case, anyway; that's why I feel that this approach may be
> workable.

If your "short-term cache" is large enough, no problem, but that's just
begging the issue.  Real data sets are surprising more often than not
(if I ever met one that wasn't, THEN I'd be surprised!-).  "Bunching"
of URL appearances in dense bursts, in particular, would not surprise
me AT ALL, and it can cause just the issue you mention.

That's why I want to SAMPLE the incoming data if there's too much
to process them all -- sampling theory I *KNOW* about, it's right there
in the "Statistics for Dummies" books.  And the sampling must be at
random -- not systematically every tenth incoming URL, say, else you
run into other potential anomalies if there are "repetitive waves" in
the incoming data due to underlying processes you don't know about.

If I can afford two passes on the data, timewise, I'll do a first,
sampling pass to identify candidates, then another one to count the
exact number of occurrences of each candidate.  It's still possible
that a candidate worthy of inclusion in the top 50 has dropped out,
but I can _quantify_ the risk, and it does NOT depend on subtle
"dynamics" of the dataset that I can neither rule out nor fully
understand.  If two passes are prohibitively expensive, I'll use the
counts from the sampling -- not happily, but again I can quantify
the risks of errors (if I get my maths right...;-).
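
For concreteness, here's roughly the two-pass scheme I mean (just a
sketch -- the sampling rate, the 10*n candidate margin, and passing a
callable so the data can be iterated twice are all arbitrary choices
for illustration):

import random

def top_n_two_pass(make_url_iter, n, rate=0.1):
    # Pass 1: random sampling to identify candidate URLs.
    sampled = {}
    for url in make_url_iter():
        if random.random() < rate:
            sampled[url] = sampled.get(url, 0) + 1
    # Keep a generous margin of candidates, say the 10*n best-sampled.
    candidates = sorted(sampled, key=sampled.get, reverse=True)[:10 * n]
    # Pass 2: exact counts, restricted to the candidates.
    exact = dict.fromkeys(candidates, 0)
    for url in make_url_iter():
        if url in exact:
            exact[url] += 1
    return sorted(exact.items(), key=lambda uc: uc[1], reverse=True)[:n]

If the second pass is too expensive, you stop after pass 1 and return
the best sampled counts scaled by 1/rate, accepting the error bars that
sampling theory gives you.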


> > It seems to me that the already-proposed idea of using a digest
> > to stand for each URL offers a better way out, though.
>
> That's why I had originally discarded this idea from my previous reply. I
> just brought it up again because it was very interesting and it didn't
> deserve to die without some discussion. At least we can spend a little
> time far from PEP308 :-)

Yep.  It IS conceivable that digests -- while helping -- can't help
ENOUGH when the incoming dataset is truly huge, for example.
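
For what it's worth, the digest idea in code is roughly this (just a
sketch; md5 is an arbitrary choice -- with 16-byte keys instead of full
URL strings the dict shrinks a lot, and accidental collisions are
astronomically unlikely at these scales):

import hashlib

def digest_counts(urls):
    # Count 16-byte MD5 digests instead of the URL strings themselves.
    counts = {}
    for url in urls:
        key = hashlib.md5(url.encode('utf-8')).digest()
        counts[key] = counts.get(key, 0) + 1
    return counts

You'd still need one more pass to map the winning digests back to the
actual URLs, of course.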

Of course, auxiliary storage can also help in such cases.  I wrote
up an elegant generators-based mergesort using auxiliary external
storage -- http://www.strakt.com/docs/eup02_itgen_alex.pdf --
part of a tutorial on iterators and generators; it could easily be
adapted to perform COUNTING of occurrences based on some
external storage, again.  But here's a simpler 2-tape idea (a rough
Python sketch follows the steps below):

Say you have enough memory to process, internally, a million
URLs at a time, and unlimited external storage (it must be
sequentially accessible -- two infinite tapes will do just fine).
Slurp in a million URLs, sort them, emit a first file with counts
and URLs sorted in URL order.  Now for N times (when the
incoming dataset has N+1 million URLs) you do:
-- slurp in a million URLs and sort them
-- rewind tape A for reading, tape B for (over)writing
-- "merge" records from tape A and from the sorted list in
   memory, writing old, new or updated records as needed
   to keep things sorted by url, onto tape B
-- logically swap the roles of tapes A and B and repeat
   these 4 steps if there are more incoming data.
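
Here's a rough sketch of those steps, with two plain files standing in
for the tapes (the chunk size, file names, and tab-separated record
format are all arbitrary choices for illustration):

import itertools

CHUNK = 1000000   # assume this many URLs fit comfortably in memory

def counted_chunk(urls):
    # Count one in-memory chunk; return (url, count) pairs sorted by URL.
    counts = {}
    for url in urls:
        counts[url] = counts.get(url, 0) + 1
    return sorted(counts.items())

def read_tape(f):
    # Yield (url, count) records from an open "tape" file.
    for line in f:
        url, count = line.rstrip('\n').rsplit('\t', 1)
        yield url, int(count)

def write_tape(f, records):
    for url, count in records:
        f.write('%s\t%d\n' % (url, count))

def merge_counts(a, b):
    # Merge two URL-sorted (url, count) streams, summing counts
    # whenever the same URL appears in both.
    a, b = iter(a), iter(b)
    ra, rb = next(a, None), next(b, None)
    while ra is not None and rb is not None:
        if ra[0] < rb[0]:
            yield ra
            ra = next(a, None)
        elif ra[0] > rb[0]:
            yield rb
            rb = next(b, None)
        else:
            yield ra[0], ra[1] + rb[1]
            ra, rb = next(a, None), next(b, None)
    while ra is not None:
        yield ra
        ra = next(a, None)
    while rb is not None:
        yield rb
        rb = next(b, None)

def count_urls(url_iter, tape_a='tapeA.txt', tape_b='tapeB.txt'):
    # Returns the name of the tape file holding the final (url, count)
    # records, sorted by URL.
    url_iter = iter(url_iter)
    first = list(itertools.islice(url_iter, CHUNK))
    with open(tape_a, 'w') as out:
        write_tape(out, counted_chunk(first))
    while True:
        chunk = list(itertools.islice(url_iter, CHUNK))
        if not chunk:
            return tape_a
        with open(tape_a) as old, open(tape_b, 'w') as new:
            write_tape(new, merge_counts(read_tape(old),
                                         counted_chunk(chunk)))
        tape_a, tape_b = tape_b, tape_a

A final sequential pass over the resulting tape, keeping a small heap
of the 50 best counts seen so far, then extracts the top N.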

You do these steps O(length of incoming data) times, and each time
they take time proportional to the number of different URLs seen in
the incoming data -- so if the number of different URLs is
proportional to the length of the incoming data, it's
O(input length squared): no magic bullet.  Still,
it's simple, deterministic, and has good proportionality
constants, so the situation may not be truly desperate.
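
To put rough (purely illustrative) numbers on it: with 100 million
incoming URLs in million-URL chunks, that's about 100 merge passes; if,
say, a third of the URLs are distinct, the later passes each rewrite on
the order of 30 million records, so you're looking at a few billion
record reads and writes overall -- a lot, but all of it sequential I/O
driven by trivially simple inner loops.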


Alex





