[Spambayes] RE: Central Limit Theorem??!! :)

Tim Peters tim.one@comcast.net
Mon, 23 Sep 2002 13:11:17 -0400


[T. Alexander Popiel]
> This can still be done in one pass.  While calculating mean and stddev,
> keep a running list of the 30 highest and 30 lowest f(w), then at the end
> figure out which of those 60 are most extreme.  (Assuming, of course,
> that I'm properly interpreting what is meant by 'extreme'.)

OK, it's really a three-pass process.

You don't know what any of the f(w) are until the first pass is complete.
Computing f(w) depends on first having seen every message that contains w.

Computing f(w) is then a second pass, but doesn't require the source of
messages, it just needs the accumulated counts from the first pass.

The third pass is needed to find the 30 extremes for *each* training message
(as Gary said).  This again requires knowing the set of words contained in
each message.

If you only put a grand total of 30 extremes in each of the ham and spam
populations, then, e.g., under Graham's p(w) all of the ham extremes will
almost certainly be 0.01, and all of the spam extremes 0.99, and the
population variances will be 0, rendering the scheme useless (the z-score
for any message whose mean isn't exactly 0.01 or 0.99 will be infinite).
Under Gary's f(w), it will approach that the more messages you train on.

> ...
> The formulas in elementary statistics books certainly aren't good for
> this, but the recurrence relations in Knuth's TAOCP vol 2 (Seminumerical
> Algorithms) are quite good for the incremental approach.  Quoting from
> 4.2.2, equations 15 & 16:
>
>   (15)  M[1] = x[1],  M[k] = M[k-1] + ((x[k] - M[k-1]) / k)
>   (16)  S[1] = 0,     S[k] = S[k-1] + ((x[k] - M[k-1]) * (x[k] - M[k]))
>
>   for a population x of size n, 2 <= k <= n, where
>   sigma = sqrt(S[n]/(n-1)).
>
> These formulas can be run forwards or backwards with equal grace,

[16] can be run backwards gracefully, but I'm not sure about [15].  In the
forward direction, [15] wins because the rounding error in the subtraction
is effectively divided by k.  I believe

    M[k] = M[k+1] + (M[k+1] - x[k+1]) / k

is a numerically robust way to run [15] backwards, though.

> and (if coded as written without 'simplification') handle round-off
> errors well, too.  (If greater precision is needed, the precision of the
> sums can be doubled through use of the formulas in exercise 19 of the
> same chapter.)
>
> IANAS, but I trust Knuth. ;-)

For exercise 19, you trust Kahan, and he's more trustworthy here than Knuth
by a long shot <wink>.  I'm not going to bother until we know better whether
this scheme works (Python's unbounded ints are builtin, and fast, and using
them instead obviously loses no significant information).