[Spambayes] RE: For the bold

Sun, 06 Oct 2002 02:43:30 -0400

[Tim]
> [clt2]
> Nham= 7500
> RmsZham= 2.27249107964
> Nspam= 7500
> RmsZspam= 2.354280998
>
> [clt3]
> Nham= 7500
> RmsZham= 9.77605846416
> Nspam= 7500
> RmsZspam= 10.1887670936

[Rob Hooft]
> OOF! Under clt3 your rms values are 4x bigger! I have to look at the
> details of that:

clt1 and clt2 build ham and spam populations out of individual word
probabilities.  If the central limit theorem actually applied (which it does
not), the way zscores are computed would make sense (at least when n > 30).
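
For concreteness, here is a minimal sketch of the textbook CLT z-score that
the paragraph above alludes to.  It is not the classifier's actual code:
the function name and its arguments are made up for illustration, and it
assumes the population mean and variance of word probabilities are already
known.

```python
import math

def clt_zscore(sample, pop_mean, pop_var):
    """Textbook z-score of a sample mean (hypothetical helper).

    sample   -- list of individual word probabilities from one msg
    pop_mean -- mean of the ham (or spam) population of word probabilities
    pop_var  -- variance of that same population
    """
    n = len(sample)
    sample_mean = sum(sample) / n
    # If the CLT applied, the sample mean would be roughly normal with
    # standard deviation sqrt(pop_var / n).
    return (sample_mean - pop_mean) / math.sqrt(pop_var / n)
```

The division by sqrt(pop_var / n) is justified only when the sample is a
set of independent draws from the population, which is the part that does
not hold for the words in a msg.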

clt3 builds ham and spam populations out of whole-msg scores.  The way
zscores are computed there is the same as under clt2, but it makes no sense
whatsoever under clt3.  I didn't care, because the results were at least as
good regardless; "zscores" in the hundreds are pretty common under clt3.

I think you should ignore the classifier's zscores, Rob:  *none* of them
make good sense, and under clt3 they make no sense.  The only virtue they
have is that tests say they work really well <wink -- but I can't escape
noticing that the less justification a scheme has here, the better it seems
to work!>.

> the assumption under which the rmspik.py code works is that the
> distributions of zham and zspam values are normally distributed
> if all values are "mirrored" around 0. I'll have to test that
> assumption for clt1 and clt3!

I didn't catch the meaning there, but I expect any assumption you would like
to make is most likely to hold under clt1 (which is the least extreme of
these gimmicks).
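
One rough way to sanity-check a normality assumption against the RmsZ
figures quoted at the top: if the z values really were standard normal,
their root-mean-square would come out close to 1.  The sketch below assumes
RmsZ means sqrt(mean(z**2)); that is a guess at what rmspik.py reports, not
something taken from its source, and the helper name is invented.

```python
import math
import random

def rms(zs):
    """Root-mean-square of a list of z values (hypothetical helper)."""
    return math.sqrt(sum(z * z for z in zs) / len(zs))

# Simulated stand-in for real zscores: 7500 draws from a standard normal,
# matching the Nham/Nspam counts above.  Real classifier output would be
# substituted here.
random.seed(0)
simulated = [random.gauss(0.0, 1.0) for _ in range(7500)]
simulated_rms = rms(simulated)
```

Under that reading, values like 2.3 (clt2) and about 10 (clt3) are both far
from what standard-normal zscores would produce, which fits the view that
the classifier's zscores should not be taken at face value.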