[Spambayes] For the bold

Tim Peters tim.one@comcast.net
Sat, 05 Oct 2002 03:18:16 -0400


There's one more "central limit" scheme on the table now:
use_central_limit3.  The spamprob() code is identical to use_central_limit2,
but the ham and spam populations are computed differently.

Under central_limit2, the spam population is computed like so:

for each msg in the training spam:
    for each extreme word w in msg:
        if we haven't seen w before:
            add ln(prob(w)) to the spam population

The ham population is computed similarly, except using ln(1-prob(w))
instead.
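As a sanity check, the central_limit2 population pass might look like the
following Python sketch.  The names here (clim2_populations, prob,
extreme_words) are hypothetical stand-ins for the real classifier's
word-probability lookup and extreme-word selector, not the actual
SpamBayes code:

```python
import math

def clim2_populations(spams, hams, prob, extreme_words):
    """Build central_limit2 populations of per-word log scores.

    prob(w) and extreme_words(msg) are hypothetical stand-ins for the
    classifier's word-probability lookup and extreme-word selector.
    Each distinct word contributes one entry per population (assuming
    the "haven't seen w before" test is per-population).
    """
    spam_pop, seen = [], set()
    for msg in spams:
        for w in extreme_words(msg):
            if w not in seen:
                seen.add(w)
                spam_pop.append(math.log(prob(w)))

    ham_pop, seen = [], set()
    for msg in hams:
        for w in extreme_words(msg):
            if w not in seen:
                seen.add(w)
                ham_pop.append(math.log(1.0 - prob(w)))

    return spam_pop, ham_pop
```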

Under central_limit3, the spam population is composed of whole-msg scores,
not of individual word scores:

for each msg in the training spam:
    compute the mean of ln(prob(w)) over the extreme words w in msg
    add that mean to the spam population

And likewise for the ham population, using ln(1-prob(w)) instead.
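The central_limit3 pass differs only in that each message contributes a
single number.  Again a hedged sketch, with prob and extreme_words as
hypothetical stand-ins:

```python
import math

def clim3_populations(spams, hams, prob, extreme_words):
    """Build central_limit3 populations of per-message mean log scores.

    Each message contributes one number: the mean of ln(prob(w))
    (spam rule) or ln(1-prob(w)) (ham rule) over its extreme words.
    prob(w) and extreme_words(msg) are hypothetical stand-ins.
    """
    spam_pop = []
    for msg in spams:
        logs = [math.log(prob(w)) for w in extreme_words(msg)]
        spam_pop.append(sum(logs) / len(logs))

    ham_pop = []
    for msg in hams:
        logs = [math.log(1.0 - prob(w)) for w in extreme_words(msg)]
        ham_pop.append(sum(logs) / len(logs))

    return spam_pop, ham_pop
```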

There's not even a ghost of an illusion that the central limit theorem
applies to this variant, but the spamprob() code remains identical, happily
ignoring that it's utterly unjustified <wink>.

Still, brief preliminary tests suggest this *may* actually work better.
Here's the bottom line for a run training against 5000 ham + 5000 spam, then
predicting against 5000 of each:

-> <stat> Ham scores for all runs: 5000 items; mean 0.09; sdev 2.31
-> <stat> min 0; median 0; max 100
* = 82 items
  0 4992 *************************************************************
 25    7 *
 50    0
 75    1 *  this was the Nigerian scam spam

-> <stat> Spam scores for all runs: 5000 items; mean 99.68; sdev 4.07
-> <stat> min 0; median 100; max 100
* = 82 items
  0    1 *  this was the spam with a uuencoded body we ignore
 25    6 *
 50   24 *
 75 4969 *************************************************************

The advantage-- if it's real --is that it's certain more often:  the scores
cluster harder at the 0 and 100 extremes.

The populations are sharply separated:

ham ham mean: 5000 items; mean -0.35; sdev 0.20
-> <stat> min -3.55286; median -0.316515; max -0.00523756

spam ham mean: 5000 items; mean -3.87; sdev 0.92
-> <stat> min -6.03683; median -3.857; max -1.22996

That is, when we score a ham using the ham ln(1-prob) rule, the mean msg
mean is -0.35 with a small sdev of 0.20.  But when we score a spam using the
ham ln(1-prob) rule, the mean msg mean is -3.87, with a larger sdev.

Another pair of results says what happens when we score ham and spam using
the spam ln(prob) rule:

ham spam mean: 5000 items; mean -3.02; sdev 0.71
-> <stat> min -5.72426; median -2.91819; max -0.602309

spam spam mean: 5000 items; mean -0.11; sdev 0.14
-> <stat> min -2.23055; median -0.0546932; max -0.00268306

It's essentially impossible for a msg to score well under both measures, but
it's easy for a msg to score poorly under both.  The most appropriate
decision rule again appears to be that it doesn't matter how poorly a msg
scores in absolute terms; what matters is how much more poorly it scores
under one measure than under the other.
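That rule can be illustrated with a toy comparison.  This is only a sketch
of the "only the difference matters" idea, not the actual spamprob() code
(which works with the population statistics directly); prob, ham_mean, and
spam_mean are hypothetical stand-ins:

```python
import math

def decide(msg_words, prob, ham_mean, spam_mean):
    """Toy decision rule: score the msg under both log rules and
    compare how far each score falls from its population's mean.
    The absolute scores don't matter, only the relative distances.
    All parameters are hypothetical stand-ins for the classifier's
    real statistics.
    """
    ham_score = sum(math.log(1.0 - prob(w)) for w in msg_words) / len(msg_words)
    spam_score = sum(math.log(prob(w)) for w in msg_words) / len(msg_words)
    ham_dist = abs(ham_score - ham_mean)
    spam_dist = abs(spam_score - spam_mean)
    return "ham" if ham_dist < spam_dist else "spam"
```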