[Spambayes] RE: Central Limit Theorem??!! :)

Tim Peters tim.one@comcast.net
Mon, 23 Sep 2002 23:22:13 -0400


[Gary Robinson, outlines another central limit scheme, but using
 ln(f(w)) for spam scores and ln(1-f(w)) for ham scores when
 building the ham and spam "extreme word" populations, then computing
 two corresponding log sums when scoring.
]

> ...
> This *should* bring us the benefits of the multiplicative approach
> and of the parametric stats approach at the same time.

My first "real test" says it's a winner!  See below.
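For concreteness, the core of the log-sum idea can be sketched like so.  This is a minimal illustration under my reading of Gary's description, not the actual Spambayes code; the names are made up:

```python
import math

def log_sums(extreme_spamprobs):
    # Given the f(w) values of a message's most extreme words, build the
    # two log sums Gary describes:  the sum of ln(f(w)) feeds the spam
    # score, and the sum of ln(1 - f(w)) feeds the ham score.  The idea
    # is that, for large enough n, the central limit theorem makes the
    # sample means of these sums approximately normal, so they can be
    # compared against the trained ham and spam population statistics.
    spam_sum = sum(math.log(p) for p in extreme_spamprobs)
    ham_sum = sum(math.log(1.0 - p) for p in extreme_spamprobs)
    return spam_sum, ham_sum
```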

> The downside of this is that the ln's make such a skewed distribution
> that it should take a bigger n to make the central limit theorem kick
> in.

This isn't a good time to mention it <wink>, but the original central-limit
code worked better when I reduced the # of extremes from 30 to 15.  Note
that these are the kinds of oddities I spent countless hours "tuning away"
on the Graham scheme, though, and I've still done almost nothing of that
nature for your schemes.

> BUT, OTOH, it WILL kick in, and something like 30 may really still
> be enough (it usually kicks in at significantly smaller numbers).

I predict more experiments on the horizon.
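As a toy sanity check on that intuition (not part of the test setup below):  single draws of ln(U), for U uniform on (0, 1), are badly skewed, but their sample means at n=30 already cluster tightly around E[ln U] = -1 with roughly normal coverage.

```python
import math
import random

random.seed(20020923)

def mean_of_logs(n):
    # Mean of n draws of ln(U) -- a skewed distribution with a long left
    # tail, but its sample means still settle toward E[ln U] = -1 as the
    # central limit theorem kicks in.  The lower bound just avoids log(0).
    return sum(math.log(random.uniform(1e-12, 1.0)) for _ in range(n)) / n

sample_means = [mean_of_logs(30) for _ in range(1000)]
grand_mean = sum(sample_means) / len(sample_means)
# Var[ln U] = 1, so the standard error of a mean of 30 draws is 1/sqrt(30);
# if the CLT has kicked in, about 68% of sample means land within one
# standard error of -1.
stderr = 1.0 / math.sqrt(30)
within_one_se = sum(1 for m in sample_means if abs(m + 1.0) <= stderr) / 1000.0
```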

> And you've also successfully used n=150 with f(w)

I've also used n=15 and n=1500, BTW.  150->1500 neither helped nor hurt.
150->15 gave mixed results, too close to call across the few small tests I
ran.

> and that is DEFINITELY enough.
>
> The other downside is that it just seems like a bit of a wild thing
> to do and sometimes when you do wild things, strange reasons emerge
> why they won't work.

It's an extreme problem, though.  I ran experiments long ago training on a
single ham and a single spam:  that was enough to get about 30% fp rate and
about 20% fn rate.  *Much* better than flipping a coin!  Training on two of
each did even better <wink>.

> But I really can't see any at this point as long as n is big enough
> that the sample means take on a normal distribution.
>
> THANKS for doing all the coding work to test this idea!!!!  :)

Thank me for the testing, if you have to thank me at all -- the coding is
fun, the testing is mind-numbing <drool>.

Speaking of which, the 3-pass nature of training under this scheme doesn't
work well with my cross-validation driver.  Instead the test here is on a
5x5 test grid, skipping the diagonal.  I picked 5 disjoint sets of 1000 ham
each at random, and 5 disjoint sets of 1000 spam each at random.  These were
then paired into 5 collections, 1 thru 5.  First I trained on pair 1, then
predicted on pairs 2, 3, 4, and 5.  Then I trained a new classifier on pair
2, and predicted on pairs 1, 3, 4, and 5.  Etc.  In all we get 20 test runs
out of this, and each message is predicted 4 times.
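The pairing logic is simple enough to sketch (illustrative only; the real driver also does the training and scoring):

```python
def grid_runs(num_pairs=5):
    # Enumerate the (train, predict) runs in the test grid described
    # above:  train on each pair in turn and predict on every *other*
    # pair, skipping the diagonal so we never predict on the training
    # data.  With 5 pairs this yields 5*4 = 20 runs, and each pair is
    # predicted exactly 4 times.
    return [(train, predict)
            for train in range(num_pairs)
            for predict in range(num_pairs)
            if train != predict]

runs = grid_runs()
```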

In both runs, 30 extremes were used, and f(w) with a=0.1.  The original
central limit code is on the left, and the logarithmic version on the right.
I haven't yet done anything to scale scores more sensibly into [0, 1],
so I'll skip the histograms.

-> <stat> tested 1000 hams & 1000 spams against 1000 hams & 1000 spams
   [ditto 39 times]

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.200  0.200  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.100  0.100  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.100  0.100  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.100  0.000  won   -100.00%
    0.100  0.100  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   1 times
tied 19 times
lost  0 times

total unique fp went from 3 to 2 won    -33.33%
mean fp % went from 0.03 to 0.025 won    -16.67%

false negative percentages
    2.200  0.800  won    -63.64%
    2.100  0.900  won    -57.14%
    1.700  0.600  won    -64.71%
    1.900  0.400  won    -78.95%
    1.500  1.100  won    -26.67%
    1.900  0.700  won    -63.16%
    1.600  0.600  won    -62.50%
    1.600  0.700  won    -56.25%
    1.400  0.600  won    -57.14%
    1.900  0.400  won    -78.95%
    1.600  0.700  won    -56.25%
    1.600  0.400  won    -75.00%
    1.900  0.600  won    -68.42%
    1.600  0.500  won    -68.75%
    1.300  0.500  won    -61.54%
    1.600  0.300  won    -81.25%
    1.300  0.400  won    -69.23%
    1.900  0.400  won    -78.95%
    1.300  0.400  won    -69.23%
    1.400  0.300  won    -78.57%

won  20 times
tied  0 times
lost  0 times

total unique fn went from 125 to 54 won    -56.80%
mean fn % went from 1.665 to 0.565 won    -66.07%

So the logarithmic version was indeed a huge and pure win for the fn rate!

There are two other comparisons I need to make here:  against your webpage
scheme (f(w) with "S"), and against our highly fiddled Graham scheme.

First the same thing (same test msgs, etc), with log central-limit on the
left, and f(w) with a=0.1, 1500 discriminators, and
robinson_minimum_prob_strength=0.1 (a gimmick that simply ignores all words
with spamprob in (0.4, 0.6)) on the right:

[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500
robinson_minimum_prob_strength: 0.1
robinson_probability_a: 0.1
use_central_limit: False

[TestDriver]
spam_cutoff: 0.575
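The gimmick amounts to a one-word strength test.  A sketch (the function name mirrors the option, not the actual code):

```python
def passes_min_strength(spamprob, min_strength=0.1):
    # A word contributes to scoring only if its spamprob is at least
    # min_strength away from the indifferent 0.5.  With min_strength=0.1
    # that ignores the nearly-neutral words with spamprob in (0.4, 0.6),
    # which carry little evidence either way.
    return abs(spamprob - 0.5) >= min_strength
```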

false positive percentages
    0.000  0.000  tied
    0.000  0.100  lost  +(was 0)
    0.000  0.000  tied
    0.200  0.200  tied
    0.000  0.000  tied
    0.000  0.100  lost  +(was 0)
    0.000  0.000  tied
    0.100  0.300  lost  +200.00%
    0.000  0.000  tied
    0.000  0.100  lost  +(was 0)
    0.000  0.000  tied
    0.100  0.200  lost  +100.00%
    0.000  0.000  tied
    0.000  0.100  lost  +(was 0)
    0.000  0.100  lost  +(was 0)
    0.100  0.200  lost  +100.00%
    0.000  0.000  tied
    0.000  0.100  lost  +(was 0)
    0.000  0.100  lost  +(was 0)
    0.000  0.000  tied

won   0 times
tied 10 times
lost 10 times

total unique fp went from 2 to 6 lost  +200.00%
mean fp % went from 0.025 to 0.08 lost  +220.00%

false negative percentages
    0.800  0.100  won    -87.50%
    0.900  0.000  won   -100.00%
    0.600  0.200  won    -66.67%
    0.400  0.200  won    -50.00%
    1.100  0.100  won    -90.91%
    0.700  0.100  won    -85.71%
    0.600  0.200  won    -66.67%
    0.700  0.100  won    -85.71%
    0.600  0.000  won   -100.00%
    0.400  0.000  won   -100.00%
    0.700  0.200  won    -71.43%
    0.400  0.200  won    -50.00%
    0.600  0.100  won    -83.33%
    0.500  0.100  won    -80.00%
    0.500  0.000  won   -100.00%
    0.300  0.100  won    -66.67%
    0.400  0.100  won    -75.00%
    0.400  0.100  won    -75.00%
    0.400  0.000  won   -100.00%
    0.300  0.300  tied

won  19 times
tied  1 times
lost  0 times

total unique fn went from 54 to 11 won    -79.63%
mean fn % went from 0.565 to 0.11 won    -80.53%

Alas, it turns out it would have been better to set spam_cutoff to 0.6 on
this run.  That would have cut 9 instances of fp and added 33 instances of
fn (note that since each msg is scored 4 times, it's *not* necessarily the
case that these deltas would reflect directly in the "total unique" counts),
leaving the two roughly comparable on this test.

I predict (but haven't yet tried it) that I'd get an improvement in the
central-limit scheme by systematically ignoring low-value words too.

Finally, the same thing with log central limit on the left, and our Graham
scheme on the right (16 discriminators, spam cutoff 0.90).

false positive percentages
    0.000  0.000  tied
    0.000  0.300  lost  +(was 0)
    0.000  0.000  tied
    0.200  0.200  tied
    0.000  0.000  tied
    0.000  0.100  lost  +(was 0)
    0.000  0.000  tied
    0.100  0.200  lost  +100.00%
    0.000  0.000  tied
    0.000  0.100  lost  +(was 0)
    0.000  0.000  tied
    0.100  0.200  lost  +100.00%
    0.000  0.100  lost  +(was 0)
    0.000  0.100  lost  +(was 0)
    0.000  0.100  lost  +(was 0)
    0.100  0.300  lost  +200.00%
    0.000  0.100  lost  +(was 0)
    0.000  0.000  tied
    0.000  0.300  lost  +(was 0)
    0.000  0.000  tied

won   0 times
tied  9 times
lost 11 times

total unique fp went from 2 to 10 lost  +400.00%
mean fp % went from 0.025 to 0.105 lost  +320.00%

false negative percentages
    0.800  0.100  won    -87.50%
    0.900  0.100  won    -88.89%
    0.600  0.300  won    -50.00%
    0.400  0.200  won    -50.00%
    1.100  0.300  won    -72.73%
    0.700  0.400  won    -42.86%
    0.600  0.300  won    -50.00%
    0.700  0.100  won    -85.71%
    0.600  0.100  won    -83.33%
    0.400  0.000  won   -100.00%
    0.700  0.200  won    -71.43%
    0.400  0.200  won    -50.00%
    0.600  0.100  won    -83.33%
    0.500  0.100  won    -80.00%
    0.500  0.100  won    -80.00%
    0.300  0.200  won    -33.33%
    0.400  0.100  won    -75.00%
    0.400  0.100  won    -75.00%
    0.400  0.200  won    -50.00%
    0.300  0.200  won    -33.33%

won  20 times
tied  0 times
lost  0 times

total unique fn went from 54 to 16 won    -70.37%
mean fn % went from 0.565 to 0.17 won    -69.91%

All of Gary's schemes beat the fiddled Graham scheme on fp rate here, but
the Graham scheme remains unbeaten on the fn rate.  (I should add, of
course, "as implemented by Tim" to all of those.)