[Spambayes] Testing methodology

Tim Peters tim.one@comcast.net
Sat, 14 Sep 2002 14:10:59 -0400


[Neil Schemenauer, running a 6-fold cross validation]

> Here are my results using the default options:
>
> [assorted lines elided by tim]
>
> -> <stat> tested 300 hams & 300 spams against 1500 hams & 1500 spams
>       0.333   0.667
> -> <stat> 1 new false positives
> -> <stat> 2 new false negatives

> -> <stat> tested 300 hams & 300 spams against 1500 hams & 1500 spams
>       0.000   1.333
> -> <stat> 0 new false positives
> -> <stat> 4 new false negatives

> -> <stat> tested 300 hams & 300 spams against 1500 hams & 1500 spams
>       1.000   1.667
> -> <stat> 3 new false positives
> -> <stat> 5 new false negatives

> -> <stat> tested 300 hams & 300 spams against 1500 hams & 1500 spams
>       0.333   0.333
> -> <stat> 1 new false positives
> -> <stat> 1 new false negatives

> -> <stat> tested 300 hams & 300 spams against 1500 hams & 1500 spams
>       0.000   2.000
> -> <stat> 0 new false positives
> -> <stat> 6 new false negatives

> -> <stat> tested 300 hams & 300 spams against 1500 hams & 1500 spams
>       0.000   2.000
> -> <stat> 0 new false positives
> -> <stat> 6 new false negatives

> total unique false pos 5
> total unique false neg 24

> average fp % 0.277777777778
> average fn % 1.33333333333
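(Each of the six <stat> blocks above is one fold of the drill: split each
corpus into 6 sets of 300, train on the other 5, score the held-out set,
rotate.  A minimal sketch of that loop -- the classifier and tokenizer
interfaces, and the 0.90 cutoff, are assumed here for illustration; the
real driver also builds histograms and does the <stat> reporting:)

    def cross_validate(hams, spams, new_classifier, tokenize,
                       nfolds=6, spam_cutoff=0.90):
        fold = len(hams) // nfolds      # 1800 msgs -> 6 folds of 300
        for i in range(nfolds):
            lo, hi = i * fold, (i + 1) * fold
            c = new_classifier()
            # Train on the other 5 folds: 1500 hams & 1500 spams.
            for msg in hams[:lo] + hams[hi:]:
                c.learn(tokenize(msg), False)
            for msg in spams[:lo] + spams[hi:]:
                c.learn(tokenize(msg), True)
            # Score the held-out 300 & 300; one mistake = 1/300 = 0.333%.
            fp = sum(c.spamprob(tokenize(m)) >= spam_cutoff
                     for m in hams[lo:hi])
            fn = sum(c.spamprob(tokenize(m)) < spam_cutoff
                     for m in spams[lo:hi])
            print("    %.3f   %.3f"
                  % (100.0 * fp / fold, 100.0 * fn / fold))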

On a similar 6-fold all-defaults run, also with

-> <stat> tested 300 hams & 300 spams against 1500 hams & 1500 spams

my bottom line was

total unique false pos 3
total unique false neg 6
average fp % 0.166666666667
average fn % 0.333333333333

If I take received, return-path, and in-reply-to out of the default
safe_headers option, it gets worse:

total unique false pos 3
total unique false neg 9
average fp % 0.166666666667
average fn % 0.5

I still have a nagging suspicion that just counting received and return-path
instances is discriminating between c.l.py and bruceg's spam for bogus
reasons.
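For concreteness, the counting in question works roughly like this -- a
sketch from memory, not a verbatim copy of tokenizer.py, and the exact
token spelling is an assumption:

    # For each header named in the safe_headers option, emit one token
    # recording how many times that header occurs in the message.
    def header_count_tokens(msg, safe_headers):
        counts = {}
        for name in msg.keys():         # keys() includes repeated headers
            name = name.lower()
            if name in safe_headers:
                counts[name] = counts.get(name, 0) + 1
        for name, n in sorted(counts.items()):
            yield "header:%s:%d" % (name, n)

A c.l.py message picks up a characteristic number of Received and
Return-Path lines from the list machinery, so a token like
"header:received:5" can separate the corpora by mail-path accident rather
than by anything spammy -- which is exactly the suspicion above.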

> The fp are mostly pretty questionable things (e.g. a joke in the form of
> a sales pitch forwarded to me).  Turning on mine_received_headers gives
> a small improvement:

I disagree!  It's a major improvement.  Indeed, cutting the f-n rate by a
third, and without harming the f-p rate, is a huge win.  From my time in
speech recog, I've been very careful all along to quote error rates instead
of accuracy rates, because it seems psychologically impossible for people to
grok that improving accuracy from, say, 98% to 99% is a spectacular
improvement.  Phrasing that as cutting the error rate from 2% to 1% says
exactly the same thing, but for whatever reason makes it much clearer that
performance has gotten twice as good.  Alas, when the error rates get under
1% (which they never did in speech recog), it appears it's also
psychologically impossible not to think of them as "well, two small numbers
are pretty alike -- no big deal either way".  But cutting an error rate in
half is still a spectacular improvement, and is actually harder to do the
smaller the absolute rate is to begin with.  Cutting one by a third is
merely a huge win <wink>.
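The arithmetic, spelled out:

    # The same absolute gain looks marginal as accuracy, dramatic as error:
    for acc_before, acc_after in [(98.0, 99.0), (99.8, 99.9)]:
        cut = (acc_after - acc_before) / (100.0 - acc_before)
        print("accuracy %g%% -> %g%%: error rate cut %.0f%%"
              % (acc_before, acc_after, 100.0 * cut))

Both lines print "error rate cut 50%": mistakes halved, either way you
phrase it.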

> false positive percentages
>     0.333  0.333  tied
>     0.000  0.000  tied
>     1.000  1.000  tied
>     0.333  0.000  won   -100.00%
>     0.000  0.000  tied
>     0.000  0.000  tied
>
> won   1 times
> tied  5 times
> lost  0 times
>
> total unique fp went from 5 to 4 won    -20.00%
> mean fp % went from 0.277777777778 to 0.222222222222 won    -20.00%
>
> false negative percentages
>     0.667  0.667  tied
>     1.333  0.667  won    -49.96%
>     1.667  1.000  won    -40.01%
>     0.333  0.333  tied
>     2.000  1.667  won    -16.65%
>     2.000  1.000  won    -50.00%
>
> won   4 times
> tied  2 times
> lost  0 times
>
> total unique fn went from 24 to 16 won    -33.33%
> mean fn % went from 1.33333333333 to 0.88888888889 won    -33.33%

That's a pure win, and a big one.
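The arithmetic behind that won/tied/lost table is just paired per-fold
percentage changes.  A minimal sketch (output formatting approximated, not
copied from the comparison script; a rate rising from exactly 0 would need
a divide-by-zero guard):

    def compare(before, after):
        won = tied = lost = 0
        for b, a in zip(before, after):
            if a == b:
                tied += 1
                print("%7.3f %6.3f  tied" % (b, a))
            else:
                won += a < b
                lost += a > b
                print("%7.3f %6.3f  %s %+8.2f%%"
                      % (b, a, "won " if a < b else "lost",
                         100.0 * (a - b) / b))
        print("won %d / tied %d / lost %d" % (won, tied, lost))

    # The f-n columns quoted above:
    compare([0.667, 1.333, 1.667, 0.333, 2.000, 2.000],
            [0.667, 0.667, 1.000, 0.333, 1.667, 1.000])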