[Spambayes] test sets?

Tim Peters tim.one@comcast.net
Fri, 06 Sep 2002 12:21:51 -0400


[Tim]
> ...
> Unfortunately, on my corpora it turns out to be *too* strong,
> ...

Here's what happens if I leave all the header counts in:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.100  0.025  won    -75.00%
    0.000  0.000  tied
    0.025  0.000  won   -100.00%
    0.025  0.000  won   -100.00%
    0.100  0.025  won    -75.00%
    0.025  0.000  won   -100.00%
    0.025  0.000  won   -100.00%
    0.050  0.000  won   -100.00%
    0.100  0.000  won   -100.00%
    0.025  0.000  won   -100.00%
    0.025  0.025  tied
    0.025  0.000  won   -100.00%
    0.025  0.000  won   -100.00%
    0.025  0.000  won   -100.00%
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.000  won   -100.00%
    0.100  0.025  won    -75.00%

won  14 times
tied  6 times
lost  0 times

total unique fp went from 9 to 2

false negative percentages
    0.364  0.145  won    -60.16%
    0.400  0.291  won    -27.25%
    0.400  0.364  won     -9.00%
    0.909  0.618  won    -32.01%
    0.836  0.545  won    -34.81%
    0.618  0.473  won    -23.46%
    0.291  0.291  tied
    1.018  0.654  won    -35.76%
    0.982  0.655  won    -33.30%
    0.727  0.545  won    -25.03%
    0.800  0.618  won    -22.75%
    1.163  0.872  won    -25.02%
    0.764  0.545  won    -28.66%
    0.473  0.291  won    -38.48%
    0.473  0.327  won    -30.87%
    0.727  0.509  won    -29.99%
    0.655  0.400  won    -38.93%
    0.509  0.218  won    -57.17%
    0.545  0.364  won    -33.21%
    0.509  0.436  won    -14.34%

won  19 times
tied  1 times
lost  0 times

total unique fn went from 168 to 124
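
For anyone reading the two tables cold: each row is a before/after pair of error rates (in percent) for one test run, followed by a verdict and the relative change. A minimal sketch of that bookkeeping, assuming nothing about the actual comparison script (the names `compare`, `summarize` and `FP_PAIRS` are made up for illustration, and the pairs are just the f-p rows copied from above):

```python
def compare(before, after):
    """Return (verdict, relative change in percent) for one before/after pair."""
    if after < before:
        verdict = "won"
    elif after > before:
        verdict = "lost"
    else:
        verdict = "tied"
    # The relative change is undefined when the baseline rate is zero.
    change = None if before == 0 else (after - before) / before * 100.0
    return verdict, change


def summarize(pairs):
    """Count how many pairs were won, tied and lost."""
    counts = {"won": 0, "tied": 0, "lost": 0}
    for before, after in pairs:
        verdict, _ = compare(before, after)
        counts[verdict] += 1
    return counts


# False positive rows from the table above, as (before, after) percentages.
FP_PAIRS = [
    (0.000, 0.000), (0.000, 0.000), (0.100, 0.025), (0.000, 0.000),
    (0.025, 0.000), (0.025, 0.000), (0.100, 0.025), (0.025, 0.000),
    (0.025, 0.000), (0.050, 0.000), (0.100, 0.000), (0.025, 0.000),
    (0.025, 0.025), (0.025, 0.000), (0.025, 0.000), (0.025, 0.000),
    (0.025, 0.025), (0.000, 0.000), (0.025, 0.000), (0.100, 0.025),
]

if __name__ == "__main__":
    print(summarize(FP_PAIRS))
    print(compare(0.364, 0.145))
```

The percentages here are computed from the rounded rates as printed, so they may differ in the last digit from whatever a script working on raw counts would report.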

A false positive *really* has to work hard then, eh?  The long quote of a
Nigerian scam letter is one of the two that made it, and spamprob() looked
at all this stuff before deciding it was spam:

prob = 0.999945196947
prob('domestic') = 0.99
prob('dollars)') = 0.99
prob('solicit') = 0.99
prob('partner.') = 0.99
prob('accounts,') = 0.99
prob('federal') = 0.99
prob('nigeria.') = 0.99
prob('ministry') = 0.99
prob('subject:Business') = 0.99
prob('overseas') = 0.99
prob('housing') = 0.99
prob('nigeria') = 0.99
prob('nigerian') = 0.99
prob('estate') = 0.99
prob('70%') = 0.99
prob('regime') = 0.99
prob('payment') = 0.99
prob('header:X-Complaints-To:1') = 0.01
prob('header:X-BeenThere:1') = 0.01
prob('header:NNTP-Posting-Host:1') = 0.01
prob('ended.') = 0.01
prob('wrote') = 0.01
prob('header:Path:1') = 0.01
prob('header:NNTP-Posting-Date:1') = 0.01
prob('header:X-Mailman-Version:1') = 0.01
prob('header:List-Id:1') = 0.01
prob('header:List-Archive:1') = 0.01
prob('header:X-Trace:1') = 0.01
prob('header:Organization:1') = 0.01
prob('header:Newsgroups:1') = 0.01
prob('header:List-Post:1') = 0.01
prob('header:References:1') = 0.01
prob('header:List-Help:1') = 0.01
prob('header:X-Newsreader:1') = 0.01
prob('states') = 0.959986
prob('united') = 0.96139
prob('money') = 0.964852
prob('country.') = 0.97034
prob('civil') = 0.96754
prob('partner') = 0.969003
prob('complex,') = 0.01
prob('funds') = 0.972142
prob('million') = 0.971369
prob('purchase') = 0.986651
prob('government') = 0.985578
prob('header:Precedence:1') = 0.0306554
prob('header:Xref:1') = 0.01
prob('header:List-Subscribe:1') = 0.01
prob('header:List-Unsubscribe:1') = 0.01
prob('header:Errors-To:1') = 0.0182013

It actually found more 0.01 clues than 0.99 ones then (21 vs. 17), but the
content is *so* bad, with another ten clues above 0.95, that nothing can
overcome the judgment of guilt.
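
To see how a pile of 0.01 clues can still lose, here is a rough sketch of Graham-style combining, where the score is prod(p) / (prod(p) + prod(1-p)) over the clues. That this is what spamprob() does internally is an assumption on my part; the function name `combine` and the `CLUES` list are made up for illustration, with the probabilities copied from the dump above. The sum is done in log-odds form, which is algebraically the same thing but avoids underflow.

```python
import math


def combine(probs):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p)).

    Computed as a sum of log-odds, which is equivalent and numerically safer.
    Every probability must lie strictly between 0 and 1.
    """
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probs)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


# Clue probabilities from the dump above: 17 at 0.99, 21 at 0.01,
# and the 12 remaining values as printed.
CLUES = (
    [0.99] * 17
    + [0.01] * 21
    + [0.959986, 0.96139, 0.964852, 0.97034, 0.96754, 0.969003,
       0.972142, 0.971369, 0.986651, 0.985578, 0.0306554, 0.0182013]
)

if __name__ == "__main__":
    print(combine(CLUES))
```

In log-odds terms each 0.99 clue cancels exactly one 0.01 clue, so the four surplus 0.01 clues and the two low header clues are outweighed by the ten content clues sitting above 0.95. On this clue list the sketch comes out at about 0.999945, in line with the reported score up to rounding of the printed probabilities.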

BTW, the false negative rate in my corpora is also getting near the point
where I won't be able to measure improvement reliably.  Since there are only
2750 spams in a spam set, 1% is 27.5 spams, whereas in a ham set 1% is
40 hams.  So, e.g., an f-n rate of 0.364% means a grand total of 10 false
negatives, and changing that by 1 measly msg makes a 10% relative difference
in the f-n rate.
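
The granularity argument is easy to check by hand, but here is a small sketch of the arithmetic anyway. The set sizes are the ones implied above (2750 spams per set, and 4000 hams per set since 1% is 40); the function names are made up for illustration.

```python
SPAMS_PER_SET = 2750  # from the text: 1% is 27.5 spams
HAMS_PER_SET = 4000   # from the text: 1% is 40 hams


def error_count(rate_percent, set_size):
    """Number of misclassified messages behind a printed error rate."""
    return round(rate_percent / 100.0 * set_size)


def rate_step(set_size):
    """Smallest possible change in the error rate, in percentage points."""
    return 100.0 / set_size


def relative_step(rate_percent, set_size):
    """Relative change (in percent) caused by one more or one fewer error."""
    count = error_count(rate_percent, set_size)
    if count == 0:
        raise ValueError("relative change is undefined with zero errors")
    return 100.0 / count


if __name__ == "__main__":
    print(error_count(0.364, SPAMS_PER_SET))    # false negatives behind 0.364%
    print(relative_step(0.364, SPAMS_PER_SET))  # effect of one message, in percent
    print(rate_step(HAMS_PER_SET))              # one ham, in percentage points
```

The last line also accounts for the f-p columns moving in steps of 0.025: that is exactly one ham out of 4000.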