[Spambayes] test sets?
Tim Peters
tim.one@comcast.net
Fri, 06 Sep 2002 12:21:51 -0400
[Tim]
> ...
> Unfortunately, on my corpora it turns out to be *too* strong,
> ...
Here's what happens if I leave all the header counts in:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.100 0.025 won -75.00%
0.000 0.000 tied
0.025 0.000 won -100.00%
0.025 0.000 won -100.00%
0.100 0.025 won -75.00%
0.025 0.000 won -100.00%
0.025 0.000 won -100.00%
0.050 0.000 won -100.00%
0.100 0.000 won -100.00%
0.025 0.000 won -100.00%
0.025 0.025 tied
0.025 0.000 won -100.00%
0.025 0.000 won -100.00%
0.025 0.000 won -100.00%
0.025 0.025 tied
0.000 0.000 tied
0.025 0.000 won -100.00%
0.100 0.025 won -75.00%
won 14 times
tied 6 times
lost 0 times
total unique fp went from 9 to 2
false negative percentages
0.364 0.145 won -60.16%
0.400 0.291 won -27.25%
0.400 0.364 won -9.00%
0.909 0.618 won -32.01%
0.836 0.545 won -34.81%
0.618 0.473 won -23.46%
0.291 0.291 tied
1.018 0.654 won -35.76%
0.982 0.655 won -33.30%
0.727 0.545 won -25.03%
0.800 0.618 won -22.75%
1.163 0.872 won -25.02%
0.764 0.545 won -28.66%
0.473 0.291 won -38.48%
0.473 0.327 won -30.87%
0.727 0.509 won -29.99%
0.655 0.400 won -38.93%
0.509 0.218 won -57.17%
0.545 0.364 won -33.21%
0.509 0.436 won -14.34%
won 19 times
tied 1 times
lost 0 times
total unique fn went from 168 to 124
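The won/tied lines in both tables read as the relative change from the first rate to the second. Here is a minimal sketch of that arithmetic, assuming that is how the columns are derived. The function name `compare` and the zero-baseline wording are made up for illustration and are not taken from the actual comparison script:

```python
def compare(before, after):
    """Format one comparison line from two error rates given in percent."""
    if after == before:
        return f"{before:.3f} {after:.3f} tied"
    verdict = "won" if after < before else "lost"
    if before == 0:
        # Assumption: a relative change from zero is undefined, so just say so.
        return f"{before:.3f} {after:.3f} {verdict} (from zero)"
    # Relative change, in percent of the original rate.
    change = (after - before) / before * 100.0
    return f"{before:.3f} {after:.3f} {verdict} {change:+.2f}%"


print(compare(0.100, 0.025))
print(compare(0.364, 0.145))
print(compare(0.291, 0.291))
```

A negative percentage means the error rate dropped, which is why every "won" line carries a minus sign.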
A false positive *really* has to work hard then, eh? The long quote of a
Nigerian scam letter is one of the two that made it, and spamprob() looked
at all this stuff before deciding it was spam:
prob = 0.999945196947
prob('domestic') = 0.99
prob('dollars)') = 0.99
prob('solicit') = 0.99
prob('partner.') = 0.99
prob('accounts,') = 0.99
prob('federal') = 0.99
prob('nigeria.') = 0.99
prob('ministry') = 0.99
prob('subject:Business') = 0.99
prob('overseas') = 0.99
prob('housing') = 0.99
prob('nigeria') = 0.99
prob('nigerian') = 0.99
prob('estate') = 0.99
prob('70%') = 0.99
prob('regime') = 0.99
prob('payment') = 0.99
prob('header:X-Complaints-To:1') = 0.01
prob('header:X-BeenThere:1') = 0.01
prob('header:NNTP-Posting-Host:1') = 0.01
prob('ended.') = 0.01
prob('wrote') = 0.01
prob('header:Path:1') = 0.01
prob('header:NNTP-Posting-Date:1') = 0.01
prob('header:X-Mailman-Version:1') = 0.01
prob('header:List-Id:1') = 0.01
prob('header:List-Archive:1') = 0.01
prob('header:X-Trace:1') = 0.01
prob('header:Organization:1') = 0.01
prob('header:Newsgroups:1') = 0.01
prob('header:List-Post:1') = 0.01
prob('header:References:1') = 0.01
prob('header:List-Help:1') = 0.01
prob('header:X-Newsreader:1') = 0.01
prob('states') = 0.959986
prob('united') = 0.96139
prob('money') = 0.964852
prob('country.') = 0.97034
prob('civil') = 0.96754
prob('partner') = 0.969003
prob('complex,') = 0.01
prob('funds') = 0.972142
prob('million') = 0.971369
prob('purchase') = 0.986651
prob('government') = 0.985578
prob('header:Precedence:1') = 0.0306554
prob('header:Xref:1') = 0.01
prob('header:List-Subscribe:1') = 0.01
prob('header:List-Unsubscribe:1') = 0.01
prob('header:Errors-To:1') = 0.0182013
It actually found more 0.01 clues than 0.99 ones then, but the content is
*so* bad nothing can overcome the judgment of guilt.
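To see why the extra 0.01 clues don't rescue the message, here is a rough sketch that combines the clue probabilities listed above. It assumes a Graham-style combination, prod(p) / (prod(p) + prod(1-p)). The post does not say how spamprob() combines clues, so treat `combine` as a hypothetical stand-in rather than the real code:

```python
def combine(probs):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam_product = 1.0
    ham_product = 1.0
    for p in probs:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


# Clue probabilities copied from the listing above.
extreme_spam = [0.99] * 17  # 'domestic' through 'payment'
extreme_ham = [0.01] * 21   # the header clues plus 'ended.', 'wrote', 'complex,'
others = [
    0.959986, 0.96139, 0.964852, 0.97034, 0.96754, 0.969003,
    0.972142, 0.971369, 0.986651, 0.985578, 0.0306554, 0.0182013,
]

all_clues = combine(extreme_spam + extreme_ham + others)

# In this formula each 0.99 cancels exactly one 0.01,
# which leaves a net of four 0.01 clues.
after_cancelling = combine([0.01] * 4 + others)

print(all_clues, after_cancelling)
```

Under that assumption both results should land very close to the prob printed at the top of the listing. The seventeen 0.99 clues wipe out seventeen of the 0.01 clues, and the ten remaining spammy words in the 0.96 to 0.99 range outweigh the four 0.01 clues that are left.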
BTW, the false negative rate in my corpora is also getting near the point
where I won't be able to measure improvement reliably. Since there are only
2750 spams in a spam set, 1% is 27.5 spams, whereas in the ham corpus 1% is
40 hams. So, e.g., an f-n rate of 0.364% means a grand total of 10 false
negatives, and changing that by even 1 measly msg makes a 10% relative
difference in the f-n rate.
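The granularity arithmetic can be checked with a few lines. This is only a sketch: the 4000-ham set size is inferred from "1% is 40 hams", and the helper names are invented for illustration:

```python
SPAMS_PER_SET = 2750
HAMS_PER_SET = 4000  # inferred from "1% is 40 hams"


def errors_from_rate(rate_percent, set_size):
    """Convert an error rate in percent back to a whole number of messages."""
    return round(rate_percent / 100.0 * set_size)


def relative_step(n_errors):
    """Relative change (in percent) caused by one more or one fewer error."""
    return 100.0 / n_errors


n_fn = errors_from_rate(0.364, SPAMS_PER_SET)
print(n_fn, relative_step(n_fn))

# Smallest possible movement in each rate, in percentage points.
print(100.0 / SPAMS_PER_SET, 100.0 / HAMS_PER_SET)
```

One message is the smallest step either rate can take, so with only about ten errors in a run, any single flip shows up as a swing of roughly 10% in the comparison tables.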