[Spambayes] There Can Be Only One
Greg Ward
gward@python.net
Thu, 26 Sep 2002 09:54:38 -0400
On 25 September 2002, I said:
> On 25 September 2002, Tim Peters said:
> Yes, I've been running tests all afternoon and evening. Vague,
> hand-wavey results:
>
> * my histograms are not terribly normal -- not as weird as Guido's,
> but not nearly as nice as Tim's
> * I think my peaks are better separated though -- there's a pretty
> wide range for spam_cutoff
> * I'm one of the few who seems to win by setting spam_cutoff < 0.5
OK, here's some more detail. First, my corpus: I had to cobble together
a few sources of email in order to make it into the Big Leagues, ie. have
2000 spam to play with. Here are my sources:
* python.org Sept 2002 harvest:
1895 spam
3821 normal ham
1662 dsn
5483 total ham
* spam destined for gward@python.net, but detected by SpamAssassin
and set aside by my .procmailrc on starship.python.net, from
2002-02-11 to 2002-08-07 (ie. from when I started using SA
on starship until I replaced qmail with Exim). That's 1580
spams.
There are two obvious artifacts of this spam collection: lots of
messages have "To: gward@python.net", and all of them have
"Received: ... by starship.python.net". So I counterbalanced it
with...
* the contents of all my personal inboxes as of about noon yesterday
-- 1333 hams. These have the same two artifacts as the
gward@python.net spam collection, which seems to have prevented
eg. "To: gward@python.net" from being a very good clue either way.
But since they were sent on from starship to my various ISP
accounts (3 or 4 variations over the time that this mail has
been received), they were covered with "Received" headers
giving that away. So I removed all "Received" headers up to
the "Received: ... by starship.python.net", which seems to
have done the trick. There are a few messages that were sent
straight to one of my ISP addresses, but not many since I never
use them publicly.
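The header-stripping step described above can be sketched roughly as
follows (this is a hypothetical helper, not the actual script I used;
it re-adds the kept Received headers at the end of the header block,
which doesn't matter for tokenizing):

```python
import email

def strip_received_until(msg_text, host):
    """Drop leading Received headers added after the message left `host`,
    keeping the header stamped by `host` itself and everything below it.
    (A sketch of the cleanup described above, not the real script.)"""
    msg = email.message_from_string(msg_text)
    received = msg.get_all("Received") or []
    # Find the first Received header stamped by the front-line server.
    keep_from = 0
    for i, hdr in enumerate(received):
        if "by " + host in hdr:
            keep_from = i
            break
    # Delete all Received headers, then re-add the ones we're keeping.
    del msg["Received"]
    for hdr in received[keep_from:]:
        msg["Received"] = hdr
    return msg.as_string()
```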
The biggest artifact remaining is that all of the gward-spam was
received by qmail, whereas most of the other mail was received by Exim.
This leaks into the "best discriminators" list:
'received:0000' 87 0.655303
'received:HELO' 96 0.725595
'received:unknown' 96 0.746786
Almost every initial "Received" line in my gward-spam collection looks
like
Received: from unknown (HELO yahoo.com) (218.232.230.20)
by starship.python.net with SMTP; 28 Apr 2002 19:12:41 -0000
If Exim had received the same message, it would slap on something like
this instead:
Received: from [218.232.230.20] (helo=yahoo.com)
by starship.python.net with smtp (Exim 4.05)
id 17sNGM-0004Ch-00
for gward@python.net; Sun 28 Apr 2002 15:12:41 -0400
so the above three clues mean that "qmail receives more spam than ham"
-- because 100% (1580 msgs) of gward-spam is spam received by qmail, and
1077/1333 messages in gward-ham were received by qmail.
The flip side of the coin:
'received:helo' 133 0.38154
'received:for' 163 0.393567
'received:4.05' 166 0.389087
'received:0400' 167 0.397625
'received:esmtp' 177 0.163725
'received:Exim' 183 0.364903
mean "Exim receives a bit more ham than spam" -- unless the protocol is
ESMTP, which is a fairly good ham indicator. That probably *is* a valid
clue, rather than a qmail-vs-Exim artifact. Interesting.
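For reference, a rough sketch of where 'received:*' clues like the ones
above could come from: split the Received header on runs of letters,
digits, and dots, and prefix each piece. (The real Spambayes tokenizer
is more elaborate than this, so treat it as an illustration only.)

```python
import re

def received_tokens(header_value):
    # Split the header on runs of alphanumerics/dots and prefix each
    # word, mimicking the 'received:HELO' style clues shown above.
    return ["received:" + w for w in re.findall(r"[A-Za-z0-9.]+", header_value)]

qmail_hdr = ("from unknown (HELO yahoo.com) (218.232.230.20) "
             "by starship.python.net with SMTP; 28 Apr 2002 19:12:41 -0000")
tokens = received_tokens(qmail_hdr)
# tokens includes 'received:unknown', 'received:HELO', 'received:0000'
```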
Anyway, that little exploration has me wondering just how valid my data
is. I should probably rerun everything without looking at "Received"
headers at all (except to count them -- for the most part, they stop at
either mail.python.org or starship.python.net, which are the front-line
servers for these two collections).
Right, onto the results. First, I'm a little unclear on how everyone
else has been generating the results they've been posting. I did this:
[...several test runs...]
timcv.py -n10 --ham=200 --spam=200 -s54321 > timcv-run5.log
[...tweak .ini file a couple of times...]
timcv.py -n10 --ham=200 --spam=200 -s54321 > timcv-run7.log
run5 is Graham, run7 is f(w) with spam_cutoff=0.475 (based on the
results of run6, which I'm not showing here).
Then I ran
rates.py timcv-run5.log
rates.py timcv-run7.log
and then
cmp.py timcv-run5.logs.txt timcv-run7.logs.txt
which gives me this:
"""
timcv-run5.logs.txt -> timcv-run7.logs.txt
[...]
false positive percentages
0.000 0.500 lost +(was 0)
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
1.000 0.000 won -100.00%
0.000 0.000 tied
0.000 0.000 tied
0.500 0.000 won -100.00%
1.000 1.000 tied
won 2 times
tied 7 times
lost 1 times
total unique fp went from 6 to 4 won -33.33%
mean fp % went from 0.3 to 0.2 won -33.33%
false negative percentages
2.000 2.000 tied
1.500 2.500 lost +66.67%
1.000 2.000 lost +100.00%
1.500 2.000 lost +33.33%
1.000 0.500 won -50.00%
0.500 1.500 lost +200.00%
1.000 1.000 tied
2.000 1.500 won -25.00%
1.000 1.000 tied
1.000 1.500 lost +50.00%
won 2 times
tied 3 times
lost 5 times
total unique fn went from 25 to 31 lost +24.00%
mean fn % went from 1.25 to 1.55 lost +24.00%
ham mean ham sdev
0.23 20.68 +8891.30% 3.20 8.87 +177.19%
0.00 20.99 +(was 0) 0.00 9.40 +(was 0)
0.00 20.02 +(was 0) 0.00 8.17 +(was 0)
0.50 21.70 +4240.00% 7.03 9.84 +39.97%
0.00 20.59 +(was 0) 0.00 8.91 +(was 0)
0.99 20.70 +1990.91% 9.89 10.06 +1.72%
0.00 19.02 +(was 0) 0.00 8.47 +(was 0)
0.00 19.41 +(was 0) 0.00 8.65 +(was 0)
0.50 18.99 +3698.00% 7.05 7.96 +12.91%
1.00 21.73 +2073.00% 9.95 10.64 +6.93%
ham mean and sdev for all runs
0.32 20.38 +6268.75% 5.55 9.18 +65.41%
spam mean spam sdev
98.41 80.22 -18.48% 11.97 11.30 -5.60%
98.48 79.66 -19.11% 12.15 11.51 -5.27%
99.00 79.41 -19.79% 9.95 12.24 +23.02%
98.50 79.70 -19.09% 12.16 11.44 -5.92%
99.00 79.93 -19.26% 9.95 11.49 +15.48%
99.50 79.48 -20.12% 7.05 10.94 +55.18%
99.00 80.18 -19.01% 9.95 11.01 +10.65%
98.00 78.66 -19.73% 14.00 11.04 -21.14%
99.00 78.52 -20.69% 9.95 11.55 +16.08%
99.00 79.80 -19.39% 9.95 11.25 +13.07%
spam mean and sdev for all runs
98.79 79.56 -19.47% 10.87 11.40 +4.88%
ham/spam mean difference: 98.47 59.18 -39.29
"""
Here are the histograms for run5 (Graham):
* = 34 items
0.00 1993 ***********************************************************
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 0
42.50 0
45.00 1 *
47.50 0
50.00 0
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 6 *
-> <stat> Spam scores for all runs: 2000 items; mean 98.79; sdev 10.87
* = 33 items
0.00 23 *
2.50 0
5.00 1 *
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 0
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 1 *
85.00 0
87.50 0
90.00 1 *
92.50 0
95.00 1 *
97.50 1973 ************************************************************
and for run7 (Robinson f(w)):
-> <stat> Ham scores for all runs: 2000 items; mean 20.38; sdev 9.18
* = 5 items
0.00 0
2.50 41 *********
5.00 96 ********************
7.50 144 *****************************
10.00 148 ******************************
12.50 146 ******************************
15.00 206 ******************************************
17.50 202 *****************************************
20.00 253 ***************************************************
22.50 212 *******************************************
25.00 173 ***********************************
27.50 124 *************************
30.00 77 ****************
32.50 51 ***********
35.00 41 *********
37.50 26 ******
40.00 13 ***
42.50 13 ***
45.00 11 ***
47.50 11 ***
50.00 4 *
52.50 4 *
55.00 1 *
57.50 3 *
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 0
-> <stat> Spam scores for all runs: 2000 items; mean 79.56; sdev 11.40
* = 3 items
0.00 0
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 2 *
40.00 1 *
42.50 5 **
45.00 4 **
47.50 6 **
50.00 6 **
52.50 7 ***
55.00 18 ******
57.50 41 **************
60.00 47 ****************
62.50 68 ***********************
65.00 105 ***********************************
67.50 115 ***************************************
70.00 160 ******************************************************
72.50 141 ***********************************************
75.00 149 **************************************************
77.50 128 *******************************************
80.00 118 ****************************************
82.50 154 ****************************************************
85.00 159 *****************************************************
87.50 171 *********************************************************
90.00 119 ****************************************
92.50 85 *****************************
95.00 75 *************************
97.50 116 ***************************************
-> best cutoff for all runs: 0.5
-> with 12 fp + 18 fn = 30 mistakes
Oops, just noticed the "best cutoff for all runs" thing. I must have
misinterpreted the run6 output -- picking 0.475 was an eyeball average.
D'ohh.
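Presumably that "best cutoff" line comes from scanning candidate
cutoffs over the pooled ham and spam scores and minimizing total
mistakes. Something like this sketch (hypothetical, not the actual
timcv.py code):

```python
def best_cutoff(ham_scores, spam_scores, steps=200):
    # Try cutoffs on a uniform grid over [0, 1]; a ham scoring at or
    # above the cutoff is a false positive, a spam scoring below it a
    # false negative.  Return (cutoff, mistakes) minimizing fp + fn.
    best = (0.0, len(ham_scores) + len(spam_scores))
    for i in range(steps + 1):
        cutoff = i / float(steps)
        fp = sum(1 for s in ham_scores if s >= cutoff)
        fn = sum(1 for s in spam_scores if s < cutoff)
        if fp + fn < best[1]:
            best = (cutoff, fp + fn)
    return best
```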
Off to re-run things, without mine_received and with a better
spam_cutoff.
Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
A closed mouth gathers no foot.