[Spambayes] Mining the headers
Skip Montanaro
skip@pobox.com
Sun Oct 27 05:37:51 2002
>> I've had three other options knocking around locally which haven't
>> seemed to help or hurt.... Should I check them in....
Alex> Yes, I'd love to test them.
Done. Note that I deleted the mine_date_headers option. It was just a
gatekeeper for the other two, which seemed pointless to me. Here's my
latest run. The first run (run1) used the defaults; the second (dates)
used my dates.ini file:
[Tokenizer]
generate_time_buckets: True
extract_dow: True
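For anyone wondering what these two options might feed the classifier, here is a rough sketch of the idea, not the actual tokenizer code. The function name, the token spellings ("time:", "dow:", "date:invalid") and the six-minute bucket width are all assumptions made for illustration.

```python
import datetime
from email.utils import parsedate_tz

def date_tokens(date_header, generate_time_buckets=True, extract_dow=True):
    """Yield illustrative tokens for one Date: header value (hypothetical)."""
    parsed = parsedate_tz(date_header)
    if parsed is None:
        # An unparseable date is itself a clue worth a token.
        yield "date:invalid"
        return
    year, month, day, hour, minute = parsed[:5]
    if generate_time_buckets:
        # Ten buckets per hour; the six-minute width is an assumption.
        yield "time:%02d:%d" % (hour, minute // 6)
    if extract_dow:
        try:
            # Monday is 0, Sunday is 6.
            yield "dow:%d" % datetime.date(year, month, day).weekday()
        except ValueError:
            yield "dow:invalid"

print(list(date_tokens("Sun, 27 Oct 2002 05:37:51 -0600")))
```

With both flags on, the sample header yields one time-bucket token and one day-of-week token; turning a flag off simply drops that token.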
The results:
run1s -> datess
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
... etc ...
false positive percentages
0.500 0.500 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
0.500 0.500 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 5 to 5 tied
mean fp % went from 0.25 to 0.25 tied
false negative percentages
0.000 0.000 tied
0.000 0.000 tied
1.000 1.000 tied
1.000 1.000 tied
0.500 0.500 tied
1.000 0.500 won -50.00%
0.500 0.500 tied
1.500 1.500 tied
0.000 0.000 tied
2.000 2.000 tied
won 1 times
tied 9 times
lost 0 times
total unique fn went from 15 to 14 won -6.67%
mean fn % went from 0.75 to 0.7 won -6.67%
ham mean                     ham sdev
   1.38    1.38   +0.00%       10.18   10.17   -0.10%
   0.42    0.43   +2.38%        3.77    3.78   +0.27%
   0.98    0.98   +0.00%        8.39    8.36   -0.36%
   0.17    0.21  +23.53%        1.05    1.52  +44.76%
   0.93    0.93   +0.00%        7.73    7.73   +0.00%
   1.40    1.40   +0.00%        8.36    8.39   +0.36%
   1.18    1.14   -3.39%        7.39    7.24   -2.03%
   0.73    0.74   +1.37%        7.54    7.54   +0.00%
   0.97    0.98   +1.03%        6.62    6.72   +1.51%
   0.79    0.79   +0.00%        7.74    7.74   +0.00%
ham mean and sdev for all runs
   0.89    0.90   +1.12%        7.32    7.32   +0.00%
spam mean                    spam sdev
  99.17   99.16   -0.01%        4.63    4.71   +1.73%
  98.65   98.66   +0.01%        6.34    6.27   -1.10%
  96.71   96.71   +0.00%       13.73   13.74   +0.07%
  96.74   96.73   -0.01%       13.46   13.46   +0.00%
  98.44   98.46   +0.02%        9.25    9.23   -0.22%
  97.35   97.36   +0.01%       12.00   11.92   -0.67%
  98.33   98.34   +0.01%        9.55    9.53   -0.21%
  97.17   97.17   +0.00%       13.68   13.68   +0.00%
  98.94   98.93   -0.01%        6.89    6.90   +0.15%
  97.46   97.45   -0.01%       13.72   13.73   +0.07%
spam mean and sdev for all runs
  97.89   97.90   +0.01%       10.87   10.86   -0.09%
ham/spam mean difference: 97.00 97.00 +0.00
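In case the "won -6.67%" style figures look odd: they are consistent with a plain relative change from the first column to the second, where a drop in an error rate counts as a win. A minimal sketch of that arithmetic follows; the function names are mine, not the comparison script's.

```python
def relative_change(before, after):
    """Percentage change from before to after; negative means a reduction."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100.0

def verdict(before, after):
    """For error rates, lower is better."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"

# Total unique false negatives went from 15 to 14.
print("%s %+.2f%%" % (verdict(15, 14), relative_change(15, 14)))
```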
Here's the cost table:
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
... yeah, yeah, yeah, enough already! ...
filename:       run1     dates
ham:spam:  2000:2000 2000:2000
fp total:          5         5
fp %:           0.25      0.25
fn total:         15        14
fn %:           0.75      0.70
unsure t:         93        93
unsure %:       2.33      2.33
real cost:    $83.60    $82.60
best cost:    $53.80    $53.60
h mean:         0.89      0.90
h sdev:         7.32      7.32
s mean:        97.89     97.90
s sdev:        10.87     10.86
mean diff:     97.00     97.00
k:              5.33      5.34
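The "real cost" row is consistent with charging $10 per false positive, $1 per false negative and $0.20 per unsure message. Those weights are an assumption inferred from the totals in the table, not read out of the options file, and the sketch below only reproduces that one row (it says nothing about how "best cost" is found).

```python
# Assumed per-message costs in dollars; inferred from the table, not confirmed.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20

def real_cost(fp, fn, unsure):
    """Total cost of a run given its fp, fn and unsure counts."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST

for name, fp, fn, unsure in [("run1", 5, 15, 93), ("dates", 5, 14, 93)]:
    print("%-6s $%.2f" % (name, real_cost(fp, fn, unsure)))
```

Under these weights the single false negative saved by the date options accounts for the whole $1.00 difference between the two runs.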
Note that my numbers seem to be getting a lot better. My ham/spam
collection has slowly gotten cleaner and I've been adding more new stuff,
not to mention that the default scheme (chi2?) seems a lot more
sensitive/accurate. I noticed that as I lopped off old messages, first
those from 1999 and before, then those from 2000, the accuracy improved.
That suggests two things to me: first, the nature of "what is spam?" has
changed a bit, and second, someone ought to test this notion. ;-)
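One way somebody might set up that test is to trim each corpus by the year in the Date: header and rerun the same comparison at several cutoffs. A minimal sketch, assuming messages are held as raw strings; the helper names are hypothetical and undated messages are simply dropped.

```python
from email import message_from_string
from email.utils import parsedate_tz

def message_year(raw_message):
    """Return the year from the Date: header, or None if it can't be parsed."""
    msg = message_from_string(raw_message)
    parsed = parsedate_tz(msg.get("Date", ""))
    return parsed[0] if parsed else None

def drop_older_than(raw_messages, first_year):
    """Keep messages dated first_year or later."""
    kept = []
    for raw in raw_messages:
        year = message_year(raw)
        if year is not None and year >= first_year:
            kept.append(raw)
    return kept
```

Running the usual cross-validation on drop_older_than(corpus, 2000), then 2001, and so on would show whether accuracy really tracks message age or just corpus cleanliness.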
thanks-to-uncle-timmy-for-the-extra-hour-ly, y'rs,
Skip