[Spambayes] new option: generate_long_skips
Skip Montanaro
skip@pobox.com
Mon, 30 Sep 2002 17:07:49 -0500
I just checked in a new option for the tokenizer: generate_long_skips. The
default is True. While reviewing my false positives I noticed that one was
overwhelmingly dominated by long-skip tokens (which scored very high)
because it contained an Excel spreadsheet attachment. Setting the option to
False suppresses those tokens.
I compared two configurations. "cutoff.ini" is
[TestDriver]
spam_cutoff: 0.4
while "noskips.ini" is
[Tokenizer]
generate_long_skips: False
[TestDriver]
spam_cutoff: 0.4
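
For readers who have not looked at the tokenizer, here is a minimal sketch of how an option like this could gate token generation. It is not the checked-in code: the function name tokenize_word, the 12-character word limit, and the exact "skip:" token format are assumptions made for illustration. The config text is the "noskips.ini" shown above, read with Python's standard configparser.

```python
import configparser

# The "noskips.ini" contents from the message, embedded so the sketch runs as-is.
NOSKIPS_INI = """\
[Tokenizer]
generate_long_skips: False
[TestDriver]
spam_cutoff: 0.4
"""

MAX_WORD_SIZE = 12  # assumed limit, for illustration only


def load_options(text):
    """Parse ini-style text; configparser accepts 'key: value' pairs."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def tokenize_word(word, generate_long_skips=True):
    """Yield the word itself, or a summary token if the word is too long.

    Hypothetical helper: the summary keeps only the first character and
    the length rounded down to a multiple of 10.
    """
    n = len(word)
    if n <= MAX_WORD_SIZE:
        yield word
    elif generate_long_skips:
        yield "skip:%s %d" % (word[0], n // 10 * 10)
    # With the option off, over-long words produce no token at all.


options = load_options(NOSKIPS_INI)
skips = options.getboolean("Tokenizer", "generate_long_skips")

# A 64-character run such as one line of an encoded binary attachment.
blob = "0M8R4KGxGuE" + "A" * 53
print(list(tokenize_word(blob, generate_long_skips=True)))
print(list(tokenize_word(blob, generate_long_skips=skips)))
```

A message carrying a large encoded attachment would contribute many such summary tokens with the option on and none with it off, which is the behavior the two config files are meant to compare.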
I am currently running a test with 10 sets of 200 messages per set. With 5
sets I got two more fn's and two fewer fp's:
cutoffs -> noskipss
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
false positive percentages
    1.500  1.000  won    -33.33%
    4.000  4.000  tied
    3.000  2.500  won    -16.67%
    1.500  1.500  tied
    1.000  1.000  tied
won   2 times
tied  3 times
lost  0 times
total unique fp went from 22 to 20 won     -9.09%
mean fp % went from 2.2 to 2.0 won     -9.09%
false negative percentages
    2.000  2.500  lost   +25.00%
    1.000  1.000  tied
    0.500  0.500  tied
    1.500  1.500  tied
    2.000  2.500  lost   +25.00%
won   0 times
tied  3 times
lost  2 times
total unique fn went from 14 to 16 lost   +14.29%
mean fn % went from 1.4 to 1.6 lost   +14.29%
ham mean                     ham sdev
  22.12   21.85   -1.22%        6.01    5.74   -4.49%
  23.46   23.25   -0.90%        7.31    6.93   -5.20%
  23.50   23.38   -0.51%        6.64    6.51   -1.96%
  23.54   23.32   -0.93%        6.88    6.87   -0.15%
  23.08   22.79   -1.26%        6.77    6.62   -2.22%
ham mean and sdev for all runs
  23.14   22.92   -0.95%        6.76    6.57   -2.81%
spam mean                    spam sdev
  72.49   71.95   -0.74%       13.82   14.02   +1.45%
  71.34   70.61   -1.02%       13.70   13.45   -1.82%
  73.12   72.58   -0.74%       12.88   12.80   -0.62%
  72.40   72.01   -0.54%       12.71   12.65   -0.47%
  70.71   70.10   -0.86%       13.91   13.74   -1.22%
spam mean and sdev for all runs
  72.01   71.45   -0.78%       13.44   13.37   -0.52%
ham/spam mean difference: 48.87 48.53 -0.34
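
The "won"/"lost" percentages in the listing above are relative changes from the first run to the second. A small sketch of that arithmetic, using the per-set false positive rates from the listing; the helper name relative_change is introduced here for illustration and is not taken from the comparison script:

```python
def relative_change(before, after):
    """Percent change from 'before' to 'after'; negative means the rate dropped."""
    return (after - before) / before * 100.0


# Per-set false positive percentages copied from the listing above.
fp_before = [1.5, 4.0, 3.0, 1.5, 1.0]
fp_after = [1.0, 4.0, 2.5, 1.5, 1.0]

for b, a in zip(fp_before, fp_after):
    if a == b:
        print("%.3f %.3f tied" % (b, a))
    else:
        # For error rates, a lower number in the second run is a win.
        verdict = "won" if a < b else "lost"
        print("%.3f %.3f %s %+.2f%%" % (b, a, verdict, relative_change(b, a)))

mean_before = sum(fp_before) / len(fp_before)
mean_after = sum(fp_after) / len(fp_after)
print("mean fp %% went from %.1f to %.1f %+.2f%%"
      % (mean_before, mean_after, relative_change(mean_before, mean_after)))
```

The same function applied to the unique fn counts, relative_change(14, 16), gives the +14.29% reported for false negatives.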
I think the option might be helpful for people whose ham tends to get the
occasional legitimate binary attachment. In any case, it's easier for people
to test it if I check it in. We can always remove it if it turns out not to
be generally useful. Also, having it checked in makes it easier for me to
add the date mining without that work interfering with this change.
Note the very low spam_cutoff of 0.4, which was suggested by an earlier run.
After I switched to it I got a dramatic drop in fn's.
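
To make the effect of the cutoff concrete, here is a toy sketch. It assumes a message is called spam when its score exceeds spam_cutoff, so lowering the cutoff can rescue spam that scored just under the old threshold, at the risk of pushing borderline ham over it. The scores are invented for illustration and count_errors is a hypothetical helper, not the test driver's code.

```python
def count_errors(ham_scores, spam_scores, spam_cutoff):
    """Return (false positives, false negatives) for a given cutoff."""
    fp = sum(1 for s in ham_scores if s > spam_cutoff)    # ham called spam
    fn = sum(1 for s in spam_scores if s <= spam_cutoff)  # spam called ham
    return fp, fn


# Made-up scores on a 0..1 scale, chosen only to show the trade-off.
ham_scores = [0.05, 0.18, 0.22, 0.31, 0.45]
spam_scores = [0.42, 0.48, 0.71, 0.80, 0.93]

for cutoff in (0.5, 0.4):
    fp, fn = count_errors(ham_scores, spam_scores, cutoff)
    print("spam_cutoff %.1f: %d fp, %d fn" % (cutoff, fp, fn))
```

With these toy numbers, moving the cutoff from 0.5 to 0.4 removes both false negatives and introduces one false positive.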
Skip