[spambayes-dev] [ 817813 ] Consider bad spelling a sign of spam
Tony Meyer
ta-meyer at ihug.co.nz
Wed Jan 7 18:13:30 EST 2004
The feature request says:
"""
Add a spelling checker and reasonable sized dictionary.
If more than xx% of the message is misspelled (esp
the subject), consider it to be spam. Many emails have
gotten past Spam Bayes recently because their spelling
is like "bfuqclvfphz".
"""
[http://sourceforge.net/tracker/?group_id=61702&atid=498106&func=detail&aid=817813]
"consider it to be spam" isn't something we do, of course :) I created a
patch to generate a token that reflects the percentage of words in the
message that are in a particular (English) dictionary. So that's exactly one
extra token per message, with at most 100 distinct new tokens overall.
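For anyone who wants the idea without reading the patch, here is a minimal sketch of how such a token could be generated. It is not the patch itself: the function name, the token format ("english-pct:NN"), the word-splitting regex, and the "none" token for messages with no words are all assumptions made for illustration.

```python
import re

# Hypothetical word splitter; the real patch may split words differently.
WORD_RE = re.compile(r"[A-Za-z']+")

def english_percentage_token(text, dictionary):
    """Return one token describing the share of words found in `dictionary`.

    `dictionary` is any container of lower-case words (a set is fastest).
    Exactly one token is returned per message, even for an empty message.
    """
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        # Assumption: still emit a token so every message gets exactly one.
        return "english-pct:none"
    known = sum(1 for w in words if w in dictionary)
    pct = (100 * known) // len(words)  # integer percentage, rounded down
    return "english-pct:%d" % pct

if __name__ == "__main__":
    toy_dictionary = {"the", "quick", "brown", "fox"}
    print(english_percentage_token("The quick bfuqclvfphz fox", toy_dictionary))
```

With the toy dictionary above, three of the four words are recognised, so the message gets the single token "english-pct:75". The classifier then learns from training whether that value leans ham or spam, rather than anything being declared spam outright.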
Results:
-> <stat> tested 357 hams & 395 spams against 3311 hams & 3704 spams
[all other stat lines are more-or-less the same as this, and have been
snipped]
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.246 0.246 tied
0.000 0.000 tied
0.000 0.000 tied
0.557 0.557 tied
0.559 0.279 won -50.09%
0.287 0.287 tied
0.000 0.000 tied
won 1 times
tied 9 times
lost 0 times
total unique fp went from 6 to 5 won -16.67%
mean fp % went from 0.164881884948 to 0.136948924055 won -16.94%
false negative percentages
0.253 0.253 tied
0.781 0.781 tied
0.462 0.462 tied
0.756 0.756 tied
0.243 0.243 tied
0.247 0.247 tied
0.240 0.240 tied
0.494 0.494 tied
0.973 0.973 tied
0.454 0.454 tied
won 0 times
tied 10 times
lost 0 times
total unique fn went from 20 to 20 tied
mean fn % went from 0.490257037938 to 0.490257037938 tied
ham mean ham sdev
1.18 1.18 +0.00% 7.76 7.72 -0.52%
0.99 0.99 +0.00% 6.64 6.62 -0.30%
0.84 0.85 +1.19% 6.14 6.22 +1.30%
1.99 2.03 +2.01% 9.46 9.61 +1.59%
0.49 0.51 +4.08% 3.59 3.64 +1.39%
0.85 0.85 +0.00% 5.45 5.45 +0.00%
1.16 1.17 +0.86% 9.30 9.29 -0.11%
1.20 1.20 +0.00% 8.13 8.00 -1.60%
1.55 1.56 +0.65% 8.05 8.07 +0.25%
0.47 0.48 +2.13% 3.22 3.28 +1.86%
ham mean and sdev for all runs
1.08 1.09 +0.93% 7.13 7.14 +0.14%
spam mean spam sdev
98.75 98.77 +0.02% 8.72 8.59 -1.49%
97.67 97.66 -0.01% 11.26 11.24 -0.18%
98.08 98.07 -0.01% 10.12 10.09 -0.30%
98.16 98.17 +0.01% 10.19 10.11 -0.79%
98.35 98.35 +0.00% 8.77 8.79 +0.23%
98.45 98.47 +0.02% 8.97 8.83 -1.56%
98.35 98.36 +0.01% 9.73 9.62 -1.13%
98.25 98.22 -0.03% 9.16 9.25 +0.98%
97.93 97.93 +0.00% 11.99 11.90 -0.75%
98.92 98.92 +0.00% 7.62 7.58 -0.52%
spam mean and sdev for all runs
98.30 98.30 +0.00% 9.72 9.66 -0.62%
ham/spam mean difference: 97.22 97.21 -0.01
I wondered whether 100 tokens was too many and whether bucketing would help,
so I changed it to truncate to the nearest 10%. The cmp.py results are
basically the same, but here's a table.py of the three. Note that with the
100 buckets the number of unsures went up, but with 10 there was still the
minor gain with the same number of unsures.
filename: bases engs eng10s
ham:spam: 3668:4099 3668:4099 3668:4099
fp total: 6 5 5
fp %: 0.16 0.14 0.14
fn total: 20 20 20
fn %: 0.49 0.49 0.49
unsure t: 178 183 178
unsure %: 2.29 2.36 2.29
real cost: $115.60 $106.60 $105.60
best cost: $93.00 $94.20 $93.60
h mean: 1.08 1.09 1.09
h sdev: 7.13 7.14 7.15
s mean: 98.30 98.30 98.30
s sdev: 9.72 9.66 9.68
mean diff: 97.22 97.21 97.21
k: 5.77 5.79 5.78
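The bucketing used for the eng10s column can be sketched roughly as below. This is only an illustration of truncating a percentage down to a multiple of 10: the function name and the `width` parameter are hypothetical, and whether the patch keeps 100% as its own bucket or folds it into the top one is not something this sketch settles (here it stays separate).

```python
def bucket_percentage(pct, width=10):
    """Truncate an integer percentage (0..100) down to a multiple of `width`."""
    if not 0 <= pct <= 100:
        raise ValueError("percentage out of range: %r" % (pct,))
    return (pct // width) * width

if __name__ == "__main__":
    for pct in (0, 9, 75, 99, 100):
        print(pct, "->", bucket_percentage(pct))
```

The point of the coarser buckets is that each token value is seen more often in training, so its spam probability is estimated from more messages.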
Bigram results aren't so great. Results with the original 100 buckets:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.279 0.279 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 1 to 1 tied
mean fp % went from 0.0278551532033 to 0.0278551532033 tied
false negative percentages
0.253 0.253 tied
1.042 1.042 tied
0.693 0.693 tied
0.252 0.252 tied
0.728 0.728 tied
0.000 0.000 tied
0.481 0.481 tied
0.494 0.494 tied
0.730 0.730 tied
0.227 0.227 tied
won 0 times
tied 10 times
lost 0 times
total unique fn went from 20 to 20 tied
mean fn % went from 0.489899714703 to 0.489899714703 tied
ham mean ham sdev
0.95 0.98 +3.16% 6.64 6.86 +3.31%
0.83 0.82 -1.20% 5.53 5.49 -0.72%
0.49 0.47 -4.08% 4.08 4.10 +0.49%
1.53 1.55 +1.31% 8.16 8.29 +1.59%
0.30 0.31 +3.33% 3.25 3.26 +0.31%
0.70 0.70 +0.00% 5.27 5.27 +0.00%
0.85 0.83 -2.35% 7.11 7.06 -0.70%
0.93 0.90 -3.23% 7.23 7.02 -2.90%
0.90 0.88 -2.22% 6.47 6.36 -1.70%
0.41 0.41 +0.00% 4.07 4.07 +0.00%
ham mean and sdev for all runs
0.80 0.79 -1.25% 6.01 6.01 +0.00%
spam mean spam sdev
98.71 98.74 +0.03% 7.83 7.78 -0.64%
97.38 97.36 -0.02% 12.55 12.54 -0.08%
97.78 97.77 -0.01% 11.09 11.06 -0.27%
97.89 97.87 -0.02% 10.49 10.49 +0.00%
97.90 97.94 +0.04% 10.03 9.97 -0.60%
98.32 98.29 -0.03% 8.63 8.74 +1.27%
98.19 98.21 +0.02% 10.21 10.12 -0.88%
97.68 97.56 -0.12% 10.99 11.18 +1.73%
97.86 97.88 +0.02% 11.56 11.56 +0.00%
98.73 98.72 -0.01% 7.57 7.65 +1.06%
spam mean and sdev for all runs
98.05 98.04 -0.01% 10.20 10.21 +0.10%
ham/spam mean difference: 97.25 97.25 +0.00
And with 10 buckets:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.279 0.279 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 1 to 1 tied
mean fp % went from 0.0278551532033 to 0.0278551532033 tied
false negative percentages
0.253 0.253 tied
1.042 1.042 tied
0.693 0.693 tied
0.252 0.252 tied
0.728 0.728 tied
0.000 0.000 tied
0.481 0.721 lost +49.90%
0.494 0.741 lost +50.00%
0.730 0.730 tied
0.227 0.227 tied
won 0 times
tied 8 times
lost 2 times
total unique fn went from 20 to 22 lost +10.00%
mean fn % went from 0.489899714703 to 0.538629534266 lost +9.95%
ham mean ham sdev
0.95 0.98 +3.16% 6.64 6.86 +3.31%
0.83 0.81 -2.41% 5.53 5.48 -0.90%
0.49 0.47 -4.08% 4.08 4.07 -0.25%
1.53 1.55 +1.31% 8.16 8.24 +0.98%
0.30 0.31 +3.33% 3.25 3.26 +0.31%
0.70 0.70 +0.00% 5.27 5.28 +0.19%
0.85 0.84 -1.18% 7.11 7.07 -0.56%
0.93 0.90 -3.23% 7.23 7.14 -1.24%
0.90 0.91 +1.11% 6.47 6.50 +0.46%
0.41 0.42 +2.44% 4.07 4.15 +1.97%
ham mean and sdev for all runs
0.80 0.80 +0.00% 6.01 6.03 +0.33%
spam mean spam sdev
98.71 98.74 +0.03% 7.83 7.82 -0.13%
97.38 97.39 +0.01% 12.55 12.57 +0.16%
97.78 97.80 +0.02% 11.09 11.10 +0.09%
97.89 97.91 +0.02% 10.49 10.42 -0.67%
97.90 97.96 +0.06% 10.03 9.92 -1.10%
98.32 98.29 -0.03% 8.63 8.80 +1.97%
98.19 98.20 +0.01% 10.21 10.23 +0.20%
97.68 97.58 -0.10% 10.99 11.30 +2.82%
97.86 97.90 +0.04% 11.56 11.41 -1.30%
98.73 98.73 +0.00% 7.57 7.63 +0.79%
spam mean and sdev for all runs
98.05 98.05 +0.00% 10.20 10.22 +0.20%
ham/spam mean difference: 97.25 97.25 +0.00
And a table for the unsures:
filename: basebis eng_bis eng_bi10s
ham:spam: 3668:4099 3668:4099 3668:4099
fp total: 1 1 1
fp %: 0.03 0.03 0.03
fn total: 20 20 22
fn %: 0.49 0.49 0.54
unsure t: 207 209 206
unsure %: 2.67 2.69 2.65
real cost: $71.40 $71.80 $73.20
best cost: $65.60 $64.00 $64.00
h mean: 0.80 0.79 0.80
h sdev: 6.01 6.01 6.03
s mean: 98.05 98.04 98.05
s sdev: 10.20 10.21 10.22
mean diff: 97.25 97.25 97.25
k: 6.00 6.00 5.98
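As a sanity check on the "real cost" rows in the two tables, the figures are consistent with charging $10 per false positive, $1 per false negative and $0.20 per unsure. Those weights are inferred from the numbers in this message rather than read out of table.py, so treat the sketch below as an assumption that happens to reproduce the rows; the function name is made up.

```python
def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a run; the default weights are inferred, not authoritative."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

# (fp total, fn total, unsure total) copied from the two tables in this message.
RUNS = {
    "bases":     (6, 20, 178),
    "engs":      (5, 20, 183),
    "eng10s":    (5, 20, 178),
    "basebis":   (1, 20, 207),
    "eng_bis":   (1, 20, 209),
    "eng_bi10s": (1, 22, 206),
}

if __name__ == "__main__":
    for name, (fp, fn, unsure) in RUNS.items():
        print("%-10s $%.2f" % (name, real_cost(fp, fn, unsure)))
```

Seen this way, the unigram gain comes entirely from the one false positive that went away, and the 10-bucket bigram run costs more because of its two extra false negatives.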
If you want to test this and use the same dictionary I did, then you can get
it here:
<http://www.massey.ac.nz/~tameyer/research/english_words.txt> (1.5Mb).
It was just a random one I found, though - I'm not claiming that it's
fantastic or anything :)
=Tony Meyer