[spambayes-dev] [ 817813 ] Consider bad spelling a sign of spam

Wed Jan 7 18:13:30 EST 2004

The feature request says:
"""
Add a spelling checker and reasonable sized dictionary.
If more than xx% of the message is misspelled (esp
the subject), consider it to be spam. Many emails have
gotten past Spam Bayes recently because their spelling
is like "bfuqclvfphz".
"""
[http://sourceforge.net/tracker/?group_id=61702&atid=498106&func=detail&aid=817813]

"consider it to be spam" isn't something we do, of course :)  Instead, I
created a patch that generates a token reflecting the percentage of words in
the message that are in a particular (English) dictionary.  So exactly one
extra token per message, guaranteed, with a maximum of 100 distinct new
tokens overall.
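
For anyone who wants to picture the idea without the patch, here is a rough
sketch of that kind of token generator.  It is not the patch itself: the
function name, the token format, the word-splitting regex and the handling
of messages with no words are all assumptions made for illustration.

```python
import re

# Assumption: a "word" is a run of ASCII letters/apostrophes; the real
# patch may split words differently.
WORD_RE = re.compile(r"[A-Za-z']+")

def dictionary_percent_token(text, dictionary):
    """Return a single token giving the integer percentage of words in
    `text` that appear in `dictionary` (a set of lowercase words)."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        # Assumption: a message with no words counts as 0% so that one
        # token is still always produced.
        return "english-words:0%"
    known = sum(1 for w in words if w in dictionary)
    percent = 100 * known // len(words)  # integer percentage, rounded down
    return "english-words:%d%%" % percent
```

With a toy dictionary of {"hello", "world"}, the text "Hello world
bfuqclvfphz" has two of three words in the dictionary, so the sketch yields
the token "english-words:66%".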

Results:
-> <stat> tested 357 hams & 395 spams against 3311 hams & 3704 spams
[all other stat lines are more-or-less the same as this, and have been
snipped]

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.246  0.246  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.557  0.557  tied
    0.559  0.279  won    -50.09%
    0.287  0.287  tied
    0.000  0.000  tied

won   1 times
tied  9 times
lost  0 times

total unique fp went from 6 to 5 won    -16.67%
mean fp % went from 0.164881884948 to 0.136948924055 won    -16.94%

false negative percentages
    0.253  0.253  tied
    0.781  0.781  tied
    0.462  0.462  tied
    0.756  0.756  tied
    0.243  0.243  tied
    0.247  0.247  tied
    0.240  0.240  tied
    0.494  0.494  tied
    0.973  0.973  tied
    0.454  0.454  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 20 to 20 tied
mean fn % went from 0.490257037938 to 0.490257037938 tied

ham mean                     ham sdev
   1.18    1.18   +0.00%        7.76    7.72   -0.52%
   0.99    0.99   +0.00%        6.64    6.62   -0.30%
   0.84    0.85   +1.19%        6.14    6.22   +1.30%
   1.99    2.03   +2.01%        9.46    9.61   +1.59%
   0.49    0.51   +4.08%        3.59    3.64   +1.39%
   0.85    0.85   +0.00%        5.45    5.45   +0.00%
   1.16    1.17   +0.86%        9.30    9.29   -0.11%
   1.20    1.20   +0.00%        8.13    8.00   -1.60%
   1.55    1.56   +0.65%        8.05    8.07   +0.25%
   0.47    0.48   +2.13%        3.22    3.28   +1.86%

ham mean and sdev for all runs
   1.08    1.09   +0.93%        7.13    7.14   +0.14%

spam mean                    spam sdev
  98.75   98.77   +0.02%        8.72    8.59   -1.49%
  97.67   97.66   -0.01%       11.26   11.24   -0.18%
  98.08   98.07   -0.01%       10.12   10.09   -0.30%
  98.16   98.17   +0.01%       10.19   10.11   -0.79%
  98.35   98.35   +0.00%        8.77    8.79   +0.23%
  98.45   98.47   +0.02%        8.97    8.83   -1.56%
  98.35   98.36   +0.01%        9.73    9.62   -1.13%
  98.25   98.22   -0.03%        9.16    9.25   +0.98%
  97.93   97.93   +0.00%       11.99   11.90   -0.75%
  98.92   98.92   +0.00%        7.62    7.58   -0.52%

spam mean and sdev for all runs
  98.30   98.30   +0.00%        9.72    9.66   -0.62%

ham/spam mean difference: 97.22 97.21 -0.01

I wondered whether 100 tokens was too many and whether bucketing would help,
so I changed the patch to truncate the percentage to the nearest 10%.  The
cmp.py results are basically the same, but here's a table.py of the three
runs (base, 100 tokens, 10 buckets) - note that with the 100 tokens the
number of unsures went up (178 to 183), but with 10 buckets there was still
the minor fp gain with the same number of unsures as the base run.
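
As a sketch of the bucketing step only: "truncate" is read here as rounding
down to a multiple of 10 (an assumption; rounding to the nearest multiple
would be a different choice), and the function name is made up for this
example.

```python
def bucket_percent(percent, bucket_size=10):
    """Round an integer percentage down to a multiple of bucket_size.

    Assumption: "truncate to the nearest 10%" means flooring, so 66 -> 60.
    """
    return (percent // bucket_size) * bucket_size
```

Under that reading, 66 becomes 60 and 9 becomes 0, so the token vocabulary
shrinks to the values 0, 10, ..., 100.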

filename:        bases        engs      eng10s
ham:spam:    3668:4099   3668:4099   3668:4099
fp total:            6           5           5
fp %:             0.16        0.14        0.14
fn total:           20          20          20
fn %:             0.49        0.49        0.49
unsure t:          178         183         178
unsure %:         2.29        2.36        2.29
real cost:     $115.60     $106.60     $105.60
best cost:      $93.00      $94.20      $93.60
h mean:           1.08        1.09        1.09
h sdev:           7.13        7.14        7.15
s mean:          98.30       98.30       98.30
s sdev:           9.72        9.66        9.68
mean diff:       97.22       97.21       97.21
k:                5.77        5.79        5.78
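
A note on reading the "real cost" row: the figures in the table above are
consistent with charging $10 per false positive, $1 per false negative and
$0.20 per unsure.  The weights below are inferred from the table rather than
taken from the test configuration, so treat them as an assumption; the
function name is hypothetical.

```python
# Assumed weights, inferred from the table above (not read from any config).
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20

def real_cost(fp, fn, unsure):
    """Total cost in dollars for the given fp, fn and unsure counts."""
    return round(fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST, 2)

# Columns of the table above: bases, engs, eng10s.
for name, fp, fn, unsure in [("bases", 6, 20, 178),
                             ("engs", 5, 20, 183),
                             ("eng10s", 5, 20, 178)]:
    print("%-8s $%.2f" % (name, real_cost(fp, fn, unsure)))
```

This makes the trade-off visible: dropping one fp saves $10, while the five
extra unsures in the 100-token run cost $1 of that back.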

Bigram results aren't so great.  With the original 100 buckets, nothing
changes in fp or fn:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.279  0.279  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 1 to 1 tied
mean fp % went from 0.0278551532033 to 0.0278551532033 tied

false negative percentages
    0.253  0.253  tied
    1.042  1.042  tied
    0.693  0.693  tied
    0.252  0.252  tied
    0.728  0.728  tied
    0.000  0.000  tied
    0.481  0.481  tied
    0.494  0.494  tied
    0.730  0.730  tied
    0.227  0.227  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 20 to 20 tied
mean fn % went from 0.489899714703 to 0.489899714703 tied

ham mean                     ham sdev
   0.95    0.98   +3.16%        6.64    6.86   +3.31%
   0.83    0.82   -1.20%        5.53    5.49   -0.72%
   0.49    0.47   -4.08%        4.08    4.10   +0.49%
   1.53    1.55   +1.31%        8.16    8.29   +1.59%
   0.30    0.31   +3.33%        3.25    3.26   +0.31%
   0.70    0.70   +0.00%        5.27    5.27   +0.00%
   0.85    0.83   -2.35%        7.11    7.06   -0.70%
   0.93    0.90   -3.23%        7.23    7.02   -2.90%
   0.90    0.88   -2.22%        6.47    6.36   -1.70%
   0.41    0.41   +0.00%        4.07    4.07   +0.00%

ham mean and sdev for all runs
   0.80    0.79   -1.25%        6.01    6.01   +0.00%

spam mean                    spam sdev
  98.71   98.74   +0.03%        7.83    7.78   -0.64%
  97.38   97.36   -0.02%       12.55   12.54   -0.08%
  97.78   97.77   -0.01%       11.09   11.06   -0.27%
  97.89   97.87   -0.02%       10.49   10.49   +0.00%
  97.90   97.94   +0.04%       10.03    9.97   -0.60%
  98.32   98.29   -0.03%        8.63    8.74   +1.27%
  98.19   98.21   +0.02%       10.21   10.12   -0.88%
  97.68   97.56   -0.12%       10.99   11.18   +1.73%
  97.86   97.88   +0.02%       11.56   11.56   +0.00%
  98.73   98.72   -0.01%        7.57    7.65   +1.06%

spam mean and sdev for all runs
  98.05   98.04   -0.01%       10.20   10.21   +0.10%

ham/spam mean difference: 97.25 97.25 +0.00

And with 10 buckets:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.279  0.279  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 1 to 1 tied
mean fp % went from 0.0278551532033 to 0.0278551532033 tied

false negative percentages
    0.253  0.253  tied
    1.042  1.042  tied
    0.693  0.693  tied
    0.252  0.252  tied
    0.728  0.728  tied
    0.000  0.000  tied
    0.481  0.721  lost   +49.90%
    0.494  0.741  lost   +50.00%
    0.730  0.730  tied
    0.227  0.227  tied

won   0 times
tied  8 times
lost  2 times

total unique fn went from 20 to 22 lost   +10.00%
mean fn % went from 0.489899714703 to 0.538629534266 lost    +9.95%

ham mean                     ham sdev
   0.95    0.98   +3.16%        6.64    6.86   +3.31%
   0.83    0.81   -2.41%        5.53    5.48   -0.90%
   0.49    0.47   -4.08%        4.08    4.07   -0.25%
   1.53    1.55   +1.31%        8.16    8.24   +0.98%
   0.30    0.31   +3.33%        3.25    3.26   +0.31%
   0.70    0.70   +0.00%        5.27    5.28   +0.19%
   0.85    0.84   -1.18%        7.11    7.07   -0.56%
   0.93    0.90   -3.23%        7.23    7.14   -1.24%
   0.90    0.91   +1.11%        6.47    6.50   +0.46%
   0.41    0.42   +2.44%        4.07    4.15   +1.97%

ham mean and sdev for all runs
   0.80    0.80   +0.00%        6.01    6.03   +0.33%

spam mean                    spam sdev
  98.71   98.74   +0.03%        7.83    7.82   -0.13%
  97.38   97.39   +0.01%       12.55   12.57   +0.16%
  97.78   97.80   +0.02%       11.09   11.10   +0.09%
  97.89   97.91   +0.02%       10.49   10.42   -0.67%
  97.90   97.96   +0.06%       10.03    9.92   -1.10%
  98.32   98.29   -0.03%        8.63    8.80   +1.97%
  98.19   98.20   +0.01%       10.21   10.23   +0.20%
  97.68   97.58   -0.10%       10.99   11.30   +2.82%
  97.86   97.90   +0.04%       11.56   11.41   -1.30%
  98.73   98.73   +0.00%        7.57    7.63   +0.79%

spam mean and sdev for all runs
  98.05   98.05   +0.00%       10.20   10.22   +0.20%

ham/spam mean difference: 97.25 97.25 +0.00

And a table.py of the three bigram runs, to show the unsures:

filename:      basebis     eng_bis   eng_bi10s
ham:spam:    3668:4099   3668:4099   3668:4099
fp total:            1           1           1
fp %:             0.03        0.03        0.03
fn total:           20          20          22
fn %:             0.49        0.49        0.54
unsure t:          207         209         206
unsure %:         2.67        2.69        2.65
real cost:      $71.40      $71.80      $73.20
best cost:      $65.60      $64.00      $64.00
h mean:           0.80        0.79        0.80
h sdev:           6.01        6.01        6.03
s mean:          98.05       98.04       98.05
s sdev:          10.20       10.21       10.22
mean diff:       97.25       97.25       97.25
k:                6.00        6.00        5.98

If you want to test this with the same dictionary I used, you can get it
here:
<http://www.massey.ac.nz/~tameyer/research/english_words.txt> (1.5Mb).  It
was just a random one I found, though - I'm not claiming that it's fantastic
or anything :)

=Tony Meyer