[Spambayes] Result of a test

Thu, 03 Oct 2002 23:16:09 -0400

Hi,

> > 
> > I think it can be interesting to try to remove the punctuation (the . , 
> > ? !) at the end of a word
> > and then count it as the same word and do the same thing with the 
> > plural (luncheon and luncheons) based
> > on a dictionary like the one in ispell.
> 
> Tim played with this very early in the project.  Turned out that keeping
> punctuation, preserving case, and not stemming, were all wins.  A bit
> counter-intuitive, but there you go.  Experiment beats intuition every
> time in this project.

I read the comments in tokenizer.py and saw that this had already been
tried. Sorry...

So I tried something else ;-) 
Since spam wants to catch your attention, it uses ? and ! very often. So
I strip only a trailing ',', '.' or ':' and leave '?' and '!' in place.

This is the patch:
            # Tokenize everything in the body.
            for w in text.split():
                n = len(w)
                # Make sure this range matches in tokenize_word().
                if 3 <= n <= 12:
                    # Strip one trailing ',', '.' or ':'; keep '?' and '!'.
                    if w[-1] in ',.:':
                        w = w[:-1]
                    yield w

                elif n >= 3:
                    for t in tokenize_word(w):
                        yield t

Please don't flame me; this is my first modification of Python code.
I'm more of a C and C++ guy...
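
To try the idea outside the tokenizer, here is a small standalone sketch of
the same stripping step. The names strip_trailing_punct and tokenize_body are
made up for this example and are not part of tokenizer.py. Words longer than
12 characters are simply skipped here, whereas the real code hands them to
tokenize_word().

```python
def strip_trailing_punct(word, chars=",.:"):
    """Drop a single trailing character if it is in `chars`.

    '?' and '!' are deliberately not in the default set, so they survive.
    """
    if word and word[-1] in chars:
        return word[:-1]
    return word


def tokenize_body(text, min_len=3, max_len=12):
    """Yield whitespace-separated words whose length is in [min_len, max_len].

    As in the patch, the length is checked before the punctuation is removed.
    Hypothetical helper: longer words are skipped instead of being passed on.
    """
    for w in text.split():
        if min_len <= len(w) <= max_len:
            yield strip_trailing_punct(w)


tokens = list(tokenize_body("Hello, world. Buy now!!! Free: money?"))
print(tokens)
```

Note that only one trailing character is removed, so a word like "wait..."
would still end in "..".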

This is the result:
run1s -> run2s
-> <stat> tested 225 hams & 279 spams against 941 hams & 1113 spams
-> <stat> tested 242 hams & 275 spams against 924 hams & 1117 spams
-> <stat> tested 251 hams & 298 spams against 915 hams & 1094 spams
-> <stat> tested 230 hams & 272 spams against 936 hams & 1120 spams
-> <stat> tested 218 hams & 268 spams against 948 hams & 1124 spams
-> <stat> tested 225 hams & 279 spams against 941 hams & 1113 spams
-> <stat> tested 242 hams & 275 spams against 924 hams & 1117 spams
-> <stat> tested 251 hams & 298 spams against 915 hams & 1094 spams
-> <stat> tested 230 hams & 272 spams against 936 hams & 1120 spams
-> <stat> tested 218 hams & 268 spams against 948 hams & 1124 spams

false positive percentages
    0.889  0.444  won    -50.06%
    0.826  1.240  lost   +50.12%
    1.594  1.594  tied          
    1.304  1.304  tied          
    0.000  0.000  tied          

won   1 times
tied  3 times
lost  1 times

total unique fp went from 11 to 11 tied          
mean fp % went from 0.922661698796 to 0.916417438007 won     -0.68%

false negative percentages
    0.717  0.717  tied          
    0.727  0.364  won    -49.93%
    1.342  1.678  lost   +25.04%
    0.000  0.368  lost  +(was 0)
    0.746  0.373  won    -50.00%

won   2 times
tied  1 times
lost  2 times

total unique fn went from 10 to 10 tied          
mean fn % went from 0.706533828263 to 0.699823195589 won     -0.95%

ham mean                     ham sdev
  24.18   24.58   +1.65%        9.24    8.93   -3.35%
  25.70   26.23   +2.06%        8.47    8.21   -3.07%
  25.51   25.87   +1.41%        9.12    8.90   -2.41%
  25.01   25.34   +1.32%        8.08    8.07   -0.12%
  24.93   25.36   +1.72%        8.27    8.17   -1.21%

ham mean and sdev for all runs
  25.08   25.49   +1.63%        8.67    8.49   -2.08%

spam mean                    spam sdev
  80.43   79.91   -0.65%        8.79    8.78   -0.11%
  79.72   79.38   -0.43%        8.30    8.12   -2.17%
  79.67   79.25   -0.53%        8.83    8.69   -1.59%
  80.09   79.73   -0.45%        8.15    8.17   +0.25%
  79.84   79.48   -0.45%        9.35    9.07   -2.99%

spam mean and sdev for all runs
  79.95   79.55   -0.50%        8.70    8.58   -1.38%

ham/spam mean difference: 54.87 54.06 -0.81
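
For anyone who wants to check the summary numbers by hand, this sketch redoes
the arithmetic from the figures printed above. It is only my reading of how
the columns relate, not the code of the comparison script, and it works from
the rounded values shown, so the last digit can differ slightly from the
output.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Per-run false positive percentages (before, after), copied from the table.
fp_before = [0.889, 0.826, 1.594, 1.304, 0.000]
fp_after = [0.444, 1.240, 1.594, 1.304, 0.000]

mean_fp_before = mean(fp_before)
mean_fp_after = mean(fp_after)
fp_change = percent_change(mean_fp_before, mean_fp_after)

# Ham/spam separation: spam mean minus ham mean, over all runs.
sep_before = 79.95 - 25.08
sep_after = 79.55 - 25.49
sep_delta = sep_after - sep_before

print("mean fp %%: %.3f -> %.3f (%+.2f%%)" % (mean_fp_before, mean_fp_after, fp_change))
print("separation: %.2f -> %.2f (%+.2f)" % (sep_before, sep_after, sep_delta))
```

The separation between the ham and spam means shrinks a little (about 0.8
points), while the error rates barely move.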

papaDoc