[Spambayes] Result of a test
Remi Ricard
papaDoc@videotron.ca
Thu, 03 Oct 2002 23:16:09 -0400
Hi,
> >
> > I think it can be interesting to try to remove the punctuation (the . ,
> > ? !) at the end of a word
> > and then count it as the same word and do the same thing with the
> > plural (luncheon and luncheons) based
> > on a dictionary like the one in ispell.
>
> Tim played with this very early in the project. Turned out that keeping
> punctuation, preserving case, and not stemming, were all wins. A bit
> counter-intuitive, but there you go. Experiment beats intuition every
> time in this project.
I read the comments in tokenizer.py and saw that this had already been
tried. Sorry...
So I tried something else ;-)
Since spam wants to catch your attention, it uses '?' and '!' very often.
So I strip only a trailing ',', '.' or ':' from each word.
This is the patch:
    # Tokenize everything in the body.
    for w in text.split():
        n = len(w)
        # Make sure this range matches in tokenize_word().
        if 3 <= n <= 12:
            # Strip one trailing ',', '.' or ':' (but keep '?' and '!').
            if w[-1] == ',' or w[-1] == '.' or w[-1] == ':':
                w = w[:-1]
            yield w
        elif n >= 3:
            for t in tokenize_word(w):
                yield t
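
For anyone who wants to try the idea outside the full tokenizer, here is a
minimal standalone sketch of the same loop. The function name tokenize_body()
and the tokenize_word() stand-in are hypothetical placeholders for
illustration only; the real tokenize_word() in tokenizer.py is not shown in
this message and does more than this.

```python
def tokenize_word(word):
    # Hypothetical stand-in for the real tokenize_word() in tokenizer.py;
    # it only marks that a long word was seen.
    yield "long:%d" % len(word)


def tokenize_body(text):
    # Hypothetical wrapper around the patched loop.
    for w in text.split():
        n = len(w)
        if 3 <= n <= 12:
            # Drop one trailing ',', '.' or ':' but keep '?' and '!'.
            if w[-1] in ",.:":
                w = w[:-1]
            yield w
        elif n >= 3:
            for t in tokenize_word(w):
                yield t


tokens = list(tokenize_body("Hello, world. Buy now!!! Really? to: it"))
print(tokens)
```

Note that the length check happens before the strip, so a three-character
word such as "to:" comes out as the two-character token "to", while a bare
two-character word is still skipped.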
Please don't flame me; this is my first modification of Python code.
I'm more of a C and C++ guy....
This is the result:
run1s -> run2s
-> <stat> tested 225 hams & 279 spams against 941 hams & 1113 spams
-> <stat> tested 242 hams & 275 spams against 924 hams & 1117 spams
-> <stat> tested 251 hams & 298 spams against 915 hams & 1094 spams
-> <stat> tested 230 hams & 272 spams against 936 hams & 1120 spams
-> <stat> tested 218 hams & 268 spams against 948 hams & 1124 spams
-> <stat> tested 225 hams & 279 spams against 941 hams & 1113 spams
-> <stat> tested 242 hams & 275 spams against 924 hams & 1117 spams
-> <stat> tested 251 hams & 298 spams against 915 hams & 1094 spams
-> <stat> tested 230 hams & 272 spams against 936 hams & 1120 spams
-> <stat> tested 218 hams & 268 spams against 948 hams & 1124 spams
false positive percentages
    0.889  0.444  won    -50.06%
    0.826  1.240  lost   +50.12%
    1.594  1.594  tied
    1.304  1.304  tied
    0.000  0.000  tied

won   1 times
tied  3 times
lost  1 times

total unique fp went from 11 to 11 tied
mean fp % went from 0.922661698796 to 0.916417438007 won     -0.68%

false negative percentages
    0.717  0.717  tied
    0.727  0.364  won    -49.93%
    1.342  1.678  lost   +25.04%
    0.000  0.368  lost  +(was 0)
    0.746  0.373  won    -50.00%

won   2 times
tied  1 times
lost  2 times

total unique fn went from 10 to 10 tied
mean fn % went from 0.706533828263 to 0.699823195589 won     -0.95%

ham mean                     ham sdev
   24.18   24.58   +1.65%        9.24    8.93   -3.35%
   25.70   26.23   +2.06%        8.47    8.21   -3.07%
   25.51   25.87   +1.41%        9.12    8.90   -2.41%
   25.01   25.34   +1.32%        8.08    8.07   -0.12%
   24.93   25.36   +1.72%        8.27    8.17   -1.21%

ham mean and sdev for all runs
   25.08   25.49   +1.63%        8.67    8.49   -2.08%

spam mean                    spam sdev
   80.43   79.91   -0.65%        8.79    8.78   -0.11%
   79.72   79.38   -0.43%        8.30    8.12   -2.17%
   79.67   79.25   -0.53%        8.83    8.69   -1.59%
   80.09   79.73   -0.45%        8.15    8.17   +0.25%
   79.84   79.48   -0.45%        9.35    9.07   -2.99%

spam mean and sdev for all runs
   79.95   79.55   -0.50%        8.70    8.58   -1.38%

ham/spam mean difference: 54.87 54.06 -0.81
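
To put the won/lost percentages in perspective, here is a small sketch that
turns the rates back into message counts. It assumes each percentage is taken
over the tested-set size from the matching <stat> line; the helper names are
mine and are not part of the test scripts.

```python
# Tested-set sizes from the <stat> lines, rates from the output above.
hams_tested = [225, 242, 251, 230, 218]
spams_tested = [279, 275, 298, 272, 268]
fp_before = [0.889, 0.826, 1.594, 1.304, 0.000]
fp_after = [0.444, 1.240, 1.594, 1.304, 0.000]
fn_before = [0.717, 0.727, 1.342, 0.000, 0.746]
fn_after = [0.717, 0.364, 1.678, 0.368, 0.373]


def counts(rates, sizes):
    # Recover whole-message counts from percentages (assumed rate = count / size).
    return [int(round(r * n / 100.0)) for r, n in zip(rates, sizes)]


def relative_change(before, after):
    # Percent change between two rates; None when the baseline is 0.
    if before == 0:
        return None
    return (after - before) / before * 100.0


fp_counts_before = counts(fp_before, hams_tested)
fp_counts_after = counts(fp_after, hams_tested)
fn_counts_before = counts(fn_before, spams_tested)
fn_counts_after = counts(fn_after, spams_tested)

print("fp:", fp_counts_before, "->", fp_counts_after)
print("fn:", fn_counts_before, "->", fn_counts_after)
print("first fp run: %+.2f%%" % relative_change(fp_before[0], fp_after[0]))
```

Under that assumption, every run that won or lost differs by exactly one
message, so the large per-run percentages come from very small counts, and
the totals stay at 11 fp and 10 fn.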
papaDoc