[spambayes-dev] removing punctuation - no apparent benefit
Skip Montanaro
skip at pobox.com
Tue Oct 28 17:26:29 EST 2003
I tried removing punctuation from words as we discussed a week or so ago.
No overall change, though per/hap.s we just ne.ed to see more mes*sages with
em%bedded p-unctua-tion. In general, the ham means decreased slightly and
the spam means increased slightly while the standard deviations of both
groups decreased a little.
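For anyone reading without the attachment, the idea can be sketched roughly as follows. This is not the code in sb.diff: the names `strip_punctuation` and `normalize_tokens` are hypothetical, and treating `string.punctuation` as the set of characters to drop is an assumption made for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: "punctuation" means string.punctuation; the actual patch
# may use a different character set.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)


def strip_punctuation(word):
    """Return word with embedded ASCII punctuation removed."""
    return word.translate(_DELETE_PUNCT)


def normalize_tokens(words):
    """Yield each word with punctuation stripped, skipping empty results."""
    for word in words:
        cleaned = strip_punctuation(word)
        if cleaned:
            yield cleaned
```

With this sketch, a word such as "mes*sages" collapses to "messages", so obfuscated and plain spellings would share one token.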
stds.txt -> puncts.txt
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
-> <stat> tested 500 hams & 500 spams against 4500 hams & 4500 spams
false positive percentages
    0.200  0.200  tied
    0.200  0.200  tied
    0.200  0.200  tied
    0.400  0.400  tied
    1.000  1.000  tied
    0.400  0.400  tied
    0.000  0.000  tied
    0.200  0.200  tied
    0.400  0.400  tied
    0.400  0.400  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 17 to 17 tied
mean fp % went from 0.34 to 0.34 tied

false negative percentages
    2.000  2.000  tied
    0.800  0.800  tied
    2.200  2.200  tied
    2.200  2.200  tied
    2.200  2.200  tied
    1.200  1.200  tied
    2.000  2.000  tied
    2.000  2.000  tied
    1.600  1.600  tied
    1.200  1.200  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 87 to 87 tied
mean fn % went from 1.74 to 1.74 tied
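
The summary lines can be cross-checked by hand. The sketch below assumes each of the ten runs scored 500 hams and 500 spams (as the "tested" lines state) and that no message is counted in more than one run. The variable and function names are illustrative only.

```python
# Per-run error percentages copied from the listings above
# (the before and after columns are identical, so one list each suffices).
fp_pcts = [0.2, 0.2, 0.2, 0.4, 1.0, 0.4, 0.0, 0.2, 0.4, 0.4]
fn_pcts = [2.0, 0.8, 2.2, 2.2, 2.2, 1.2, 2.0, 2.0, 1.6, 1.2]

PER_RUN = 500  # hams (or spams) scored in each run


def mean_pct(pcts):
    """Average of the per-run percentages, rounded to two places."""
    return round(sum(pcts) / len(pcts), 2)


def total_errors(pcts, per_run=PER_RUN):
    """Convert each percentage back to a message count and add them up."""
    return sum(round(p * per_run / 100) for p in pcts)


print(mean_pct(fp_pcts), total_errors(fp_pcts))
print(mean_pct(fn_pcts), total_errors(fn_pcts))
```

Under those assumptions the counts come to 17 false positives and 87 false negatives, with means of 0.34 and 1.74, which agrees with the totals reported above.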

ham mean                     ham sdev
   4.70    4.53   -3.62%       14.04   13.69   -2.49%
   6.12    6.13   +0.16%       16.43   16.48   +0.30%
   4.14    4.07   -1.69%       12.86   12.73   -1.01%
   4.64    4.62   -0.43%       14.35   14.34   -0.07%
   5.49    5.41   -1.46%       17.21   17.16   -0.29%
   5.21    5.16   -0.96%       16.10   16.01   -0.56%
   3.74    3.70   -1.07%       12.19   12.12   -0.57%
   3.38    3.34   -1.18%       11.70   11.68   -0.17%
   5.50    5.48   -0.36%       17.07   17.05   -0.12%
   5.44    5.38   -1.10%       15.85   15.82   -0.19%

ham mean and sdev for all runs
   4.84    4.78   -1.24%       14.93   14.86   -0.47%

spam mean                    spam sdev
  89.47   89.57   +0.11%       20.40   20.37   -0.15%
  91.15   91.24   +0.10%       17.17   17.02   -0.87%
  91.04   91.03   -0.01%       20.00   20.03   +0.15%
  89.39   89.54   +0.17%       20.88   20.76   -0.57%
  89.56   89.63   +0.08%       19.92   19.91   -0.05%
  90.72   90.77   +0.06%       18.41   18.36   -0.27%
  90.97   91.03   +0.07%       18.66   18.60   -0.32%
  91.66   91.78   +0.13%       18.20   18.08   -0.66%
  91.58   91.61   +0.03%       18.03   18.03   +0.00%
  92.24   92.30   +0.07%       16.18   16.18   +0.00%

spam mean and sdev for all runs
  90.78   90.85   +0.08%       18.86   18.81   -0.27%

ham/spam mean difference: 85.94 86.07 +0.13
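
Each row above reads as before, after, and percent change. As a rough check of the "all runs" lines, the sketch below recomputes the changes from the rounded figures shown. The helper `pct_change` is a hypothetical name, and the comparison tool presumably works from unrounded values, so per-row results recomputed this way could differ in the last digit.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent, to two places."""
    return round(100.0 * (after - before) / before, 2)


# (before, after) pairs taken from the "all runs" summary lines.
ham_mean = (4.84, 4.78)
spam_mean = (90.78, 90.85)

ham_change = pct_change(*ham_mean)
spam_change = pct_change(*spam_mean)

# Separation between the spam and ham score means, before and after.
gap_before = round(spam_mean[0] - ham_mean[0], 2)
gap_after = round(spam_mean[1] - ham_mean[1], 2)
gap_delta = round(gap_after - gap_before, 2)

print(ham_change, spam_change, gap_before, gap_after, gap_delta)
```

The gap widens only from 85.94 to 86.07, which is consistent with the error rates not moving at all.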

The training database is just the collection of ham and spam I currently
use every day.
Attached is the context diff to tokenizer.py and Options.py should anyone
care to check my work for mistakes or tweak it.
Skip
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diff
Type: application/octet-stream
Size: 2675 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031028/f47f3299/sb.obj