[spambayes-dev] removing punctuation redux
Skip Montanaro
skip at pobox.com
Wed Oct 29 11:27:47 EST 2003
Just as I was nodding off to sleep last night I realized I hadn't done the
punctuation removal right. I was stripping punctuation after testing word
length, so (for example) a nine-letter word containing four punctuation
characters was evaluated as a "long" word. I redid it this morning so the
punctuation stripping occurs before the length test and got different
results (a bit better overall, I think). Here's the summary (minus the first
block). All rounds scored 500 against 4500 (using the same training sets as
yesterday).
stds.txt -> puncts.txt
false positive percentages
0.200 0.200 tied
0.200 0.400 lost +100.00%
0.200 0.200 tied
0.400 0.400 tied
1.000 1.000 tied
0.400 0.400 tied
0.000 0.000 tied
0.200 0.200 tied
0.400 0.400 tied
0.400 0.400 tied
won 0 times
tied 9 times
lost 1 times
total unique fp went from 17 to 18 lost +5.88%
mean fp % went from 0.34 to 0.36 lost +5.88%
false negative percentages
2.000 2.000 tied
0.800 1.000 lost +25.00%
2.200 2.200 tied
2.200 2.400 lost +9.09%
2.200 2.000 won -9.09%
1.200 1.200 tied
2.000 1.800 won -10.00%
2.000 1.800 won -10.00%
1.600 1.600 tied
1.200 0.800 won -33.33%
won 4 times
tied 4 times
lost 2 times
total unique fn went from 87 to 84 won -3.45%
mean fn % went from 1.74 to 1.68 won -3.45%
ham mean                     ham sdev
   4.70    4.35   -7.45%       14.04   13.40   -4.56%
   6.12    6.06   -0.98%       16.43   16.55   +0.73%
   4.14    4.01   -3.14%       12.86   12.62   -1.87%
   4.64    4.56   -1.72%       14.35   14.23   -0.84%
   5.49    5.31   -3.28%       17.21   16.99   -1.28%
   5.21    5.03   -3.45%       16.10   15.81   -1.80%
   3.74    3.62   -3.21%       12.19   12.03   -1.31%
   3.38    3.30   -2.37%       11.70   11.51   -1.62%
   5.50    5.46   -0.73%       17.07   17.09   +0.12%
   5.44    5.27   -3.13%       15.85   15.57   -1.77%

ham mean and sdev for all runs
   4.84    4.70   -2.89%       14.93   14.74   -1.27%

spam mean                    spam sdev
  89.47   89.61   +0.16%       20.40   20.36   -0.20%
  91.15   91.22   +0.08%       17.17   17.12   -0.29%
  91.04   90.97   -0.08%       20.00   20.22   +1.10%
  89.39   89.74   +0.39%       20.88   20.73   -0.72%
  89.56   89.70   +0.16%       19.92   19.72   -1.00%
  90.72   90.78   +0.07%       18.41   18.48   +0.38%
  90.97   91.09   +0.13%       18.66   18.54   -0.64%
  91.66   91.85   +0.21%       18.20   17.90   -1.65%
  91.58   91.44   -0.15%       18.03   18.24   +1.16%
  92.24   92.25   +0.01%       16.18   16.22   +0.25%

spam mean and sdev for all runs
  90.78   90.87   +0.10%       18.86   18.83   -0.16%

ham/spam mean difference: 85.94 86.17 +0.23
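
For anyone reading the comparison above for the first time: the percentage on each line appears to be the change from the first column (stds.txt) to the second (puncts.txt), relative to the first, and for the error rates a lower number is the better one. The sketch below reproduces that arithmetic for a couple of the summary lines. The function names are made up for illustration, and this is not the code from cmp.py.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined when before == 0")
    return (after - before) * 100.0 / before


def verdict(before, after):
    """Label an error-rate comparison; lower error rates are better."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


# total unique fp went from 17 to 18
fp_change = pct_change(17, 18)
# total unique fn went from 87 to 84
fn_change = pct_change(87, 84)

print("fp: %s %+.2f%%" % (verdict(17, 18), fp_change))
print("fn: %s %+.2f%%" % (verdict(87, 84), fn_change))
```

The last line of the output is different: it reports an absolute difference between the spam and ham means (90.78 - 4.84 = 85.94 before, 90.87 - 4.70 = 86.17 after), not a percentage.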
The new context diff is attached.
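
For readers who don't have the attachment handy, the ordering problem described at the top of this message can be sketched roughly as follows. This is only an illustration, not the attached diff and not the Spambayes tokenizer: the 12-character cutoff, the "long"/"word" labels, and all function names are assumptions made for the example.

```python
import string

# Hypothetical length cutoff, chosen only for this illustration.
MAX_WORD_LEN = 12

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(word):
    """Return `word` with all ASCII punctuation removed."""
    return word.translate(_PUNCT_TABLE)


def classify_length_first(word):
    """Wrong order: the length test sees the punctuation."""
    if len(word) > MAX_WORD_LEN:
        return ("long", word)
    return ("word", strip_punctuation(word))


def classify_strip_first(word):
    """Corrected order: strip punctuation, then test the length."""
    stripped = strip_punctuation(word)
    if len(stripped) > MAX_WORD_LEN:
        return ("long", stripped)
    return ("word", stripped)


# Nine letters plus four punctuation characters: 13 characters in all.
sample = "**discounts!!"
print(classify_length_first(sample))
print(classify_strip_first(sample))
```

With the length test first, the 13-character sample is treated as a long word even though only nine letters remain after stripping. With the stripping first, it is handled as an ordinary word.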
Would someone please try this with their training database? The shell
script I used to run the test is:
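#!/bin/sh
# usage: sh runtest BASE TRIAL
# BASE.ini and TRIAL.ini must exist in the current directory

# point SBDIR at Spambayes root directory
SBDIR=$HOME/src/spambayes

# everything below here should be okay
TIMCV=$SBDIR/testtools/timcv.py
RATE=$SBDIR/testtools/rates.py
CMP=$SBDIR/testtools/cmp.py

base=$1
trial=$2

# run the same cross-validation (same seed) under each set of options
BAYESCUSTOMIZE=$base.ini ; export BAYESCUSTOMIZE
python "$TIMCV" -n 10 -s 12345 > "$base.txt"

BAYESCUSTOMIZE=$trial.ini ; export BAYESCUSTOMIZE
python "$TIMCV" -n 10 -s 12345 > "$trial.txt"

# summarize each run; the summaries land in ${base}s.txt and ${trial}s.txt
python "$RATE" "$base.txt"
python "$RATE" "$trial.txt"

# compare the two summaries
python "$CMP" "${base}s.txt" "${trial}s.txt" > "$base-$trial.txt"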
So I created an empty std.ini and a punct.ini with this:
[Tokenizer]
remove_punctuation: True
and executed
sh runtest std punct
Thx,
Skip
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diff
Type: application/octet-stream
Size: 2408 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031029/871de894/sb-0001.obj