[Spambayes] Re: On counting words more than once
Guido van Rossum
guido@python.org
Sun, 29 Sep 2002 19:22:15 -0400
> Because Neil, Guido and I all reported improvement via counting
> duplicate words (within a message) only once during training, I
> removed the recent option for trying this, and we do this all the
> time now. The checkin comment is below. Note that you may need to
> change spam_cutoff!
I ran this comparison on my full corpus, and it made a significant
improvement in the f-p rate (mean down about 16%), while increasing
the f-n rate by a tiny fraction (mean up about 2%). So I agree it's
a win!
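For readers who want to see what "counting duplicate words only once during training" amounts to, here is a minimal sketch. It is not the actual Spambayes code: the function name train_counts, the count_once flag and the toy messages are all made up for illustration, and a real classifier would tokenize mail and keep separate spam and ham counts.

```python
from collections import Counter

def train_counts(messages, count_once=True):
    """Tally token counts over an iterable of already-tokenized messages.

    Hypothetical helper, not part of Spambayes. With count_once=True a
    token adds at most 1 to the tally per message, however often it
    repeats within that message.
    """
    counts = Counter()
    for tokens in messages:
        # set() collapses repeats within a single message
        counts.update(set(tokens) if count_once else tokens)
    return counts

# Toy data (assumed): two "spam" messages, already split into words.
spam = [["free", "free", "free", "money"], ["free", "offer"]]

once = train_counts(spam, count_once=True)
every = train_counts(spam, count_once=False)
```

With the toy data, "free" is tallied twice when counted once per message (it occurs in two messages) and four times when every occurrence counts.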
false positive percentages
    before  after   result  change
    0.109   0.085   won     -22.02%
    0.036   0.024   won     -33.33%
    0.084   0.060   won     -28.57%
    0.060   0.060   tied
    0.120   0.084   won     -30.00%
    0.121   0.109   won      -9.92%
    0.037   0.037   tied
    0.061   0.049   won     -19.67%
    0.048   0.048   tied
    0.095   0.095   tied

won   6 times
tied  4 times
lost  0 times

total unique fp went from 64 to 54 won -15.62%
mean fp % went from 0.0770443274685 to 0.0649864716148 won -15.65%

false negative percentages
    before  after   result  change
    0.251   0.376   lost    +49.80%
    0.311   0.249   won     -19.94%
    0.128   0.128   tied
    0.315   0.252   won     -20.00%
    0.278   0.340   lost    +22.30%
    0.314   0.283   won      -9.87%
    0.398   0.459   lost    +15.33%
    0.154   0.154   tied
    0.190   0.190   tied
    0.365   0.334   won      -8.49%

won   4 times
tied  3 times
lost  3 times

total unique fn went from 87 to 89 lost +2.30%
mean fn % went from 0.270497065577 to 0.276634600712 lost +2.27%
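
The won/lost percentages in the listings above are consistent with a plain relative change between the two runs. The sketch below reproduces that arithmetic; relative_change and verdict are illustrative names of my own, not functions from the comparison script that produced the output.

```python
def relative_change(before, after):
    """Percentage change from before to after; negative means the rate dropped."""
    return (after - before) / before * 100.0

def verdict(before, after):
    """Label a run pair the way the listings do: a lower error rate wins."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"

# Totals taken from the listings above.
fp_change = relative_change(64, 54)   # unique false positives
fn_change = relative_change(87, 89)   # unique false negatives
```

Here fp_change comes out to -15.625 and fn_change to roughly +2.30, which lines up with the "total unique" rows.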
--Guido van Rossum (home page: http://www.python.org/~guido/)