[Spambayes] Re: On counting words more than once

Guido van Rossum guido@python.org
Sun, 29 Sep 2002 19:22:15 -0400


> Because Neil, Guido and I all reported improvement via counting
> duplicate words (within a message) only once during training, I
> removed the recent option for trying this, and we do this all the
> time now.  The checkin comment is below.  Note that you may need to
> change spam_cutoff!

I ran this comparison on my full corpus.  It lowered the mean
false positive (f-p) rate by about 15.7% in relative terms, while
raising the mean false negative (f-n) rate by only about 2.3%.  So I
agree it's a win!
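
For readers who have not followed the thread, here is a minimal sketch of
what "counting duplicate words only once during training" could look like.
It is not the actual Spambayes code: the `train` function, the `count_once`
flag, and the whitespace tokenizer are hypothetical names chosen for
illustration.

```python
from collections import Counter


def train(counts, tokens, count_once=True):
    """Update per-word message counts from one message's tokens.

    Hypothetical helper, not the Spambayes API.
    """
    # With count_once, a word repeated within a single message adds 1
    # to its count no matter how many times it appears in that message.
    counts.update(set(tokens) if count_once else tokens)


# Assumed toy tokenizer: split on whitespace.
msg = "buy now buy cheap buy".split()

once, every = Counter(), Counter()
train(once, msg)                     # each distinct word counted once
train(every, msg, count_once=False)  # every occurrence counted
```

In this toy message, "buy" ends up with a count of 1 under the count-once
scheme and 3 otherwise, which is the difference the comparison measures.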

false positive percentages
    0.109  0.085  won    -22.02%
    0.036  0.024  won    -33.33%
    0.084  0.060  won    -28.57%
    0.060  0.060  tied          
    0.120  0.084  won    -30.00%
    0.121  0.109  won     -9.92%
    0.037  0.037  tied          
    0.061  0.049  won    -19.67%
    0.048  0.048  tied          
    0.095  0.095  tied          

won   6 times
tied  4 times
lost  0 times

total unique fp went from 64 to 54 won    -15.62%
mean fp % went from 0.0770443274685 to 0.0649864716148 won    -15.65%

false negative percentages
    0.251  0.376  lost   +49.80%
    0.311  0.249  won    -19.94%
    0.128  0.128  tied          
    0.315  0.252  won    -20.00%
    0.278  0.340  lost   +22.30%
    0.314  0.283  won     -9.87%
    0.398  0.459  lost   +15.33%
    0.154  0.154  tied          
    0.190  0.190  tied          
    0.365  0.334  won     -8.49%

won   4 times
tied  3 times
lost  3 times

total unique fn went from 87 to 89 lost    +2.30%
mean fn % went from 0.270497065577 to 0.276634600712 lost    +2.27%
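
The won/tied/lost column and the percentage beside it appear to be the
relative change from the first rate to the second.  A small sketch of that
arithmetic follows; the `compare` function is a hypothetical name, not the
comparison script's real interface, and it assumes the "before" rate is
nonzero.

```python
def compare(before, after):
    """Return (verdict, percent_change) for a pair of error rates.

    Hypothetical helper; assumes before != 0.
    """
    if after == before:
        return "tied", 0.0
    # Relative change, in percent, from the old rate to the new one.
    change = (after - before) / before * 100.0
    # A lower error rate counts as a win for the new scheme.
    verdict = "won" if after < before else "lost"
    return verdict, round(change, 2)


fp_first = compare(0.109, 0.085)   # first false positive row
fn_first = compare(0.251, 0.376)   # first false negative row
```

Applied to the first row of each table, this gives a win of roughly -22.02%
for false positives and a loss of roughly +49.80% for false negatives,
matching the figures above.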

--Guido van Rossum (home page: http://www.python.org/~guido/)