[Spambayes] On counting words more than once

Neil Schemenauer nas@python.ca
Sat, 28 Sep 2002 11:49:16 -0700


Tim Peters wrote:
> Testers?  You can try this by enabling the new option
> 
> [Classifier]
> count_duplicates_only_once_in_training: True
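
For anyone skimming the thread, here is a minimal sketch of what this option is assumed to change: with it on, a word repeated inside one training message bumps that word's count once rather than once per occurrence. This is not the Spambayes classifier code; `train_counts` and its `count_duplicates_only_once` parameter are hypothetical names used only for illustration.

```python
from collections import Counter


def train_counts(messages, count_duplicates_only_once=False):
    """Return a word -> count mapping over tokenized messages.

    Hypothetical helper, not part of Spambayes. Each message is a
    list of word tokens.
    """
    counts = Counter()
    for tokens in messages:
        # With the option on, collapse repeats within a single message
        # so each distinct word contributes at most 1 per message.
        words = set(tokens) if count_duplicates_only_once else tokens
        counts.update(words)
    return counts


# Two made-up spam messages; "free" appears three times in the first.
spam = [["free", "free", "free", "money"], ["free", "offer"]]

every_occurrence = train_counts(spam)
once_per_message = train_counts(spam, count_duplicates_only_once=True)
```

In this toy data "free" is counted 4 times by default and 2 times with the option on, so a single message that repeats a word heavily has less influence on that word's training count.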

It's a win for me (total unique fn drops from 14 to 8, while total unique
fp rises from 6 to 8):

    false positive percentages
        0.000  0.000  tied          
        1.000  1.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          
        0.500  1.000  lost  +100.00%
        0.500  0.500  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          
        0.000  0.500  lost  +(was 0)

    won   0 times
    tied  8 times
    lost  2 times

    total unique fp went from 6 to 8 lost   +33.33%
    mean fp % went from 0.3 to 0.4 lost   +33.33%

    false negative percentages
        0.000  0.000  tied          
        1.000  0.500  won    -50.00%
        1.000  0.500  won    -50.00%
        0.500  0.500  tied          
        2.000  1.500  won    -25.00%
        1.500  1.000  won    -33.33%
        0.000  0.000  tied          
        0.500  0.000  won   -100.00%
        0.500  0.000  won   -100.00%
        0.000  0.000  tied          

    won   6 times
    tied  4 times
    lost  0 times

    total unique fn went from 14 to 8 won    -42.86%
    mean fn % went from 0.7 to 0.4 won    -42.86%

    ham mean                     ham sdev
      30.01   27.92   -6.96%        8.43    8.42   -0.12%
      28.50   26.74   -6.18%        8.83    8.69   -1.59%
      27.93   26.04   -6.77%        8.20    7.94   -3.17%
      29.55   27.33   -7.51%        8.24    8.23   -0.12%
      29.05   27.19   -6.40%        8.28    8.15   -1.57%
      31.40   29.48   -6.11%        9.41    9.25   -1.70%
      29.31   27.49   -6.21%        8.13    8.10   -0.37%
      29.33   27.16   -7.40%        7.86    7.89   +0.38%
      28.72   27.22   -5.22%        9.05    8.97   -0.88%
      29.04   26.87   -7.47%        7.28    7.22   -0.82%

    ham mean and sdev for all runs
      29.28   27.34   -6.63%        8.44    8.35   -1.07%

    spam mean                    spam sdev
      82.98   81.91   -1.29%        9.83   10.16   +3.36%
      82.02   81.04   -1.19%        9.92   10.09   +1.71%
      81.19   80.28   -1.12%        9.69    9.86   +1.75%
      82.51   81.66   -1.03%        9.92   10.23   +3.13%
      82.60   81.60   -1.21%       10.12   10.33   +2.08%
      82.24   81.36   -1.07%        9.25    9.71   +4.97%
      81.74   80.85   -1.09%        9.30    9.49   +2.04%
      81.70   80.64   -1.30%        9.51    9.81   +3.15%
      82.39   81.45   -1.14%        9.87   10.18   +3.14%
      82.44   81.45   -1.20%        9.49    9.73   +2.53%

    spam mean and sdev for all runs
      82.18   81.22   -1.17%        9.71    9.97   +2.68%

    ham/spam mean difference: 52.90 53.88 +0.98
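
The summary figures above can be rechecked by hand. The sketch below assumes the percentage column is a plain relative change between the "before" and "after" values, and that the last line is the spam mean minus the ham mean over all runs; `pct_change` is a hypothetical helper, not the comparison script that produced the output.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent.

    Returns None when before is 0, since the change is then undefined
    (the output above prints "+(was 0)" for that case).
    """
    if before == 0:
        return None
    return (after - before) / before * 100.0


# Totals taken from the output above.
fp_change = pct_change(6, 8)    # total unique fp
fn_change = pct_change(14, 8)   # total unique fn

# Spam mean minus ham mean over all runs, before and after.
diff_before = 82.18 - 29.28
diff_after = 81.22 - 27.34
separation_gain = diff_after - diff_before
```

Rounded to two places these give +33.33% for fp, -42.86% for fn, and a mean difference going from 52.90 to 53.88 (+0.98), matching the reported lines.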