[Spambayes] There Can Be Only One

Tim Peters tim.one@comcast.net
Wed, 25 Sep 2002 20:04:31 -0400


Here's an interesting experiment:  max_discriminators=1.  That is, only look
at the single strongest clue in a message.  Unsurprisingly, this gives a
very Graham-like bipolar distribution.  But it does surprisingly well for me
(left column is max_discriminators=150, right column is 1):

false positive percentages
    0.500  1.500  lost  +200.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.500  lost  +(was 0)
    0.000  0.500  lost  +(was 0)
    0.000  0.000  tied
    0.000  0.500  lost  +(was 0)
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.500  lost  +(was 0)

won   0 times
tied  5 times
lost  5 times

total unique fp went from 1 to 7 lost  +600.00%
mean fp % went from 0.05 to 0.35 lost  +600.00%

false negative percentages
    0.000  0.000  tied
    0.000  0.500  lost  +(was 0)
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  1.000  lost  +(was 0)
    0.000  0.000  tied
    0.000  0.500  lost  +(was 0)
    0.000  0.000  tied

won   0 times
tied  7 times
lost  3 times

total unique fn went from 0 to 4 lost  +(was 0)
mean fn % went from 0.0 to 0.2 lost  +(was 0)

ham mean                     ham sdev
  33.01    1.57  -95.24%        6.26   12.13  +93.77%
  32.19    0.05  -99.84%        5.38    0.16  -97.03%
  32.99    0.04  -99.88%        5.60    0.11  -98.04%
  33.46    0.54  -98.39%        5.77    7.03  +21.84%
  33.16    0.57  -98.28%        5.56    7.02  +26.26%
  32.81    0.06  -99.82%        5.72    0.15  -97.38%
  33.38    0.55  -98.35%        5.76    7.02  +21.87%
  32.55    0.07  -99.78%        5.70    0.35  -93.86%
  33.11    0.07  -99.79%        5.52    0.25  -95.47%
  34.21    0.55  -98.39%        5.84    7.01  +20.03%

ham mean and sdev for all runs
  33.09    0.41  -98.76%        5.73    5.89   +2.79%

spam mean                    spam sdev
  82.95   99.90  +20.43%        6.82    0.15  -97.80%
  82.17   99.36  +20.92%        6.34    7.04  +11.04%
  82.06   99.88  +21.72%        6.14    0.28  -95.44%
  82.39   99.91  +21.26%        5.93    0.10  -98.31%
  82.53   99.89  +21.03%        7.00    0.14  -98.00%
  82.76   99.91  +20.72%        6.56    0.17  -97.41%
  82.06   98.91  +20.53%        5.73    9.82  +71.38%
  82.26   99.87  +21.41%        5.97    0.28  -95.31%
  82.65   99.38  +20.24%        6.71    6.60   -1.64%
  83.43   99.88  +19.72%        6.37    0.32  -94.98%

spam mean and sdev for all runs
  82.53   99.69  +20.79%        6.37    4.37  -31.40%

ham/spam mean difference: 49.44 99.28 +49.84
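
One way to read the sdev columns: with a bipolar distribution, nearly every
score sits at one extreme, so a run's sdev is driven almost entirely by how
many messages land at the wrong end.  Here is a rough sketch of that
arithmetic.  It assumes 200 ham per run (inferred from the 0.500% step in
the fp table, not stated anywhere above), scores of exactly 0 or 100, and a
population sdev; two_point_stats is a made-up helper, not part of the test
driver.

```python
from math import sqrt

def two_point_stats(n_total, n_wrong, low=0.0, high=100.0):
    """Mean and population sdev when n_wrong of n_total scores sit at
    `high` and all the others sit at `low` (an idealized bipolar case)."""
    p = n_wrong / n_total
    mean = low + p * (high - low)
    sdev = (high - low) * sqrt(p * (1 - p))
    return mean, sdev

# Assumption: 200 ham per run; 0, 1, or 3 of them score as spam.
for fps in (0, 1, 3):
    mean, sdev = two_point_stats(200, fps)
    print(f"{fps} fp of 200 ham: mean {mean:.2f}, sdev {sdev:.2f}")
```

Under those assumptions one false positive gives an sdev of about 7.05 and
three give about 12.16, close to the 7.0x and 12.13 rows in the ham table,
while the runs with no false positives stay near 0.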

The wild swings across runs in the ham and spam sdevs suggest it's not a
very stable approach <heh>.
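
For anyone who hasn't looked at what max_discriminators limits, here is a
minimal sketch of the idea, not the classifier's actual code.  It assumes a
clue's strength is the distance of its spam probability from 0.5;
strongest_clues and the word probabilities are hypothetical, for
illustration only.

```python
def strongest_clues(word_probs, max_discriminators=150):
    """Return up to max_discriminators (word, prob) pairs, ordered by
    how far each probability lies from the neutral value 0.5."""
    ranked = sorted(word_probs.items(),
                    key=lambda kv: abs(kv[1] - 0.5),
                    reverse=True)
    return ranked[:max_discriminators]

# Hypothetical per-word spam probabilities for a single message.
probs = {"viagra": 0.99, "python": 0.02, "the": 0.50, "free": 0.85}

# With max_discriminators=1 only the most extreme word survives, so the
# message's score is whatever that one word says.
print(strongest_clues(probs, max_discriminators=1))
```

With a cutoff of 1 the combined score can only be as moderate as the single
most extreme word, which is consistent with scores piling up at the two
ends.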