[Spambayes] Moving closer to Gary's ideal

Neil Schemenauer nas@python.ca
Sat, 21 Sep 2002 10:23:28 -0700


Tim Peters wrote:
> Everyone who tests this (please do!  it looks very promising, although my
> data only supports that it's not a regression -- I *expect* it will do
> better for some of you), pay attention to your score histograms and figure
> out the best value for spam_cutoff from them.

Here are my distributions:

    Ham distribution for all runs:
    * = 5 items
     17.50   0
     20.00   3 *
     22.50  12 ***
     25.00  57 ************
     27.50 127 **************************
     30.00 209 ******************************************
     32.50 270 ******************************************************
     35.00 292 ***********************************************************
     37.50 266 ******************************************************
     40.00 213 *******************************************
     42.50 130 **************************
     45.00  85 *****************
     47.50  52 ***********
     50.00  49 **********
     52.50  22 *****
     55.00  10 **
     57.50   3 *
     60.00   0

    Spam distribution for all runs:
    * = 5 items
     45.00   0
     47.50   1 *
     50.00   4 *
     52.50  10 **
     55.00  30 ******
     57.50  64 *************
     60.00  88 ******************
     62.50 151 *******************************
     65.00 192 ***************************************
     67.50 269 ******************************************************
     70.00 256 ****************************************************
     72.50 215 *******************************************
     75.00 164 *********************************
     77.50 115 ***********************
     80.00 107 **********************
     82.50  73 ***************
     85.00  42 *********
     87.50  18 ****
     90.00   1 *
     92.50   0
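
A rough way to read a cutoff off the two histograms above is to count, for each bucket boundary, how many ham would land at or above it (false positives) and how many spam below it (false negatives). The sketch below does that with the bucket counts copied from the histograms. It rests on assumptions: each label is taken to be the lower edge of a 2.5-wide bucket, scores are on the 0-100 scale used in the histograms (so 55.0 corresponds to spam_cutoff: 0.55), and an fp and an fn are weighted equally. The names `HAM`, `SPAM`, `errors_at` and `best_cutoff` are made up for this illustration and are not part of the spambayes code.

```python
# Bucket label -> count, copied from the histograms above.
# Assumption: each label is the lower edge of a 2.5-wide score bucket.
HAM = {
    20.0: 3, 22.5: 12, 25.0: 57, 27.5: 127, 30.0: 209, 32.5: 270,
    35.0: 292, 37.5: 266, 40.0: 213, 42.5: 130, 45.0: 85, 47.5: 52,
    50.0: 49, 52.5: 22, 55.0: 10, 57.5: 3,
}
SPAM = {
    47.5: 1, 50.0: 4, 52.5: 10, 55.0: 30, 57.5: 64, 60.0: 88,
    62.5: 151, 65.0: 192, 67.5: 269, 70.0: 256, 72.5: 215, 75.0: 164,
    77.5: 115, 80.0: 107, 82.5: 73, 85.0: 42, 87.5: 18, 90.0: 1,
}


def errors_at(cutoff):
    """Return (fp, fn) when every bucket at or above cutoff is called spam."""
    fp = sum(n for edge, n in HAM.items() if edge >= cutoff)
    fn = sum(n for edge, n in SPAM.items() if edge < cutoff)
    return fp, fn


def best_cutoff(candidates):
    """Pick the candidate with the fewest total errors (fp and fn weighted equally)."""
    return min(candidates, key=lambda c: sum(errors_at(c)))


if __name__ == "__main__":
    # Only bucket boundaries can be evaluated from histogram data.
    candidates = [45.0 + 2.5 * i for i in range(7)]  # 45.0 .. 60.0
    for c in candidates:
        fp, fn = errors_at(c)
        print("%5.2f  fp=%3d  fn=%3d" % (c, fp, fn))
    print("fewest total errors at", best_cutoff(candidates))
```

The histograms only give 2.5-point resolution, so this can narrow the choice to a bucket boundary at best; a finer cutoff needs the individual scores. Weighting an fp more heavily than an fn would push the choice upward.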

With "spam_cutoff: 0.56":

    false positive percentages
        0.667  0.667  tied
        0.000  0.333  lost  +(was 0)
        1.000  0.667  won    -33.30%
        0.333  0.667  lost  +100.30%
        0.000  0.333  lost  +(was 0)
        0.000  0.000  tied

    won   1 times
    tied  2 times
    lost  3 times

    total unique fp went from 6 to 8 lost   +33.33%
    mean fp % went from 0.333333333333 to 0.444444444444 lost   +33.33%

    false negative percentages
        0.333  1.333  lost  +300.30%
        1.333  1.667  lost   +25.06%
        1.667  1.667  tied
        0.333  0.000  won   -100.00%
        1.333  2.000  lost   +50.04%
        1.667  1.667  tied

    won   1 times
    tied  2 times
    lost  3 times

    total unique fn went from 20 to 25 lost   +25.00%
    mean fn % went from 1.11111111111 to 1.38888888889 lost   +25.00%
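
The percentage columns in the output above are consistent with a plain relative change computed from the rounded rates as printed, which would explain figures like +100.30% for 0.333 -> 0.667. A minimal sketch of that arithmetic follows; `pct_change` is a name invented here, and this is not the comparison script's actual code.

```python
def pct_change(before, after):
    """Relative change from before to after, as a percentage.

    Returns None when before is 0, where the table prints "+(was 0)".
    """
    if before == 0:
        return None
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for before, after in [(0.333, 0.667), (1.333, 1.667), (6, 8), (20, 25)]:
        print("%s -> %s  %+.2f%%" % (before, after, pct_change(before, after)))
```

Because the inputs are already rounded to three places, a doubling from 1/3% to 2/3% shows up as +100.30% rather than +100.00%.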

Reducing max_discriminators seems to make things worse.

  Neil