[Spambayes] Moving closer to Gary's ideal
Neil Schemenauer
nas@python.ca
Sat, 21 Sep 2002 10:23:28 -0700
Tim Peters wrote:
> Everyone who tests this (please do! it looks very promising, although my
> data only supports that it's not a regression -- I *expect* it will do
> better for some of you), pay attention to your score histograms and figure
> out the best value for spam_cutoff from them.
Here are my distributions:
Ham distribution for all runs:
* = 5 items
17.50 0
20.00 3 *
22.50 12 ***
25.00 57 ************
27.50 127 **************************
30.00 209 ******************************************
32.50 270 ******************************************************
35.00 292 ***********************************************************
37.50 266 ******************************************************
40.00 213 *******************************************
42.50 130 **************************
45.00 85 *****************
47.50 52 ***********
50.00 49 **********
52.50 22 *****
55.00 10 **
57.50 3 *
60.00 0
Spam distribution for all runs:
* = 5 items
45.00 0
47.50 1 *
50.00 4 *
52.50 10 **
55.00 30 ******
57.50 64 *************
60.00 88 ******************
62.50 151 *******************************
65.00 192 ***************************************
67.50 269 ******************************************************
70.00 256 ****************************************************
72.50 215 *******************************************
75.00 164 *********************************
77.50 115 ***********************
80.00 107 **********************
82.50 73 ***************
85.00 42 *********
87.50 18 ****
90.00 1 *
92.50 0
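As a rough way to read a cutoff off these histograms, here is a small sketch that counts how many ham would land at or above a candidate cutoff and how many spam would land below it. The bucket counts are copied from the two histograms above. The names HAM, SPAM and errors_at are made up for illustration and are not part of the Spambayes code. It also assumes each label is the lower edge of a 2.5-wide bucket and that a score at or above the cutoff is called spam, so it can only give estimates at bucket edges.

```python
# Bucket counts copied from the histograms above (empty buckets omitted).
# Assumption: each label is the lower edge of a 2.5-point-wide bucket.
HAM = {
    20.0: 3, 22.5: 12, 25.0: 57, 27.5: 127, 30.0: 209, 32.5: 270,
    35.0: 292, 37.5: 266, 40.0: 213, 42.5: 130, 45.0: 85, 47.5: 52,
    50.0: 49, 52.5: 22, 55.0: 10, 57.5: 3,
}
SPAM = {
    47.5: 1, 50.0: 4, 52.5: 10, 55.0: 30, 57.5: 64, 60.0: 88,
    62.5: 151, 65.0: 192, 67.5: 269, 70.0: 256, 72.5: 215, 75.0: 164,
    77.5: 115, 80.0: 107, 82.5: 73, 85.0: 42, 87.5: 18, 90.0: 1,
}


def errors_at(cutoff, ham=HAM, spam=SPAM):
    """Return (false positives, false negatives) for a cutoff on a bucket edge.

    Assumption: a message scoring at or above the cutoff is called spam.
    """
    fp = sum(n for edge, n in ham.items() if edge >= cutoff)
    fn = sum(n for edge, n in spam.items() if edge < cutoff)
    return fp, fn


if __name__ == "__main__":
    for cutoff in (50.0, 52.5, 55.0, 57.5, 60.0):
        fp, fn = errors_at(cutoff)
        print(f"cutoff {cutoff / 100:.3f}: fp={fp:3d} fn={fn:3d}")
```

On these counts the crossover sits between 0.55 and 0.575 (13 fp / 15 fn at 0.55, 3 fp / 45 fn at 0.575), which is why a cutoff inside the 55.00 bucket looks like a reasonable place to start.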
With "spam_cutoff: 0.56":
false positive percentages
0.667 0.667 tied
0.000 0.333 lost +(was 0)
1.000 0.667 won -33.30%
0.333 0.667 lost +100.30%
0.000 0.333 lost +(was 0)
0.000 0.000 tied
won 1 times
tied 2 times
lost 3 times
total unique fp went from 6 to 8 lost +33.33%
mean fp % went from 0.333333333333 to 0.444444444444 lost +33.33%
false negative percentages
0.333 1.333 lost +300.30%
1.333 1.667 lost +25.06%
1.667 1.667 tied
0.333 0.000 won -100.00%
1.333 2.000 lost +50.04%
1.667 1.667 tied
won 1 times
tied 2 times
lost 3 times
total unique fn went from 20 to 25 lost +25.00%
mean fn % went from 1.11111111111 to 1.38888888889 lost +25.00%
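For anyone checking the arithmetic in the summary lines, the sketch below reproduces the percentages. The helper names pct_change and mean_pct are made up here and are not the comparison script's own code. The figure of 1800 messages per class is an assumption obtained by summing the histogram counts, not something the output states directly.

```python
# Assumption: 1800 ham and 1800 spam in total, from summing the histograms.
TOTAL_PER_CLASS = 1800


def pct_change(before, after):
    """Relative change in percent; None when 'before' is 0 (the 'was 0' rows)."""
    if before == 0:
        return None
    return (after - before) / before * 100.0


def mean_pct(total_errors, total_msgs=TOTAL_PER_CLASS):
    """Error rate in percent over all messages of one class."""
    return total_errors / total_msgs * 100.0


if __name__ == "__main__":
    print(f"unique fp 6 -> 8:   {pct_change(6, 8):+.2f}%")
    print(f"unique fn 20 -> 25: {pct_change(20, 25):+.2f}%")
    print(f"mean fp %: {mean_pct(6):.4f} -> {mean_pct(8):.4f}")
    print(f"mean fn %: {mean_pct(20):.4f} -> {mean_pct(25):.4f}")
    # Per-run rows work from the rounded three-decimal rates, which is
    # why a doubling shows up as +100.30% rather than +100.00%.
    print(f"0.333 -> 0.667: {pct_change(0.333, 0.667):+.2f}%")
```

With so few errors per run, one extra message moves a row by a third of a percentage point, so the per-run won/lost counts are noisy and the totals are the more useful lines.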
Reducing max_discriminators seems to make things worse.
Neil