[Spambayes] Two Scheme Enter, One Scheme Leave.

Anthony Baxter anthony@interlink.com.au
Thu, 26 Sep 2002 01:18:11 +1000


Part the second...

A brief side trip into fiddling with robinson_probability_x showed that
setting it to 0.4 and 0.6 (instead of the default 0.5) had no real
effect on fp/fn numbers, but resulted in average ham and spam numbers
being around 1% lower and higher, respectively.
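
To show where x comes into play, here is a minimal sketch of the
Robinson adjustment as I understand it. It is not copied from the
classifier code: the function name, argument names and sample counts
are assumptions for illustration. p is the raw spam probability for a
word, n is how many messages the word was seen in, a is the weight
given to the prior, and x is the prior guess for an unknown word.

```python
def robinson_adjust(p, n, a=0.1, x=0.5):
    """Blend the raw word probability p (seen in n messages) with the prior x.

    a is the strength given to the prior. All names are illustrative only.
    """
    return (a * x + n * p) / (a + n)


# A word with no history (n=0) falls back to x; a word seen 50 times
# is dominated by its own counts and barely moves when x changes.
for x in (0.4, 0.5, 0.6):
    print(x, robinson_adjust(0.9, 0, x=x), robinson_adjust(0.9, 50, x=x))
```

With a as small as 0.1, x only has much pull on rarely seen words,
which may be part of why the fp/fn counts hardly moved.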


min_prob_strength is next. Carrying over the best settings so far
(cutoff=0.6, a=0.1, x=0.5):

        fp      fn      fp+fn
0.00     7      50       57
0.05     8      23       31
0.08     9      21       30
0.09     9      21       30
0.10     9      21       30
0.11    12      20       32
0.12    12      20       32
0.15    13      19       32
0.20    23      19       42
0.25    23      18       41
0.30    28      15       43
0.35    29      17       46
0.40    36      17       53
0.45    51      17       68
0.49    75      32      107
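
For reference, a rough sketch of what the min_prob_strength knob swept
in the table above is meant to do, assuming it drops any word whose
probability lies within that distance of the neutral 0.5. The function
name and the word probabilities below are made up for illustration and
are not taken from the test corpus.

```python
def strong_clues(word_probs, min_prob_strength=0.09):
    """Keep only words whose probability is far enough from the neutral 0.5."""
    return {
        word: prob
        for word, prob in word_probs.items()
        if abs(prob - 0.5) >= min_prob_strength
    }


# Hypothetical per-word spam probabilities for one message.
probs = {"viagra": 0.99, "the": 0.52, "python": 0.03, "meeting": 0.42}

print(sorted(strong_clues(probs, 0.09)))  # weak clues near 0.5 are dropped
print(sorted(strong_clues(probs, 0.00)))  # nothing is dropped
```

At 0.00 every word counts as a clue, and as the setting approaches 0.5
almost nothing does, which would fit with both ends of the table
doing badly.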

The "best" cutoff numbers for the different min_prob_strength settings:

        fp      fn      fp+fn   cutoff
0.00    13      24       37     0.575
0.05     8      23       31     0.6
0.08     9      21       30     0.6
0.09     9      21       30     0.6
0.10     9      21       30     0.6
0.11    12      20       32     0.6
0.12    12      20       32     0.6
0.15    13      19       32     0.6
0.20    13      26       39     0.625
0.25    23      18       41     0.6
0.30    28      15       43     0.6
0.35    23      21       44     0.625
0.40    27      18       45     0.625
0.45    40      27       67     0.625
0.49    77      26      103     0.575
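
A quick way to compare the two tables, as a sketch only: the numbers
are typed in by hand from the tables above, and the helper names are
my own. It finds the rows with the lowest fp+fn in each table and
shows how much retuning the cutoff gained for each min_prob_strength.

```python
# (min_prob_strength, fp, fn) with the cutoff fixed at 0.6
FIXED = [
    (0.00, 7, 50), (0.05, 8, 23), (0.08, 9, 21), (0.09, 9, 21),
    (0.10, 9, 21), (0.11, 12, 20), (0.12, 12, 20), (0.15, 13, 19),
    (0.20, 23, 19), (0.25, 23, 18), (0.30, 28, 15), (0.35, 29, 17),
    (0.40, 36, 17), (0.45, 51, 17), (0.49, 75, 32),
]

# (min_prob_strength, fp, fn, best cutoff)
BEST = [
    (0.00, 13, 24, 0.575), (0.05, 8, 23, 0.6), (0.08, 9, 21, 0.6),
    (0.09, 9, 21, 0.6), (0.10, 9, 21, 0.6), (0.11, 12, 20, 0.6),
    (0.12, 12, 20, 0.6), (0.15, 13, 19, 0.6), (0.20, 13, 26, 0.625),
    (0.25, 23, 18, 0.6), (0.30, 28, 15, 0.6), (0.35, 23, 21, 0.625),
    (0.40, 27, 18, 0.625), (0.45, 40, 27, 0.625), (0.49, 77, 26, 0.575),
]


def lowest_total(rows):
    """Return (lowest fp+fn, list of min_prob_strength values achieving it)."""
    low = min(row[1] + row[2] for row in rows)
    return low, [row[0] for row in rows if row[1] + row[2] == low]


def cutoff_gain(fixed, best):
    """Map min_prob_strength to the fp+fn saved by using the best cutoff."""
    fixed_totals = {row[0]: row[1] + row[2] for row in fixed}
    return {row[0]: fixed_totals[row[0]] - (row[1] + row[2]) for row in best}


print(lowest_total(FIXED))
print(lowest_total(BEST))
print({mps: gain for mps, gain in cutoff_gain(FIXED, BEST).items() if gain})
```

Note that this only ranks on fp+fn, which treats a false positive as
no worse than a false negative.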

That's it for tonight. If people (well, ok, Tim) want more detail, 
let me know, and let me know what you want to see. All up, just the 
test_foo_2s.txt summary files alone are about 4M of data (about 35 
test runs). If the tram ride to work tomorrow is slow, I might write 
something to run through all the data files and try to load them all
up into some sort of 4d array or something, to see if anything
interesting shows up...

Tomorrow, I'll try the current "best settings"
(cutoff=0.6, a=0.1, x=0.5, min_prob_strength=0.09) 
with a few different seeds, compared to Graham, and also try with
different spam/ham corpus sizes.

Anthony