[Spambayes] More experiments with weaktest.py

Rob Hooft rob@hooft.net
Sat Nov 9 23:46:02 2002


These are the results of weaktest with default parameters:

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 336 (5.1%)
   Trained on 178 ham and 162 spam
   fp: 2 fn: 2
   Total cost: $89.20
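
The cost figures in this post are consistent with a charge of $10 per
false positive, $1 per false negative and $0.20 per unsure message. A
minimal sketch of that arithmetic follows; the function name and the
integer-cent weights are illustrative, inferred from the reported totals
rather than copied from weaktest.py:

```python
# Weights in cents, inferred from the totals reported in this post
# (assumption: these are not copied from weaktest.py itself).
FP_CENTS = 1000      # $10.00 per false positive
FN_CENTS = 100       # $1.00 per false negative
UNSURE_CENTS = 20    # $0.20 per unsure message


def total_cost(fp, fn, unsure):
    """Return the total cost in dollars for one run."""
    cents = fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS
    return cents / 100


# Default run: fp=2, fn=2, unsure=336
print(f"${total_cost(2, 2, 336):.2f}")  # $89.20
```

Working in whole cents avoids floating-point noise when summing many
$0.20 charges.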

If I lower "ham_cutoff" from 20 to 10 to make things more symmetrical 
(spam_cutoff is 90 by default):

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 442 (6.8%)
   Trained on 292 ham and 152 spam
   fp: 2 fn: 0
   Total cost: $108.40
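
In both runs so far the "Trained on" total equals unsure + fp + fn
(178 + 162 = 336 + 2 + 2 and 292 + 152 = 442 + 2 + 0), so the database
size can be taken as the number of trained messages. A quick sketch of
the comparison with the default run, using helper names that are
illustrative and not part of weaktest.py:

```python
def trained_total(ham, spam):
    """Number of messages the classifier was trained on."""
    return ham + spam


def growth_percent(baseline, new):
    """Relative change in database size versus the baseline, in percent."""
    return 100.0 * (new - baseline) / baseline


default_run = trained_total(178, 162)   # 340
ham10_run = trained_total(292, 152)     # 444

# Sanity check: trained messages = unsure + fp + fn in both runs.
assert default_run == 336 + 2 + 2
assert ham10_run == 442 + 2 + 0

print(f"{growth_percent(default_run, ham10_run):.1f}%")   # 30.6%
print(f"ham:spam = {292 / 152:.2f}:1")                    # ham:spam = 1.92:1
```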

So the database grows by 30%, but that doesn't help my cost. The training 
set is now imbalanced about 2:1 (ham:spam). Next, set spam_cutoff to 80 
and ham_cutoff back to the default 20:

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 304 (4.6%)
   Trained on 213 ham and 101 spam
   fp: 7 fn: 3
   Total cost: $133.80

This reduces the database by only about 8% relative to the defaults, but
at a very high fp cost. Same 2:1 imbalance in the training set.
Back to the default 20:90 then, and set minimum_prob_strength to 0.0
(default is 0.1):

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 933 (14.3%)
   Trained on 497 ham and 437 spam
   fp: 0 fn: 1
   Total cost: $187.60

OK, so that didn't work either. How about setting it to 0.2?

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 304 (4.6%)
   Trained on 134 ham and 177 spam
   fp: 2 fn: 5
   Total cost: $85.80

Hm. That is slightly better. Funny, we are suddenly training on more 
spam than ham... Back to 0.1 anyway, since the differences are too small, 
and set robinson_probability_x = 0.3 (default is 0.5):

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 602 (9.2%)
   Trained on 54 ham and 616 spam
   fp: 1 fn: 67
   Total cost: $197.40
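
The "Trained on" line gives a quick read on how lopsided this run is.
Assuming weaktest trains only on messages that were unsure or
misclassified (the totals support this: 54 + 616 = 602 + 1 + 67), the
share of spam that was not scored solidly can be sketched as follows;
the variable names are illustrative:

```python
total_spam = 1740
trained_ham, trained_spam = 54, 616
unsure, fp, fn = 602, 1, 67

# Every trained message was either unsure or misclassified.
assert trained_ham + trained_spam == unsure + fp + fn

spam_per_ham = trained_spam / trained_ham
not_solid = trained_spam / total_spam

print(f"ham:spam trained = 1:{spam_per_ham:.1f}")     # ham:spam trained = 1:11.4
print(f"spam not scored solidly: {not_solid:.1%}")    # spam not scored solidly: 35.4%
```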

Very interesting: this changes the training ratio to about 1:11, at huge
cost! (More than one in three spams was not recognized solidly as such.)
I wonder what this could do if changed together with the cutoff...
Let's move it back to 0.5, and try "robinson_probability_s = 0.3":

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 348 (5.3%)
   Trained on 237 ham and 120 spam
   fp: 7 fn: 2
   Total cost: $141.60

Ouf.

I am back with the defaults, but I'd still like to do an automated 
optimization of everything simultaneously. Might try that.
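
One simple way to vary several options at once is an exhaustive grid
search. The sketch below is only an illustration under stated
assumptions: "evaluate" is a hypothetical callable that would run
weaktest with the given settings and return the total cost, and the
stand-in cost function is made up so that the example runs on its own.

```python
import itertools


def grid_search(evaluate, grid):
    """Try every combination in grid and return (best_cost, best_settings).

    evaluate: hypothetical callable mapping a settings dict to a cost.
    grid: dict mapping an option name to the list of values to try.
    """
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        settings = dict(zip(names, values))
        cost = evaluate(settings)
        if best is None or cost < best[0]:
            best = (cost, settings)
    return best


def fake_cost(settings):
    # Made-up stand-in, NOT real weaktest output: a bowl with its
    # minimum at ham_cutoff=20, spam_cutoff=90.
    return ((settings["ham_cutoff"] - 20) ** 2
            + (settings["spam_cutoff"] - 90) ** 2)


grid = {"ham_cutoff": [10, 20, 30], "spam_cutoff": [80, 90, 95]}
print(grid_search(fake_cost, grid))
```

The number of evaluations is the product of the list lengths, and each
one would be a full weaktest run, so the grid has to stay small.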

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/



