[Spambayes] More experiments with weaktest.py
Tim Peters
tim.one@comcast.net
Sun Nov 10 07:27:38 2002
[Rob Hooft]
> These were results of weaktest with default parameters:
Very interesting! I'll have to try that too. Note that in my live email
experiment here, I'm also scoring/training msgs in the order they arrive
(except at the very start, and with small lapses).  It's been reported
before that this helps; I still haven't run a controlled experiment on
that, but my *impression* is that it does help.
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 336 (5.1%)
> Trained on 178 ham and 162 spam
> fp: 2 fn: 2
> Total cost: $89.20
>
> If I set the "ham_cutoff" to 10 from 20 to make things more symmetrical
> (spam_cutoff is 90 by default):
The asymmetry is intentional: most people hate FP more than FN, so by
default I made it harder for a thing to get called spam. In test after test
we've also seen that spam has a tighter score distribution than ham, which
is a more formal justification for setting the spam cutoff closer to its
endpoint than the ham cutoff. Setting ham_cutoff as low as 10 is for the
truly paranoid <0.9 wink>.
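For reference, the dollar figures throughout this message follow the test
harness's fixed cost weights: every total quoted below is consistent with $10
per false positive, $1 per false negative, and $0.20 per unsure message (the
parameter names in this sketch are illustrative, not the actual option names):

```python
def total_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Cost model behind the harness's $ totals: false positives are
    weighted heaviest, and each unsure message costs a little human
    attention."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

# Reproduces the first run quoted below: fp=2, fn=2, 336 unsures
print(f"${total_cost(fp=2, fn=2, unsure=336):.2f}")  # -> $89.20
```

This is why shrinking the unsure pile and avoiding FP dominate everything
else: one FP costs as much as 50 unsures.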
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 442 (6.8%)
> Trained on 292 ham and 152 spam
> fp: 2 fn: 0
> Total cost: $108.40
>
> So the database grows by 30% but it didn't help my cost. The training
> set is now unbalanced 2:1. Set spam_cutoff to 80 and ham_cutoff back to
> the default 20:
>
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 304 (4.6%)
> Trained on 213 ham and 101 spam
> fp: 7 fn: 3
> Total cost: $133.80
>
> This reduces the database by only 10%, but at very high fp cost. Same
> 2:1 unbalance in the training set.
> Back to the default 20:90 then, and set the minimum_prob_strength to 0.0:
>
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 933 (14.3%)
> Trained on 497 ham and 437 spam
> fp: 0 fn: 1
> Total cost: $187.60
>
> OK, so that didn't work either. How about setting it to 0.2?
>
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 304 (4.6%)
> Trained on 134 ham and 177 spam
> fp: 2 fn: 5
> Total cost: $85.80
>
> Hm. That is slightly better. Funny, we are suddenly training on more
> spam than ham....  Back to 0.1 anyway (the differences are too small)
> and set robinson_probability_x = 0.3 (default is 0.5):
>
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 602 (9.2%)
> Trained on 54 ham and 616 spam
> fp: 1 fn: 67
> Total cost: $197.40
>
> Very interesting: this changes the training ratio to 1:12, at huge cost!
> (less than one in three spams was recognized solidly as such).
Note that in calculations I reported a day or two ago, the measured mean of
spamprobs across 3 different corpora was > 0.5, but not by a lot.  Setting x
to 0.3 moves the unknown-word prior outside the band minimum_prob_strength
ignores, so now every "new word" is instantly taken as a ham clue, where
before all new words were ignored by default.  So it isn't surprising that
this grossly inflated the FN rate;
everything that will *eventually* become a hapax is initially taken to be a
ham clue, even when it's never been seen before.
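To make the mechanics concrete: as I understand Robinson's adjustment, a word
seen in n messages with raw spamprob estimate p scores (s*x + n*p)/(s + n),
so a never-seen word (n = 0) scores exactly x; clues whose score sits within
minimum_prob_strength of 0.5 are then discarded.  A sketch (the s value and
the exact boundary handling here are assumptions for illustration):

```python
def adjusted_spamprob(p, n, s=0.45, x=0.5):
    """Robinson's prior adjustment: p is the raw spamprob for a word, n the
    number of messages it's been seen in, x the assumed probability for
    unknown words, s the strength of that assumption (s=0.45 is an
    illustrative value, not necessarily the default)."""
    return (s * x + n * p) / (s + n)

def is_used_as_clue(prob, minimum_prob_strength=0.1):
    """Words scoring within minimum_prob_strength of 0.5 contribute
    nothing to a message's score."""
    return abs(prob - 0.5) >= minimum_prob_strength

# With x=0.5, a never-seen word scores 0.5 and falls inside the ignored band;
# with x=0.3, the same word scores 0.3 and counts as a ham clue immediately.
print(is_used_as_clue(adjusted_spamprob(p=0.0, n=0)))         # False
print(is_used_as_clue(adjusted_spamprob(p=0.0, n=0, x=0.3)))  # True
```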
> Wonder what this could do if changed together with the cutoff....
> Let's move it back to 0.5, and try "robinson_probability_s = 0.3":
>
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 348 (5.3%)
> Trained on 237 ham and 120 spam
> fp: 7 fn: 2
> Total cost: $141.60
>
> Ouf.
I hope you're at least gaining some respect for how much work went into
picking the defaults <wink>.
> I am back with the defaults, but I'd still like to do an automated
> optimization of everything simultaneously. Might try that.
Now *that* could be a useful system regardless of scheme. I've tended to do
hill-climbing across one dimension at a time, occasionally moving batches of
params random amounts at once (to see whether that kicks it out of a
stubborn local minimum).
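A rough sketch of that kind of search, with `cost_fn` standing in for a full
weaktest.py run (the parameter names, step sizes, and kick schedule are all
hypothetical):

```python
import random

def hill_climb(params, cost_fn, steps, rounds=10, kick_prob=0.2, rng=random):
    """Coordinate-wise hill-climbing: nudge one parameter at a time, keeping
    any change that lowers the cost.  Occasionally perturb all parameters by
    random amounts at once, to try to escape a stubborn local minimum."""
    best = dict(params)
    best_cost = cost_fn(best)
    for _ in range(rounds):
        # one dimension at a time
        for name, step in steps.items():
            for delta in (-step, step):
                trial = dict(best)
                trial[name] += delta
                c = cost_fn(trial)
                if c < best_cost:
                    best, best_cost = trial, c
        # occasional batch move: random amounts on every parameter at once
        if rng.random() < kick_prob:
            trial = {name: best[name] + rng.uniform(-3 * step, 3 * step)
                     for name, step in steps.items()}
            c = cost_fn(trial)
            if c < best_cost:
                best, best_cost = trial, c
    return best, best_cost

# Toy cost surface standing in for a real weaktest.py run:
cost = lambda p: (p["spam_cutoff"] - 0.9) ** 2 + (p["ham_cutoff"] - 0.2) ** 2
best, best_cost = hill_climb({"spam_cutoff": 0.8, "ham_cutoff": 0.3}, cost,
                             steps={"spam_cutoff": 0.01, "ham_cutoff": 0.01})
```

Since each weaktest.py run is expensive, in practice the cost function would
cache results and the search would be far stingier with evaluations than
this sketch.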