[Spambayes] Better optimization loop
Tim Peters
tim.one@comcast.net
Tue Nov 19 02:54:09 2002
[Rob Hooft, simplifying simplex]
> ...
> I decided that we have a perfect way to optimize the ham and spam
> cutoff values in timcv already, so that I can remove these from the
> simplex optimization.
Good observation! That should help. simplex isn't fast in the best of
cases, and in this case ...
> To that goal I added a "delayed" flexcost to the CostCounter module
> that can use the optimal cutoffs calculated at the end of timcv.py.
Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of 0.99
and spam_cutoff of 0.995 to get rid of "impossible" FP.
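For concreteness, the best-cost scan that produces those suggestions amounts to brute force over cutoff pairs. The $10/$1/$0.20 weights are the tester's defaults (they reproduce the Standard Cost figures quoted below), but `best_cutoffs` and its tie-breaking are an illustrative sketch, not timcv's actual code:

```python
def best_cutoffs(scores, is_spam, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Scan all (ham_cutoff, spam_cutoff) pairs drawn from the observed
    scores and return the cheapest (cost, ham_cut, spam_cut).
    Illustrative sketch, not the actual timcv code."""
    candidates = sorted(set(scores) | {0.0, 1.0})
    best = (float('inf'), None, None)
    for i, ham_cut in enumerate(candidates):
        for spam_cut in candidates[i:]:     # ensure ham_cut <= spam_cut
            cost = 0.0
            for score, spam in zip(scores, is_spam):
                if score <= ham_cut:        # classified ham
                    cost += fn_cost if spam else 0.0
                elif score >= spam_cut:     # classified spam
                    cost += 0.0 if spam else fp_cost
                else:                       # unsure
                    cost += unsure_cost
            if cost < best[0]:
                best = (cost, ham_cut, spam_cut)
    return best

# Toy data: three hams (one borderline) and two spams.
scores = [0.01, 0.05, 0.55, 0.97, 0.99]
labels = [False, False, False, True, True]
cost, ham_cut, spam_cut = best_cutoffs(scores, labels)
```

With one "impossible" ham scoring high, the same scan happily pushes both cutoffs toward 1.0 to dodge the $10 FP charge, which is exactly how the extreme suggestions arise.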
> And there are only three variables left to optimize using simplex
>
> I then ran one optimization on my complete (16000+5800) corpus. The
> result is that it is fighting very hard to remove fp's while
> introducing lots of unsure messages:
>
> At the start:
>
> -> <stat> all runs false positives: 15
> -> <stat> all runs false negatives: 7
> -> <stat> all runs unsure: 189
> Standard Cost: $194.80
> Flex Cost: $607.41
> Delayed-Standard Cost: $98.80
> Delayed-Flex Cost: $310.05
> x=0.4990 p=0.1002 s=0.4537 310.05
>
> And near the end:
>
> -> <stat> all runs false positives: 5
> -> <stat> all runs false negatives: 6
> -> <stat> all runs unsure: 342
> -> <stat> all runs false positive %: 0.03125
> -> <stat> all runs false negative %: 0.103448275862
> -> <stat> all runs unsure %: 1.56880733945
> -> <stat> all runs cost: $124.40
> Standard Cost: $124.40
> Flex Cost: $589.16
> Delayed-Standard Cost: $98.60
> Delayed-Flex Cost: $212.28
> x=0.3515 p=0.2861 s=0.2467 212.28
>
> At this stage it actually managed to get the delayed standard cost
> lower by $0.20 (it has been higher than the starting value during much
> of the optimization). The Delayed-Flex cost is lowered by about 30%.
> But look at the hugely different parameters it had to use! Can someone
> else run with these parameters and confirm that this is an extreme
> that is only warranted by my particular corpora?
I can try <wink>. Here's a 10-fold CV with 6K random ham and 6K random spam
from my c.l.py test data; baseline on the left, while the right has
[Classifier]
unknown_word_prob: 0.3515
minimum_prob_strength: 0.2861
unknown_word_strength: 0.2467
filename:       base      simp
ham:spam:  6000:6000 6000:6000
fp total:          2         1
fp %:           0.03      0.02
fn total:          0         0
fn %:           0.00      0.00
unsure t:         46       101
unsure %:       0.38      0.84
real cost:    $29.20    $30.20
best cost:    $12.80    $11.80
h mean:         0.42      0.71
h sdev:         3.65      4.81
s mean:        99.96     99.89
s sdev:         1.21      1.94
mean diff:     99.54     99.18
k:             20.48     14.69
It did a little better here too. The best-cost analyses show that it's also
nuking FP at the expense of unsures:
base:
-> best cost for all runs: $12.80
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.52 & 0.95
-> fp 1; fn 1; unsure ham 2; unsure spam 7
-> fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
-> largest ham & spam cutoffs 0.525 & 0.95
-> fp 1; fn 1; unsure ham 2; unsure spam 7
-> fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
simp:
-> best cost for all runs: $11.80
-> achieved at ham & spam cutoffs 0.495 & 0.995
-> fp 0; fn 0; unsure ham 10; unsure spam 49
-> fp rate 0%; fn rate 0%; unsure rate 0.492%
> Please note that to get a delayed flex cost that is this much lower
> actually means that in the unsure area there is "50% more order" than
> before the optimization!
>
> At some point Tim (was it you?) reported that in other optimization
> techniques it has proven to be very bad to "focus" on the persistent
> and hopeless fp/fn messages. I fear the same problem might bite here.
Ya, I reported that from a paper wrestling with boosting, but it's a common
observation. Even in simple settings! Say you're doing a least-squares
linear regression on this data:
  x   f(x)
  -   ----
  1    1.9
  2    4.1
  3    5.9
  4  -10.0
  5   10.1
  6   12.1
  7   13.8
If you throw out (4, -10), you get an excellent fit to everything that
remains. If you leave it in, you still get "an answer", but it's not a good
fit to anything. A 6th-degree polynomial fits all the data perfectly, but
the resulting snaky curve is almost certainly a terrible fit to the
population from which this sample was taken. A few spam and ham are just
unlike their brethren, but from what I've seen of those, no mechanical
gimmick is going to classify them correctly. Give up and be happy <wink>.
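Working the numbers makes the point vivid; a minimal least-squares sketch in plain Python (the helper names are just for illustration):

```python
def lsq_fit(points):
    """Ordinary least-squares line y = a*x + b through (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return a, my - a * mx

def rms_residual(fit, points):
    """Root-mean-square error of the fitted line over the given points."""
    a, b = fit
    return (sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)) ** 0.5

data = [(1, 1.9), (2, 4.1), (3, 5.9), (4, -10.0),
        (5, 10.1), (6, 12.1), (7, 13.8)]
good = [p for p in data if p != (4, -10.0)]

fit_all  = lsq_fit(data)    # outlier left in
fit_good = lsq_fit(good)    # outlier thrown out

# How well does each line fit the six well-behaved points?
rms_with    = rms_residual(fit_all, good)
rms_without = rms_residual(fit_good, good)
```

With the outlier removed, the slope comes out very close to 2 and the RMS error on the remaining points is about 0.12. Leave it in and, because it sits at the mean x, the slope barely moves, but the whole line is dragged down by roughly 2.6 units, and the RMS error on those same six points balloons to about 2.6, twenty-odd times worse.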
> I just started another optimization run, but lowered the cost of a fp
> from $10 to $2, and introduced another cost function that I called
> flex**2 cost because it changes the cost function for an unsure message
> from a linear function to a square function. Oops, two changes at the
> same time; but it takes such a long time to run....
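For what it's worth, here's my guess at the shape of such a cost function. The function, its defaults, and the normalization are all hypothetical, not the CostCounter code; the point of squaring is that messages just inside a cutoff become nearly free, so the optimizer concentrates on scores stuck deep in the unsure band:

```python
def unsure_cost(score, is_spam, ham_cut=0.2, spam_cut=0.9,
                fp_cost=10.0, fn_cost=1.0, power=1):
    """Hypothetical flex-style cost for one unsure message: an unsure ham
    costs more (up to fp_cost) the closer its score is to spam_cut, and
    symmetrically for an unsure spam.  power=1 is the linear flex cost;
    power=2 is the proposed flex**2 variant."""
    if not ham_cut < score < spam_cut:
        return 0.0  # confidently classified; not charged here
    frac = (score - ham_cut) / (spam_cut - ham_cut)
    if is_spam:
        return fn_cost * (1.0 - frac) ** power
    return fp_cost * frac ** power
```

Under these (made-up) cutoffs, an unsure ham at score 0.3 costs about $1.43 linearly but only about $0.20 squared, so the squared version mostly stops punishing near-miss messages and rewards cleaning up the middle of the band instead.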
When I try a new thing, I usually start with several runs but on *much* less
data per run. If at least 3 of 5 show the effect I was hoping for, I may
push on; but if 3 of 5 don't, I either give up on it, or change the rules to
4 of 7 (if I'm really in love with the idea <wink>).
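There's real arithmetic behind that confession. Assuming a do-nothing change is equally likely to look better or worse on any given small run (a coin flip), the chance of passing either rule by luck alone is easy to compute:

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance a do-nothing change
    still "shows the effect" in at least k of n independent runs."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

p_3_of_5 = p_at_least(3, 5)
p_4_of_7 = p_at_least(4, 7)
```

Both come out to exactly 0.5 (k is the midpoint of an odd number of flips), so neither rule is evidence on its own; it only earns its keep combined with an honest prior about whether the idea should work at all.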
it's-almost-impossible-not-to-cheat-sometimes-ly y'rs - tim