[Spambayes] Better optimization loop

Tim Peters tim.one@comcast.net
Tue Nov 19 02:54:09 2002


[Rob Hooft, simplifying simplex]
> ...
> I decided that we have a perfect way to optimize the ham and spam
> cutoff values in timcv already, so that I can remove these from the
> simplex optimization.

Good observation!  That should help.  simplex isn't fast in the best of
cases, and in this case ...
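
If anyone wants the shape of the remaining loop, here's a minimal sketch of
driving the three surviving knobs with a Nelder-Mead simplex (SciPy's fmin).
The objective below is a smooth stand-in I made up; the real objective is a
full timcv run, which is exactly why every simplex step hurts:

    # Sketch: simplex over the three remaining classifier knobs.
    # evaluate_cost is a stand-in; a real version would configure the
    # classifier, run a full timcv cross-validation, and return the cost.
    from scipy.optimize import fmin  # Nelder-Mead simplex

    def evaluate_cost(params):
        x, p, s = params  # unknown_word_prob, minimum_prob_strength,
                          # unknown_word_strength
        # Toy bowl with a minimum near Rob's final parameters:
        return 212.0 + 500.0 * ((x - 0.35)**2 + (p - 0.29)**2 +
                                (s - 0.25)**2)

    # Start from the near-default point Rob quotes
    # (x=0.4990 p=0.1002 s=0.4537):
    best = fmin(evaluate_cost, [0.4990, 0.1002, 0.4537],
                xtol=1e-3, ftol=0.01)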

> To that goal I added a "delayed" flexcost to the CostCounter module
> that can use the optimal cutoffs calculated at the end of timcv.py.

Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of 0.99
and spam_cutoff of 0.995 to get rid of "impossible" FP.

> And there are only three variables left to optimize using simplex
>
> I then ran one optimization on my complete (16000+5800) corpus. The
> result is that it is fighting very hard to remove fp's while
> introducing lots of unsure messages:
>
> At the start:
>
> -> <stat> all runs false positives: 15
> -> <stat> all runs false negatives: 7
> -> <stat> all runs unsure: 189
> Standard Cost: $194.80
> Flex Cost: $607.41
> Delayed-Standard Cost: $98.80
> Delayed-Flex Cost: $310.05
> x=0.4990 p=0.1002 s=0.4537 310.05
>
> And near the end:
>
> -> <stat> all runs false positives: 5
> -> <stat> all runs false negatives: 6
> -> <stat> all runs unsure: 342
> -> <stat> all runs false positive %: 0.03125
> -> <stat> all runs false negative %: 0.103448275862
> -> <stat> all runs unsure %: 1.56880733945
> -> <stat> all runs cost: $124.40
> Standard Cost: $124.40
> Flex Cost: $589.16
> Delayed-Standard Cost: $98.60
> Delayed-Flex Cost: $212.28
> x=0.3515 p=0.2861 s=0.2467 212.28
>
> At this stage it actually managed to get the delayed standard cost
> lower by $0.20 (it has been higher than the starting value during much
> of the optimization). The Delayed-Flex cost is lowered by about 30%.
> But look at the hugely different parameters it had to use! Can someone
> else run with these parameters and confirm that this is an extreme
> that is only warranted by my particular corpora?

I can try <wink>.  Here's a 10-fold CV with 6K random ham and 6K random spam
from my c.l.py test data; the baseline is on the left, and the right-hand
column uses

[Classifier]
unknown_word_prob: 0.3515
minimum_prob_strength: 0.2861
unknown_word_strength: 0.2467

filename:     base    simp
ham:spam:  6000:6000  6000:6000
fp total:        2       1
fp %:         0.03    0.02
fn total:        0       0
fn %:         0.00    0.00
unsure t:       46     101
unsure %:     0.38    0.84
real cost:  $29.20  $30.20
best cost:  $12.80  $11.80
h mean:       0.42    0.71
h sdev:       3.65    4.81
s mean:      99.96   99.89
s sdev:       1.21    1.94
mean diff:   99.54   99.18
k:           20.48   14.69
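
In case anyone hasn't poked at those three options:  they feed Gary
Robinson's adjustment of each word's spam probability.  A rough sketch of
where they enter (the real classifier code differs in details):

    # Rough sketch of where the three tuned knobs enter the classifier,
    # for a word seen in at least one message.
    # x = unknown_word_prob: assumed probability for words with no history.
    # s = unknown_word_strength: how much weight that assumption carries.
    def adjusted_spamprob(hamcount, spamcount, nham, nspam,
                          x=0.5, s=0.45):
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        prob = spamratio / (hamratio + spamratio)  # raw per-word estimate
        n = hamcount + spamcount                   # evidence for this word
        # Shrink the raw estimate toward x; the less evidence, the harder.
        return (s * x + n * prob) / (s + n)

    # minimum_prob_strength then throws away words too close to neutral:
    def word_is_used(prob, minimum_prob_strength=0.1):
        return abs(prob - 0.5) >= minimum_prob_strength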

It did a little better here too.  The best-cost analyses show that it's also
nuking FP at the expense of unsures:

base:

-> best cost for all runs: $12.80
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.52 & 0.95
->     fp 1; fn 1; unsure ham 2; unsure spam 7
->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
-> largest ham & spam cutoffs 0.525 & 0.95
->     fp 1; fn 1; unsure ham 2; unsure spam 7
->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%

simp:

-> best cost for all runs: $11.80
-> achieved at ham & spam cutoffs 0.495 & 0.995
->     fp 0; fn 0; unsure ham 10; unsure spam 49
->     fp rate 0%; fn rate 0%; unsure rate 0.492%
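
For anyone checking the arithmetic, every dollar figure in this thread
follows from the standard weights ($10 per fp, $1 per fn, 20 cents per
unsure), and the best-cost analysis just sweeps cutoff pairs looking for the
cheapest sum.  A sketch (the real TestDriver bookkeeping differs in
details):

    # The standard cost model behind all the dollar figures above.
    def standard_cost(fp, fn, unsure,
                      fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
        return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

    # Matches the quoted runs:
    for counts in [(15, 7, 189), (5, 6, 342), (0, 0, 59)]:
        print("$%.2f" % standard_cost(*counts))
    # -> $194.80, $124.40, $11.80

    # Best-cost analysis: try every (ham_cutoff, spam_cutoff) pair and
    # keep the cheapest.  Scores are per-message spam probabilities.
    def best_cost(ham_scores, spam_scores, cutoffs):
        best = None
        for hc in cutoffs:
            for sc in cutoffs:
                if hc > sc:
                    continue
                fp = sum(1 for sco in ham_scores if sco >= sc)
                fn = sum(1 for sco in spam_scores if sco < hc)
                unsure = sum(1 for sco in ham_scores + spam_scores
                             if hc <= sco < sc)
                cost = standard_cost(fp, fn, unsure)
                if best is None or cost < best:
                    best = cost
        return best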

> Please note that getting a delayed flex cost this much lower actually
> means that there is "50% more order" in the unsure region than before
> the optimization!
>
> At some point Tim (was it you?) reported that in other optimization
> techniques it has proven very bad to "focus" on the persistent and
> hopeless fp/fn messages. I fear this might be hurting us here.

Ya, I reported that from a paper wrestling with boosting, but it's a common
observation.  Even in simple settings!  Say you're doing a least-squares
linear regression on this data:

x  f(x)
-  ----
1   1.9
2   4.1
3   5.9
4 -10.0
5  10.1
6  12.1
7  13.8

If you throw out (4, -10), you get an excellent fit to everything that
remains.  If you leave it in, you still get "an answer", but it's not a good
fit to anything.  A 6th-degree polynomial fits all the data perfectly, but
the resulting snaky curve is almost certainly a terrible fit to the
population from which this sample was taken.  A few spam and ham are just
unlike their brethren, but from what I've seen of those, no mechanical
gimmick is going to classify them correctly.  Give up and be happy <wink>.
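
To make that concrete, here's the least-squares fit over that table, with
and without the outlier:

    # Least-squares line fit over the table above, with and without (4, -10).
    def fit_line(points):
        n = float(len(points))
        mx = sum(x for x, y in points) / n
        my = sum(y for x, y in points) / n
        sxy = sum((x - mx) * (y - my) for x, y in points)
        sxx = sum((x - mx) ** 2 for x, y in points)
        slope = sxy / sxx
        return slope, my - slope * mx

    data = [(1, 1.9), (2, 4.1), (3, 5.9), (4, -10.0),
            (5, 10.1), (6, 12.1), (7, 13.8)]
    print("with outlier:    slope %.3f  intercept %+.3f" % fit_line(data))
    print("without outlier: slope %.3f  intercept %+.3f" %
          fit_line([p for p in data if p != (4, -10.0)]))
    # with outlier:    slope 1.996  intercept -2.571
    # without outlier: slope 1.996  intercept -0.002
    # Since x=4 happens to be the mean of the x's, the outlier can't even
    # tilt the line; it just drags the whole thing down ~2.6 units, so the
    # "answer" misses every good point by about that much.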

> I just started another optimization run, but lowered the cost of an fp
> from $10 to $2, and introduced another cost function that I called
> flex**2 cost because it changes the cost for an unsure message from a
> linear function to a quadratic one. Oops, two changes at the same
> time; but it takes such a long time to run....
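
For the record, here's roughly what I'd expect a flex**2 unsure cost to
look like.  The assumption is mine (CostCounter's exact bookkeeping may
differ):  the linear flex cost charges an unsure message by its fractional
depth into the unsure zone, so squaring that fraction makes barely-unsure
messages nearly free while deeply-unsure ones approach the full error cost:

    # Sketch of a "flex**2" cost for an unsure message, assuming the
    # linear flex cost charges by fractional depth into the unsure zone.
    def flex2_unsure_cost(score, is_spam,
                          ham_cutoff=0.2, spam_cutoff=0.9,
                          fp_cost=2.0,   # Rob's lowered fp cost
                          fn_cost=1.0):
        # 0.0 at ham_cutoff, 1.0 at spam_cutoff:
        frac = (score - ham_cutoff) / (spam_cutoff - ham_cutoff)
        if is_spam:
            # A spam scoring just above ham_cutoff is nearly an fn.
            return fn_cost * (1.0 - frac) ** 2
        else:
            # A ham scoring just below spam_cutoff is nearly an fp.
            return fp_cost * frac ** 2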

When I try a new thing, I usually start with several runs but on *much* less
data per run.  If at least 3 of 5 show the effect I was hoping for, I may
push on; but if 3 of 5 don't, I either give up on it, or change the rules to
4 of 7 (if I'm really in love with the idea <wink>).
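
One caution about that rule of thumb, and part of why the cheating is so
easy:  if the change actually does nothing and each small run is a coin
flip, "at least 3 of 5" and "at least 4 of 7" each pass exactly half the
time:

    # Chance of passing the rule of thumb when the change does nothing
    # (each run an independent coin flip):
    from math import comb

    def at_least(k, n):
        return sum(comb(n, i) for i in range(k, n + 1)) / 2.0 ** n

    print(at_least(3, 5))  # 0.5
    print(at_least(4, 7))  # 0.5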

it's-almost-impossible-not-to-cheat-sometimes-ly y'rs  - tim
