[Spambayes] Better optimization loop

Rob Hooft rob@hooft.net
Tue Nov 19 17:47:02 2002


Tim Peters wrote:
> [Rob Hooft, simplifying simplex]
> 
>>...
>>I decided that we have a perfect way to optimize the ham and spam
>>cutoff values in timcv already, so that I can remove these from the
>>simplex optimization.
> 
> 
> Good observation!  That should help.  simplex isn't fast in the best of
> cases, and in this case ...

Anyone who has a faster optimization algorithm lying around is welcome 
to replace my Simplex code.
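
For what it's worth, here is a minimal sketch of how a generic 
gradient-free optimizer could be dropped in. This is not the actual 
spambayes driver; the cost() body below is only a stand-in for a full 
timcv run, and the starting values are illustrative:

import numpy as np
from scipy.optimize import minimize

def cost(params):
    # Stand-in objective: in reality this would run a cross-validation
    # pass with these parameter values and return the resulting dollar cost.
    unknown_word_prob, minimum_prob_strength, unknown_word_strength = params
    return ((unknown_word_prob - 0.5) ** 2
            + (minimum_prob_strength - 0.3) ** 2
            + (unknown_word_strength - 0.2) ** 2)  # dummy quadratic surface

start = np.array([0.5, 0.1, 0.45])   # rough starting guesses, illustrative only
result = minimize(cost, start, method="Powell", options={"xtol": 1e-3})
print(result.x, result.fun)

Powell's method is gradient-free like the simplex but often needs fewer 
function evaluations, which matters when every evaluation is a full 
cross-validation run.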

>>To that goal I added a "delayed" flexcost to the CostCounter module
>>that can use the optimal cutoffs calculated at the end of timcv.py.
> 
> Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of 0.99
> and spam_cutoff of 0.995 to get rid of "impossible" FP.

They are in any case better than any alternative I could think of. 
But if you disagree, you can change the order in which 
CostCounter.default() builds up the cost counters; the optimization 
always uses the last one.
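
To make that convention concrete, a hypothetical sketch of the "last 
counter wins" rule; the names and signatures are illustrative, not the 
real CostCounter interface:

def default_counters():
    # Each entry pairs a label with a cost function of the run results;
    # order matters because the optimizer minimizes the last entry.
    return [
        ("standard cost", lambda r: 10.0 * r["fp"] + 1.0 * r["fn"] + 0.2 * r["unsure"]),
        ("delayed flexcost", lambda r: r["flexcost"]),  # <- used by the optimizer
    ]

results = {"fp": 1, "fn": 1, "unsure": 9, "flexcost": 12.8}  # dummy numbers
label, objective = default_counters()[-1]  # the optimization always uses the last one
print(label, objective(results))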

> It did a little better here too.  The best-cost analyses show that it's also
> nuking FP at the expense of unsures:
> 
> base:
> 
> -> best cost for all runs: $12.80
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.52 & 0.95
> ->     fp 1; fn 1; unsure ham 2; unsure spam 7
> ->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
> -> largest ham & spam cutoffs 0.525 & 0.95
> ->     fp 1; fn 1; unsure ham 2; unsure spam 7
> ->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
> 
> simp:
> 
> -> best cost for all runs: $11.80
> -> achieved at ham & spam cutoffs 0.495 & 0.995
> ->     fp 0; fn 0; unsure ham 10; unsure spam 49
> ->     fp rate 0%; fn rate 0%; unsure rate 0.492%

Very similar to my case. I'm seriously thinking about removing the 
"hopeless" and "almost hopeless" messages from my corpora. I agree with 
the Bayesian statistics that they can't be correctly classified.

> Ya, I reported that from a paper wrestling with boosting, but it's a common
> observation.  Even in simple settings!  Say you're doing a least-squares
> linear regression on this data:
> 
> x  f(x)
> -  ----
> 1   1.9
> 2   4.1
> 3   5.9
> 4 -10.0
> 5  10.1
> 6  12.1
> 7  13.8
> 
> If you throw out (4, -10), you get an excellent fit to everything that
> remains.  If you leave it in, you still get "an answer", but it's not a good
> fit to anything.

Press et al. describe a "robust fit", which is not a least-squares fit 
but a least-absolute-deviations fit; it is insensitive to outliers.
Is there an analogous idea for us?
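
Just to illustrate with Tim's seven points (numpy/scipy, nothing to do 
with the spambayes code), here is an ordinary least-squares line versus 
a least-absolute-deviations line:

import numpy as np
from scipy.optimize import minimize

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([1.9, 4.1, 5.9, -10.0, 10.1, 12.1, 13.8])

# Ordinary least squares: minimize the sum of squared residuals.
slope_ls, intercept_ls = np.polyfit(x, y, 1)

# Least absolute deviations: minimize the sum of |residuals|.
def abs_dev(params):
    a, b = params
    return np.sum(np.abs(y - (a * x + b)))

slope_lad, intercept_lad = minimize(abs_dev, [slope_ls, intercept_ls],
                                    method="Nelder-Mead").x

print("least squares  : y = %.2f x + %.2f" % (slope_ls, intercept_ls))
print("least abs. dev.: y = %.2f x + %.2f" % (slope_lad, intercept_lad))

The LAD line stays close to y = 2x, essentially ignoring the (4, -10) 
outlier, while the least-squares line is dragged toward it and fits 
none of the points well.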

> When I try a new thing, I usually start with several runs but on *much* less
> data per run.  If at least 3 of 5 show the effect I was hoping for, I may
> push on; but if 3 of 5 don't, I either give up on it, or change the rules to
> 4 of 7 (if I'm really in love with the idea <wink>).

These optimizations are very sensitive to step functions in the cost 
surface, so I need lots of data to run them. With a small data set the 
simplex will stop wherever you start it.
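
A toy illustration (not spambayes code) of what I mean: with only a 
handful of messages the cost is piecewise constant in the cutoffs, so 
the simplex sees a flat plateau around its start point and gives up 
there.

import numpy as np
from scipy.optimize import minimize

scores = np.array([0.01, 0.03, 0.20, 0.55, 0.80, 0.97, 0.99])  # toy message scores
is_spam = np.array([0, 0, 0, 1, 1, 1, 1], dtype=bool)

def cost(cutoffs, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    ham_cutoff, spam_cutoff = cutoffs
    fp = np.sum(~is_spam & (scores > spam_cutoff))        # ham called spam
    fn = np.sum(is_spam & (scores < ham_cutoff))          # spam called ham
    unsure = np.sum((scores >= ham_cutoff) & (scores <= spam_cutoff))
    return fp_cost * fp + fn_cost * fn + unsure_cost * unsure

# Cutoffs of roughly (0.30, 0.50) would cost nothing here, but tiny moves
# around the start point change nothing, so Nelder-Mead typically terminates
# essentially where it began, still paying for one unsure message.
res = minimize(cost, x0=[0.30, 0.70], method="Nelder-Mead")
print(res.x, res.fun)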

Further results I obtained: my idea of running with an FP cost of $2 
and a square cost function didn't work; it doesn't optimize to a 
consistent position. Increasing the cost of an FP back to $10 and 
running with the same square function did do a reasonable job; it 
optimized to:

[Classifier]
unknown_word_prob = 0.520415
minimum_prob_strength = 0.315104
unknown_word_strength = 0.215393

So unknown_word_prob is back to 0.5 again! I just committed my changes 
to the optimization code; any hints on improvements are welcome.
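
For the record, here is one possible shape for such a square cost. 
This is illustrative only, not the code I committed; it assumes the 
charge grows quadratically with how far a score lands on the wrong 
side of its cutoff:

def square_flexcost(scores, is_spam, ham_cutoff, spam_cutoff,
                    fp_cost=10.0, fn_cost=1.0):
    # Illustrative only: charge each misplaced score quadratically by how
    # far it sits on the wrong side of its cutoff, with ham mistakes
    # weighted at the FP cost and spam mistakes at the FN cost.
    total = 0.0
    for score, spam in zip(scores, is_spam):
        if spam and score < spam_cutoff:
            total += fn_cost * ((spam_cutoff - score) / spam_cutoff) ** 2
        elif not spam and score > ham_cutoff:
            total += fp_cost * ((score - ham_cutoff) / (1.0 - ham_cutoff)) ** 2
    return total

print(square_flexcost([0.02, 0.60, 0.97], [False, False, True], 0.20, 0.90))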

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/



