[Spambayes] More experiments with weaktest.py

Rob W.W. Hooft rob@hooft.net
Mon Nov 11 09:12:57 2002


Tim Peters wrote:
> [Rob Hooft]
>>...
>>Hm. That sounds so enthusiastic that I just might commit what I have
>>gone through this night.
> 
> 
> You did, and I thank you!  Note that there were already three Simplex pkgs
> linked from
> 
>     http://www.python.org/topics/scicomp/numbercrunching.html
> 
> but I know how much fun it is to write such stuff again <wink>.

Yeah, but on the other hand, all those people didn't have access to my 
module when they wrote theirs, because it wasn't publicized ;-) [Let me 
add that my optimize code dates from late 1997]

>>  * I designed a new "Flex cost" field. That one does away with the
>>    "unsure cost". The cost of a message is 0.0 at its own cutoff, and
>>    increases linearly towards its "false" cost at the other cutoff,
>>    and increases further to the other end. Hm. Unreadable.
> 
> 
> The code is clear enough, though.  What I didn't understand is why each term
> in the flexcost is divided by the difference between the (fixed per run)
> cutoff levels:   / (SPC - HC).  That seems to systematically penalize, e.g.,
> ham_cutoff=.4 and spam_cutoff=0.8 compared to ham_cutoff=0.1 and
> spam_cutoff=0.9 (the former divides every term by 0.4, the latter by 0.8).
> In the limit, if someone wanted a binary classifier (ham_cutoff ==
> spam_cutoff), any mistake would be charged an infinite penalty.

You're right.
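To make Tim's point concrete: because each term is divided by (SPC - HC), the same mistake is charged more when the cutoff gap is narrower. A minimal sketch of just the spam-side term (the function name and the $1 false-negative cost are illustrative, using the cutoff pairs from Tim's example):

```python
# Each flexcost term is divided by (spam_cutoff - ham_cutoff), so an
# identical spam scoring 0.0 is charged more when the gap narrows.
def spam_term(score, hc, spc, fn_cost=1.0):
    return fn_cost * (spc - score) / (spc - hc)

print(round(spam_term(0.0, 0.4, 0.8), 4))  # gap 0.4: charged 2.0
print(round(spam_term(0.0, 0.1, 0.9), 4))  # gap 0.8: charged 1.125
```

As the gap shrinks toward zero (ham_cutoff == spam_cutoff), the denominator vanishes and the charge blows up, which is the infinite-penalty limit Tim describes.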
> 
> 
>>A table:
>>
>>           Score    Spam with this   Ham with this
>>                      score costs     score costs
>>            0.00         $ 1.29          $ 0.00
> 
> 
> It's hard to see where that comes from.  Assuming ham_cutoff is 0.2 and
> spam_cutoff 0.9, and so a spam scoring 0.0 works out to $1 *
> (.9-0.0)/(.9-.2) ?

Yes.

> 
>>            0.20         $ 1.00          $ 0.00
>>            0.55         $ 0.50          $ 5.00
>>            0.90         $ 0.00          $10.00
>>            1.00         $ 0.00          $11.43

But you're right that it would be better to make:

            Score    Spam with this   Ham with this
                       score costs     score costs
             0.00         $ 1.00          $ 0.00
             0.20         $ 1.00          $ 0.00
             0.55         $ 0.50          $ 5.00
             0.90         $ 0.00          $10.00
             1.00         $ 0.00          $10.00

i.e. both functions consist of three linear segments rather than two.
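The three-segment version above could be sketched like this (a hypothetical rewrite, not the committed code, assuming the 0.2/0.9 cutoffs and the $1 false-negative / $10 false-positive costs used in this thread):

```python
# Piecewise-linear cost with three segments: flat outside the cutoffs,
# linear in between -- matching the corrected table above.
def flexcost(score, is_spam, ham_cutoff=0.2, spam_cutoff=0.9,
             fn_cost=1.0, fp_cost=10.0):
    # Fraction of the way from ham_cutoff to spam_cutoff, clamped to [0, 1]
    frac = (score - ham_cutoff) / (spam_cutoff - ham_cutoff)
    frac = min(1.0, max(0.0, frac))
    if is_spam:
        return fn_cost * (1.0 - frac)   # $1.00 at/below 0.2, $0 at/above 0.9
    else:
        return fp_cost * frac           # $0 at/below 0.2, $10.00 at/above 0.9

for s in (0.0, 0.2, 0.55, 0.9, 1.0):
    print(s, flexcost(s, True), flexcost(s, False))
```

Clamping the fraction to [0, 1] is what removes the runaway charges beyond the cutoffs while keeping the linear ramp in between.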

> Well, if ham_cutoff==spam_cutoff, then (as above) any mistake will cause a
> DivideByZero exception, so it's sure sensitive there <wink>.  I suspect it
> might work better if the "/(SPC-HC)" business were simply removed?

That would no longer satisfy the constraints I put in.

> I've been running weakloop.py over two sets of my c.l.py data while typing
> this.  That's 2*2000 = 4000 ham, and 2*1400 = 2800 spam, for 6800 total
> msgs.  It's been thru the whole business about 25 times now.  At the start,
> 
> Trained on 88 ham and 66 spam
> fp: 0 fn: 0
> Total cost: $30.80
> Flex cost: $212.3120
> x=0.5000 p=0.1000 s=0.4500 sc=0.900 hc=0.200 212.31
> 
> It's having a hard time doing better than that.  The best so far seems to be
> 
> Trained on 82 ham and 66 spam
> fp: 0 fn: 0
> Total cost: $29.60
> Flex cost: $200.0924
> x=0.5011 p=0.1026 s=0.4515 sc=0.901 hc=0.205 200.09
> 
> which is so close to the starting point that it's hard to believe it's
> finding something "real".  It *does* seem to be in a nasty local minimum,
> though, as the next attempt was:
> 
> Trained on 118 ham and 69 spam
> fp: 1 fn: 0
> Total cost: $47.20
> Flex cost: $344.7334
> x=0.4989 p=0.1038 s=0.4531 sc=0.900 hc=0.209 344.73
> 
> I'm afraid it looks like it's eventually going to converge on the most
> delicate possible settings that barely manage to avoid that 1 FP.

This is exactly what I found so far, even with my complete data set. It 
is too delicate to work. Now this could be due to two things:

  1. The flexcost is still causing lots of false minima
  2. The weaktest is causing lots of false minima

I suspect the latter, because it contains lots of "yes/no" decisions 
that may tumble the other way with minimal changes in the parameters.

My conclusion is to stop this, and try the optimization on something 
like timtest.py but with the flexcost as the target function. Or maybe 
change weaktest such that it trains on all messages in the process. That 
would simulate the "optimal" strategy of a user who has to start from 
nothing.

Rob


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/



