[Spambayes] train-to-exhaustion questions

David Abrahams dave at boost-consulting.com
Fri Apr 27 00:19:59 CEST 2007


on Thu Apr 26 2007, skip-AT-pobox.com wrote:

>     David> 1. A recent training run went like this:
>
>     David>   round:  1, msgs:  690, ham misses:  61, spam misses: 210, 176.3s
>     David>   round:  2, msgs:  690, ham misses:   8, spam misses:  53, 165.6s
>     David>   round:  3, msgs:  690, ham misses:   1, spam misses:   7, 159.6s
>     David>   round:  4, msgs:  690, ham misses:   1, spam misses:   2, 159.6s
>     David>   round:  5, msgs:  690, ham misses:   0, spam misses:   1, 157.8s
>     David>   round:  6, msgs:  690, ham misses:   1, spam misses:   1, 160.9s
>     David>   round:  7, msgs:  690, ham misses:   0, spam misses:   1, 211.0s
>     David>   round:  8, msgs:  690, ham misses:   0, spam misses:   1, 172.6s
>     David>   round:  9, msgs:  690, ham misses:   0, spam misses:   1, 197.1s
>     David>   round: 10, msgs:  690, ham misses:   1, spam misses:   1, 174.6s
>
>     David>   It seems that the results got *worse* in rounds 6 and 10.  Am I
>     David>   misinterpreting this?  Are these expected results?
>
> I would look through your log files, probably near the end, to see if there
> is some message that is either a mistake 

Yep, there was a mistake (or two) upon closer inspection.
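
For anyone following along, here's roughly the loop I understand "train
to exhaustion" to mean, as a toy sketch (the classifier and names here
are made up for illustration; this is not tte.py's actual code):

```python
from collections import Counter

class ToyClassifier:
    """A deliberately naive stand-in for the real SpamBayes classifier."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()

    def train(self, tokens, is_spam):
        # Add the message's tokens to the appropriate corpus.
        (self.spam if is_spam else self.ham).update(tokens)

    def is_spam(self, tokens):
        # Crude score: whichever corpus has seen more of these tokens wins.
        return sum(self.spam[t] for t in tokens) > sum(self.ham[t] for t in tokens)

def train_to_exhaustion(classifier, messages, max_rounds=10):
    """Repeatedly score every message, training on each miss, until a
    round produces no misses (or we give up after max_rounds)."""
    for round_no in range(1, max_rounds + 1):
        misses = 0
        for tokens, is_spam in messages:
            if classifier.is_spam(tokens) != is_spam:
                classifier.train(tokens, is_spam)
                misses += 1
        print(f"round: {round_no:2d}, msgs: {len(messages)}, misses: {misses}")
        if misses == 0:
            break
    return classifier
```

The per-round miss counts should generally shrink, but (as the log above
shows) a stubborn or mislabeled message can keep a later round from
reaching zero.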

>     David> 2. I have about 350 each of ham and spam that I can use to train
>     David>    on.  I'm sure that some of these messages are mostly redundant
>     David>    and add little or nothing of value to the training data.  I
>     David>    don't want to waste time on them every time I do a training
>     David>    run.  Is there some way to use tte.py to reduce my training
>     David>    set to the messages that actually make a difference?
>
> Many of them will train correctly the first time through.  The tte script
> should not write them out at the end.

Oh, so *that's* what the -c option does!  The documentation isn't very
clear on first reading; it seems to imply that it saves the ones that
score correctly the first time through, when actually it discards them
(did I get that right?).  I'd be happy to suggest less-ambiguous
wording.
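
In other words, my reading of the culling step is something like this
(a hypothetical helper for illustration, not tte.py's real code):

```python
def cull(messages, scores_correctly):
    """Sketch of my understanding of -c: messages the filter already
    scores correctly on the first pass are discarded, and only the
    messages that actually made a difference get written back out."""
    return [m for m in messages if not scores_correctly(m)]
```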

> I run tte.py via a shell script wrapper called "tte" (attached) and
> currently run it like so:
>
>     cd ~/tmp
>     mv newham.old.cull newham.old
>     mv newspam.old.cull newspam.old
>     touch newham
>     touch newspam
>     HC=0.02 SC=0.98 RATIO=1:1 tte
>
> new{ham,spam}.old.cull are the files written by tte itself.  

You mean, as a result of passing '-c'?

> newham and newspam are the messages I've saved from my mailer since
> the last run.

Meaning, the new unsure (and, god forbid, misclassified) messages?

OK, I'll look at this; thanks.  Probably a few more changes are
warranted to support this method for IMAP users.

Here's another question: is the ratio argument really the best
interface?  It seems to me that if you keep the numbers of hams and
spams very close to one another, specifying a ratio that uses all the
training data is very difficult (you have to count all the messages
manually).  Wouldn't it be better to have an --unbalanced option that
counts the messages automatically and uses all the training data?
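
The counting part is cheap; with the standard library's mailbox module
it amounts to something like this (hypothetical function name, just to
sketch the proposed --unbalanced behavior):

```python
import mailbox

def mailbox_counts(ham_path, spam_path):
    """Count the messages in each mbox so a run can use all of the
    training data without counting by hand (sketch of a hypothetical
    --unbalanced option)."""
    return len(mailbox.mbox(ham_path)), len(mailbox.mbox(spam_path))
```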

And another: the purpose of the -R argument wasn't clear to me, but I
started using it on the assumption that when things get slightly
out-of-balance I was likely to miss training on the newest data if the
algorithm started at the beginning of the mailbox.  Is that the
intended use?
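
For what it's worth, the behavior I'm assuming -R gives is just
iteration from the newest end of the mailbox; expressed as a sketch
(a guess, not the actual implementation):

```python
def training_sequence(messages, newest_first=False):
    """Return messages in training order; with newest_first (my guess
    at what -R does), start from the end of the mailbox so the most
    recent mail is trained on first."""
    return list(reversed(messages)) if newest_first else list(messages)
```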

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Don't Miss BoostCon 2007! ==> http://www.boostcon.com
