[Spambayes] train-to-exhaustion questions

David Abrahams dave at boost-consulting.com
Fri Apr 27 00:19:59 CEST 2007

on Thu Apr 26 2007, skip-AT-pobox.com wrote:

>     David> 1. A recent training run went like this:
>     David>   round:  1, msgs:  690, ham misses:  61, spam misses: 210, 176.3s
>     David>   round:  2, msgs:  690, ham misses:   8, spam misses:  53, 165.6s
>     David>   round:  3, msgs:  690, ham misses:   1, spam misses:   7, 159.6s
>     David>   round:  4, msgs:  690, ham misses:   1, spam misses:   2, 159.6s
>     David>   round:  5, msgs:  690, ham misses:   0, spam misses:   1, 157.8s
>     David>   round:  6, msgs:  690, ham misses:   1, spam misses:   1, 160.9s
>     David>   round:  7, msgs:  690, ham misses:   0, spam misses:   1, 211.0s
>     David>   round:  8, msgs:  690, ham misses:   0, spam misses:   1, 172.6s
>     David>   round:  9, msgs:  690, ham misses:   0, spam misses:   1, 197.1s
>     David>   round: 10, msgs:  690, ham misses:   1, spam misses:   1, 174.6s
>     David>   It seems that the results got *worse* in rounds 6 and 10.  Am I
>     David>   misinterpreting this?  Are these expected results?
> I would look through your log files, probably near the end, to see if there
> is some message that is either a mistake 

Yep, there was a mistake (or two) upon closer inspection.
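
For anyone who wants to check a long run without reading it by eye, here is a minimal sketch that flags the rounds where the total miss count went up. It assumes the log lines look exactly like the ones quoted above; parse_rounds and regressions are hypothetical helper names, not part of tte.py.

```python
import re

# Matches lines such as:
#   round:  6, msgs:  690, ham misses:   1, spam misses:   1, 160.9s
ROUND_RE = re.compile(
    r"round:\s*(\d+), msgs:\s*(\d+), ham misses:\s*(\d+), spam misses:\s*(\d+)"
)


def parse_rounds(log_text):
    """Return a list of (round, ham_misses, spam_misses) tuples."""
    rounds = []
    for line in log_text.splitlines():
        m = ROUND_RE.search(line)
        if m:
            rnd, _msgs, ham, spam = (int(g) for g in m.groups())
            rounds.append((rnd, ham, spam))
    return rounds


def regressions(rounds):
    """Return the round numbers whose total misses exceed the previous round's."""
    worse = []
    for (_, prev_ham, prev_spam), (rnd, ham, spam) in zip(rounds, rounds[1:]):
        if ham + spam > prev_ham + prev_spam:
            worse.append(rnd)
    return worse
```

Fed the ten lines quoted above, regressions(parse_rounds(log_text)) returns [6, 10], which matches what I saw by eye.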

>     David> 2. I have about 350 each of ham and spam that I can use to train
>     David>    on.  I'm sure that some of these messages are mostly redundant
>     David>    and add little or nothing of value to the training data.  I
>     David>    don't want to waste time on them every time I do a training
>     David>    run.  Is there some way to use tte.py to reduce my training
>     David>    set to the messages that actually make a difference?
> Many of them will train correctly the first time through.  The tte script
> should not write them out at the end.

Oh, so *that's* what the -c option does!  The documentation isn't very
clear on first reading; it seems to imply that it saves the ones that
score correctly the first time through, when actually, it discards
them (did I get that right)?  I'd be happy to suggest less-ambiguous

> I run tte.py via a shell script wrapper called "tte" (attached) and
> currently run it like so:
>     cd ~/tmp
>     mv newham.old.cull newham.old
>     mv newspam.old.cull newspam.old
>     touch newham
>     touch newspam
>     HC=0.02 SC=0.98 RATIO=1:1 tte
> new{ham,spam}.old.cull are the files written by tte itself.  

You mean, as a result of passing '-c'?

> newham and newspam are the messages I've saved from my mailer since
> the last run.

Meaning, the new unsure (and, god forbid, misclassified) messages?

OK, I'll look at this; thanks.  Probably a few more changes are
warranted to support this method for IMAP users.

Here's another question: is the ratio argument really the best
interface?  Seems to me that if you keep the number of hams and spams
very close to one another, specifying a ratio that uses all the
training data is very difficult (you have to count all the messages
manually).  Wouldn't it be better to have an --unbalanced argument
that automatically counts and causes all the training data to get used?
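
To make the counting concrete, here is a rough sketch of what such an option would have to do. It assumes the training files are Unix mbox files readable by Python's standard mailbox module; count_messages and ratio_string are hypothetical names, and nothing here is taken from tte.py itself. Which side of the ratio is ham and which is spam should be checked against the script's own usage text.

```python
import mailbox
from math import gcd


def count_messages(path):
    """Count the messages in an mbox file (assumed format) without creating it."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def ratio_string(first, second):
    """Reduce two message counts to the smallest equivalent 'a:b' string."""
    if first <= 0 or second <= 0:
        raise ValueError("both mailboxes must contain at least one message")
    d = gcd(first, second)
    return f"{first // d}:{second // d}"
```

With 352 messages in one file and 348 in the other, ratio_string(352, 348) gives "88:87", which is not something I'd want to work out by hand after every run.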

And another: the purpose of the -R argument wasn't clear to me, but I
started using it on the assumption that when things get slightly
out-of-balance I was likely to miss training on the newest data if the
algorithm started at the beginning of the mailbox.  Is that the
intended use?

Dave Abrahams
Boost Consulting

Don't Miss BoostCon 2007! ==> http://www.boostcon.com
