[Spambayes] train-to-exhaustion questions
David Abrahams
dave at boost-consulting.com
Fri Apr 27 00:19:59 CEST 2007
on Thu Apr 26 2007, skip-AT-pobox.com wrote:
> David> 1. A recent training run went like this:
>
> David> round: 1, msgs: 690, ham misses: 61, spam misses: 210, 176.3s
> David> round: 2, msgs: 690, ham misses: 8, spam misses: 53, 165.6s
> David> round: 3, msgs: 690, ham misses: 1, spam misses: 7, 159.6s
> David> round: 4, msgs: 690, ham misses: 1, spam misses: 2, 159.6s
> David> round: 5, msgs: 690, ham misses: 0, spam misses: 1, 157.8s
> David> round: 6, msgs: 690, ham misses: 1, spam misses: 1, 160.9s
> David> round: 7, msgs: 690, ham misses: 0, spam misses: 1, 211.0s
> David> round: 8, msgs: 690, ham misses: 0, spam misses: 1, 172.6s
> David> round: 9, msgs: 690, ham misses: 0, spam misses: 1, 197.1s
> David> round: 10, msgs: 690, ham misses: 1, spam misses: 1, 174.6s
>
> David> It seems that the results got *worse* in rounds 6 and 10. Am I
> David> misinterpreting this? Are these expected results?
>
> I would look through your log files, probably near the end, to see if there
> is some message that is either a mistake
Yep, on closer inspection there was a mistake (or two).
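A regression like the one in rounds 6 and 10 can be picked out of a long log mechanically. The following is a minimal sketch, not part of tte.py: it assumes the log lines look exactly like the ones quoted above, and the names ROUND_RE, parse_rounds and regressions are made up for illustration.

```python
import re

# Matches lines such as:
#   round: 6, msgs: 690, ham misses: 1, spam misses: 1, 160.9s
# Any quoting prefix ("> David> ") before "round:" is simply skipped.
ROUND_RE = re.compile(
    r"round:\s*(\d+),\s*msgs:\s*(\d+),\s*ham misses:\s*(\d+),"
    r"\s*spam misses:\s*(\d+),\s*([\d.]+)s"
)

def parse_rounds(text):
    """Return a list of (round_number, total_misses) tuples."""
    rounds = []
    for m in ROUND_RE.finditer(text):
        number, _msgs, ham, spam, _secs = m.groups()
        rounds.append((int(number), int(ham) + int(spam)))
    return rounds

def regressions(rounds):
    """Return the round numbers whose total misses exceed the previous round's."""
    return [cur[0] for prev, cur in zip(rounds, rounds[1:]) if cur[1] > prev[1]]

LOG = """\
round: 1, msgs: 690, ham misses: 61, spam misses: 210, 176.3s
round: 2, msgs: 690, ham misses: 8, spam misses: 53, 165.6s
round: 3, msgs: 690, ham misses: 1, spam misses: 7, 159.6s
round: 4, msgs: 690, ham misses: 1, spam misses: 2, 159.6s
round: 5, msgs: 690, ham misses: 0, spam misses: 1, 157.8s
round: 6, msgs: 690, ham misses: 1, spam misses: 1, 160.9s
round: 7, msgs: 690, ham misses: 0, spam misses: 1, 211.0s
round: 8, msgs: 690, ham misses: 0, spam misses: 1, 172.6s
round: 9, msgs: 690, ham misses: 0, spam misses: 1, 197.1s
round: 10, msgs: 690, ham misses: 1, spam misses: 1, 174.6s
"""

if __name__ == "__main__":
    print(regressions(parse_rounds(LOG)))  # rounds 6 and 10 for this log
```

A round flagged this way is only a hint about where to look in the log; it does not say which message is the mislabeled one.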
> David> 2. I have about 350 each of ham and spam that I can use to train
> David> on. I'm sure that some of these messages are mostly redundant
> David> and add little or nothing of value to the training data. I
> David> don't want to waste time on them every time I do a training
> David> run. Is there some way to use tte.py to reduce my training
> David> set to the messages that actually make a difference?
>
> Many of them will train correctly the first time through. The tte script
> should not write them out at the end.
Oh, so *that's* what the -c option does! The documentation isn't very
clear on first reading: it seems to imply that -c saves the messages
that score correctly the first time through, when it actually discards
them (did I get that right?). I'd be happy to suggest less ambiguous
wording.
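If that reading is right, the culling behaviour can be sketched as below. This is an illustration of the idea discussed in this thread, not tte.py's actual code: ToyClassifier is a made-up stand-in for the real SpamBayes classifier, train_and_cull is a hypothetical name, and the cutoffs in the example are chosen to suit the toy scorer.

```python
from collections import defaultdict

class ToyClassifier:
    """Stand-in scorer (an assumption): average per-word spam frequency."""

    def __init__(self):
        self.spam = defaultdict(int)
        self.ham = defaultdict(int)

    def score(self, text):
        probs = []
        for word in text.split():
            s, h = self.spam[word], self.ham[word]
            probs.append(s / (s + h) if s + h else 0.5)  # unknown word: 0.5
        return sum(probs) / len(probs) if probs else 0.5

    def train(self, text, is_spam):
        counts = self.spam if is_spam else self.ham
        for word in text.split():
            counts[word] += 1

def train_and_cull(messages, clf, ham_cutoff, spam_cutoff, max_rounds=10):
    """Train only on misses; return the messages that were ever trained on.

    messages is a list of (text, is_spam) pairs. A message that scores
    correctly in every round is never trained on and is left out of the
    returned list, i.e. it is culled.
    """
    used = set()
    for _ in range(max_rounds):
        misses = 0
        for i, (text, is_spam) in enumerate(messages):
            s = clf.score(text)
            missed = s < spam_cutoff if is_spam else s > ham_cutoff
            if missed:
                clf.train(text, is_spam)
                used.add(i)
                misses += 1
        if misses == 0:  # exhausted: a full round with nothing left to learn
            break
    return [messages[i] for i in sorted(used)]

if __name__ == "__main__":
    msgs = [("buy pills", True), ("meeting notes", False), ("buy pills now", True)]
    kept = train_and_cull(msgs, ToyClassifier(), ham_cutoff=0.2, spam_cutoff=0.8)
    print(kept)  # the third message is redundant here and is dropped
```

In this toy run the third message already scores as spam once the first has been trained, so it never contributes and would not be written to the culled file.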
> I run tte.py via a shell script wrapper called "tte" (attached) and
> currently run it like so:
>
> cd ~/tmp
> mv newham.old.cull newham.old
> mv newspam.old.cull newspam.old
> touch newham
> touch newspam
> HC=0.02 SC=0.98 RATIO=1:1 tte
>
> new{ham,spam}.old.cull are the files written by tte itself.
You mean, as a result of passing '-c'?
> newham and newspam are the messages I've saved from my mailer since
> the last run.
Meaning, the new unsure (and, god forbid, misclassified) messages?
OK, I'll look at this; thanks. Probably a few more changes are
warranted to support this method for IMAP users.
Here's another question: is the ratio argument really the best
interface? Seems to me that if you keep the number of hams and spams
very close to one another, specifying a ratio that uses all the
training data is very difficult (you have to count all the messages
manually). Wouldn't it be better to have an --unbalanced argument
that automatically counts and causes all the training data to get used?
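The counting step is easy to automate outside tte.py in the meantime. A minimal sketch, assuming the training data lives in mbox files; count_messages and reduced_ratio are hypothetical helpers, and the order in which tte.py expects the two numbers should be checked against its usage text before passing them on.

```python
import mailbox
from math import gcd

def count_messages(path):
    """Count the messages in an existing mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()

def reduced_ratio(n_ham, n_spam):
    """Return (ham, spam) counts reduced to lowest terms."""
    if n_ham <= 0 or n_spam <= 0:
        raise ValueError("both mailboxes must contain at least one message")
    g = gcd(n_ham, n_spam)
    return n_ham // g, n_spam // g

if __name__ == "__main__":
    import sys
    # Usage (hypothetical script name): python ratio.py newham.old newspam.old
    ham_path, spam_path = sys.argv[1:3]
    ham, spam = reduced_ratio(count_messages(ham_path), count_messages(spam_path))
    print(f"ham:spam = {ham}:{spam}")
```

With 350 hams and 345 spams, for example, the reduced pair is 70 and 69.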
And another: the purpose of the -R argument wasn't clear to me, but I
started using it on the assumption that when things get slightly
out-of-balance I was likely to miss training on the newest data if the
algorithm started at the beginning of the mailbox. Is that the
intended use?
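To make the assumption concrete: if only part of a mailbox is used in a run, the direction of traversal decides which part that is. The sketch below illustrates that assumption only; it says nothing about what -R actually does inside tte.py, and pick_training_slice is a made-up name.

```python
def pick_training_slice(messages, limit, reverse=False):
    """Return the first `limit` messages in traversal order.

    messages is assumed to be ordered oldest to newest, as in a typical
    mbox. With reverse=True the traversal starts at the newest message.
    """
    ordered = messages[::-1] if reverse else list(messages)
    return ordered[:limit]

if __name__ == "__main__":
    spam = ["spam-jan", "spam-feb", "spam-mar", "spam-apr"]
    print(pick_training_slice(spam, 2))                # oldest two
    print(pick_training_slice(spam, 2, reverse=True))  # newest two
```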
--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
Don't Miss BoostCon 2007! ==> http://www.boostcon.com