[Spambayes] Does SpamBayes support automatic selective training?

Jesse Pelton jsp at PKC.com
Thu Jan 10 13:54:58 CET 2008


It's usually just a general sense that filtering is no longer as
effective as it once was, often combined with a (relatively) high ratio
of spam to ham in the training database and/or a relatively high number
of messages in the database.  For instance, I just tossed my training
database a couple of days ago, when it had a few hundred messages and a
spam:ham ratio of about 3:1.  I'm now getting filtering results that are
almost as good with 7 ham and 5 spam in the database, and I expect them
to improve enough that I'll be ahead of where I was within a day or two.
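
The rule of thumb above could be sketched roughly as follows. This is
only an illustration, not anything SpamBayes provides: the function
name, its parameters, and all three thresholds are hypothetical, with
the numbers loosely taken from the example in this message ("a few
hundred messages", "about 3:1").

```python
def should_retrain(nham, nspam, max_ratio=3.0, max_total=300, min_total=20):
    """Guess whether a training database is a candidate for a fresh start.

    nham and nspam are the counts of trained ham and spam messages.
    All thresholds are illustrative assumptions, not SpamBayes defaults.
    """
    total = nham + nspam
    if total < min_total:
        # A tiny database has just been started; leave it alone.
        return False
    # Avoid dividing by zero when no ham has been trained yet.
    spam_to_ham = nspam / max(nham, 1)
    return spam_to_ham >= max_ratio or total >= max_total


# A few hundred messages at about 3:1 spam:ham (counts are made up).
print(should_retrain(nham=100, nspam=300))
# The freshly restarted database: 7 ham, 5 spam.
print(should_retrain(nham=7, nspam=5))
```

A check like this only flags the database; the actual trigger described
here is still the subjective sense that filtering has gotten worse.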

All very subjective, seat-of-the-pants, and possibly delusional.  If I
had the time, interest, and expertise, it might be interesting to
quantify my results, but I'm just an Outlook plug-in user trying to make
my mail stream manageable.  I've managed to keep myself convinced that
this approach is working for several years now, though.

As for my perverse pleasure, that stems from marveling at how quickly
SpamBayes learns, from keeping things lean, and, once the new database
has caught up, from the sense that I'm spending less time manually
classifying messages.

I wasn't necessarily recommending that Ram trash his training
periodically, though.  I just wanted to make the point that a small set
of really good data may be better than a big set of data of questionable
quality, and to suggest that he try incremental training before trying
to figure out how to turn his existing set of messages into an effective
training corpus.

-----Original Message-----
From: spambayes-bounces+jsp=pkc.com at python.org
[mailto:spambayes-bounces+jsp=pkc.com at python.org] On Behalf Of David Abrahams
Sent: Wednesday, January 09, 2008 7:04 PM
To: spambayes at python.org
Subject: Re: [Spambayes] Does SpamBayes support automatic selective training?


on Thu Jan 03 2008, "Jesse Pelton" <jsp-AT-PKC.com> wrote:

> Do you have reason to believe that incremental training on messages that
> you're currently receiving would be ineffective?  I retrain from scratch
> periodically, and I generally find that a remarkably small corpus (maybe
> a total of a couple of dozen messages trained) is effective.  I retrain in
> part because I suspect that the content of spam that I receive changes
> over time, so training performed on messages from the distant past (say,
> six months ago) may be irrelevant or worse for my current message
> stream.
>
> One of the counter-intuitive things about SpamBayes is how little data
> it needs to go on.  This makes retraining fast, easy, and (for me, at
> least) perversely rewarding.

Sorry if this sounds combative; I'm really just trying to understand.

What makes you decide to retrain, if it's working so well?  Do you
just do it prophylactically, like brushing your teeth?  If so, then
you probably don't see it improving things much (like brushing your
teeth).  In that case, what makes it rewarding?

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com


