[spambayes-dev] RE: [Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Fri Dec 26 21:13:05 EST 2003


[Seth Goodman]
> ...
> Which place in the SpamBayes manager is the one that changes the
> config that export.py uses?  There are ham and spam folder
> specifications in more than one place:  filtering, training and
> watched folders at least, there may be more.

Training.  This will become clear when you run export.py, since it displays
the names of the folders it's exporting.  Don't hesitate to run export.py.
It doesn't change your .pst files in any way -- it's harmless, and the files
it creates can be thrown away at will.

> ...
> One thing I do that may or may not be typical is that I let Outlook
> rules take care of all the mailing list traffic.  That includes
> almost no spam and so I don't train or classify it (the list admins
> do a good job).  Therefore, I _don't_ include it in my ham corpus.
> This gives me a roughly 1:5 ham/spam corpus, instead of roughly even,
> but that's the mail stream that SpamBayes sees.

Yet it remains possible that the best training strategy for your mix
requires artificially forcing a particular ratio.  Picture an extreme:  if
your actual incoming ratio is a million to one, training in proportion to
the stream would leave the classifier knowing essentially nothing about
the rare class.
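
For instance (a sketch of one way to force a ratio -- the downsampling
approach and the helper name are my own illustration, nothing the tester
itself does):

    import random

    def force_ratio(ham, spam, seed=1):
        # Artificially balance the training data 1:1 by downsampling
        # whichever class is oversupplied.  Purely illustrative.
        n = min(len(ham), len(spam))
        rng = random.Random(seed)
        return rng.sample(ham, n), rng.sample(spam, n)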

> I _do_ make sure the training sets have equal numbers of messages.  At
> present, my corpus is about 7,500 messages total.  This may not be
> enough to "divide into ten sets", etc.  Or is it?

It's plenty.  The last multi-corpus "death match" experiments here required
that participants use exactly 10 sets of ham and 10 sets of spam, each set
having exactly 200 messages.  That's a grand total of 4,000 msgs.

However, it's not clear *what* to test anymore.  At the start, this project
was aimed at high-volume mailing lists, where the admins were thought most
likely to train on giant sets of ham and spam a few times per year.
Randomized cross-validation testing is a fine approach for that use.
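
To make that concrete, here's a minimal sketch of the 10-fold idea.  The
.train() and .score() methods are hypothetical stand-ins for a classifier
interface, not SpamBayes's actual API, and the single 0.5 cutoff ignores
the real ham/spam cutoffs and the "unsure" band between them:

    import random

    def split_into_sets(msgs, nsets=10, seed=1):
        # Shuffle once, then deal the messages into nsets buckets.
        msgs = list(msgs)
        random.Random(seed).shuffle(msgs)
        return [msgs[i::nsets] for i in range(nsets)]

    def cross_validate(ham_sets, spam_sets, make_classifier):
        errors = total = 0
        for i in range(len(ham_sets)):
            c = make_classifier()
            # Train on every set except the i'th ...
            for j in range(len(ham_sets)):
                if j == i:
                    continue
                for msg in ham_sets[j]:
                    c.train(msg, is_spam=False)
                for msg in spam_sets[j]:
                    c.train(msg, is_spam=True)
            # ... and score only the held-out i'th sets.
            for msg in ham_sets[i]:
                errors += c.score(msg) > 0.5   # false positive
                total += 1
            for msg in spam_sets[i]:
                errors += c.score(msg) <= 0.5  # false negative
                total += 1
        return errors / total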

There are apparently only a few people who use SpamBayes that way, though,
and among the rest of us no two seem to train in the same way.  Incremental
training, and preserving the order in which messages arrive, seem
overwhelmingly more interesting to most real users.
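
The flavor of an incremental tester, in the same hypothetical interface as
the sketch above:  walk the messages in arrival order, scoring each one
*before* training on it, so every prediction comes from a model that has
never seen that message:

    def incremental_test(messages, classifier):
        # messages: (msg, is_spam) pairs in arrival order.
        mistakes = []
        for n, (msg, is_spam) in enumerate(messages):
            if (classifier.score(msg) > 0.5) != is_spam:
                mistakes.append(n)
            classifier.train(msg, is_spam)
        # Where the mistakes cluster over time says more here than
        # a single overall error rate does.
        return mistakes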

So what may be more important now, building on Alex's incremental testers,
isn't the sheer number of messages so much as the span of time they cover.
Indeed, for new users, it's important to know how this filter behaves after
training on just a few messages.  That's my particular interest with the
experimental mixed unigram/bigram scheme:  the hope is that it "learns
faster".  In earlier tests, I never found anything that beat the pure
unigram scheme *given enough training data*, but few users have 20,000
recent ham and spam to start off with.
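
Roughly, the mixed scheme generates bigrams alongside the unigrams (a
sketch of the idea only -- the experimental tokenizer's details differ):

    def mixed_tokens(words):
        # Yield each word, plus a token for each adjacent pair.  With
        # little training data, a pair like "hot stock" can be a strong
        # clue before either word is damning on its own.
        prev = None
        for w in words:
            yield w
            if prev is not None:
                yield prev + " " + w
            prev = w

For example, list(mixed_tokens("buy hot stock now".split())) yields the
four words plus "buy hot", "hot stock", and "stock now".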

OTOH, I don't have an exhaustive personal email archive saved away to measure
anything other than how the system performs across a few days, and a scheme
that "learns fast" starting from nothing *may* also be slow to adapt to
changes over time (we all know a bright kid who never outgrew their
6th-grade worldview, right <wink>?).

Oh well.  There have always been more ideas to test than were possible to
cover.



