[Spambayes] question about Spambayes setup...

Fri Nov 21 01:30:48 EST 2003

[fred fredwin]
> Greetings !
> I recently came across SpamBayes and am very excited about using it
> (tried other spam filtering software, and it was a pretty big waste
> of time).  I've got it set up, and it works fine, but I want to
> know if anyone has any setup tips?

Well, everyone seems to work in a different way.  We haven't yet done
appropriate research on effective real-life training schemes, but the good
news is that just about anything works.  The most important rule of thumb
gained from experience is that things work best when the number of ham and
the number of spam trained on are approximately the same.
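
As a rough illustration of that rule of thumb, here is a minimal Python
sketch that checks whether two training counts are approximately balanced.
The function names and the max_ratio threshold are hypothetical, chosen only
for this example; they are not SpamBayes settings or part of its API.

```python
def training_balance(num_ham, num_spam):
    """Return the ratio of the larger training count to the smaller one."""
    if num_ham <= 0 or num_spam <= 0:
        raise ValueError("need at least one trained ham and one trained spam")
    return max(num_ham, num_spam) / min(num_ham, num_spam)


def is_roughly_balanced(num_ham, num_spam, max_ratio=2.0):
    """True if neither category outnumbers the other by more than max_ratio.

    max_ratio is an arbitrary illustrative threshold (an assumption),
    not a value taken from SpamBayes.
    """
    return training_balance(num_ham, num_spam) <= max_ratio
```

For instance, 500 ham against 450 spam counts as balanced under this
threshold, while 2000 ham against 300 spam does not.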

> I was thinking of having a folder with known bad spam, which would
> not be deleted and kept in Outlook today, along with known ham,
> kept in another folder, each having about 500 messages to compare
> incoming spam to, which would periodically take new messages (ham
> and spam), put them in the respective folder, and then periodically
> delete old spam/ham messages.  This would keep an updated list of
> spam/ham, to stay on par with the spammers (they always were one
> step ahead any kind of spam filtering software).  any suggestions
> would be appreciated...

I created a distinct .pst file, which holds (just) my Ham and Spam
training-data folders.  SpamBayes is configured to move spam directly into
that Spam folder, and Unsures into a "Z UNSURE" folder in my main .pst file
("Z UNSURE" so it falls to the bottom of the tree display).  About once a
day I move all the new spam into my Deleted Items folder, which has a Spam
score column and is set to sort on that.  Z UNSURE is also set to sort on
Spam score.  That makes it easy to eyeball one end of the displays for
mistakes.  When I need to train on a ham, I drag a *copy* of it to the Ham
training folder (hold down the right mouse button while dragging, and select
Copy from the context menu that pops up when you release the button).

After seeding with a few hundred messages of each kind, you're probably going
to find that you don't need much training anymore.  This isn't a rule-based
system, so spammers can't evade it by "learning the rules" and crafting spam
to get around them.  On rare occasions they invent a new way of obfuscating
HTML that's actually effective, and I notice that by staring at the guts of
low-scoring Unsures; then I check in a change to the tokenizer to
de-obfuscate it; I think I've done that about 3 times over the past year.
Spammers really aren't getting any better at fooling this system.  This
isn't surprising, since rule-based systems dominate the commercial
spam-blocking market, and spam is a mass-market game.  SpamBayes isn't yet
worth the bother of targeting.  When you see a low-scoring spam, train on
it; the system does learn.
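
To make "no rules, it learns from what you train" concrete, here is a
deliberately simplified Python sketch of per-token learning.  It is not the
SpamBayes classifier: the class and method names are hypothetical, the
tokenizer is a bare whitespace split, and the real system adjusts and
combines per-token evidence far more carefully.  The sketch only shows that
a token's spam estimate comes entirely from the training counts.

```python
from collections import Counter


class ToyTokenStats:
    """Toy per-token statistics (illustrative only, not SpamBayes code)."""

    def __init__(self):
        self.ham = Counter()   # token -> number of ham messages containing it
        self.spam = Counter()  # token -> number of spam messages containing it
        self.nham = 0
        self.nspam = 0

    def train(self, text, is_spam):
        # Count each distinct token once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def spamprob(self, token):
        """Estimate how spammy a token is; 0.5 means no evidence either way."""
        hamratio = self.ham[token] / self.nham if self.nham else 0.0
        spamratio = self.spam[token] / self.nspam if self.nspam else 0.0
        if hamratio + spamratio == 0:
            return 0.5
        return spamratio / (hamratio + spamratio)
```

A token the toy model has never seen sits at 0.5; once a spam containing
it is trained, its estimate moves toward 1.0, with no rule ever written down.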

Anyway, there are a couple of reasons I use a distinct .pst file for my
training data.  One is so that I can retrain from scratch in an eyeblink.
Since I work on the guts of this system, that's important so I can judge the
effect of changes.  It also makes it possible to recover from database
corruption in an eyeblink, but I've never seen that happen (*some* people
do -- this still isn't understood, alas).  But the primary reason I use a
different .pst is that I need to copy my main .pst file between desktop and
laptop several times per week, and it saves major "sit and wait" time to
keep that file as small as possible.  I don't bother trying to keep the
classifiers in sync across machines; they each have their own database.

Every now & again I blow away my classifier and start over from scratch.
This is just because I enjoy watching it learn.  There are also a number of
practical advantages to keeping your training database relatively small.  If
you end up maintaining a database with many thousands of ham and spam, that
has a way of turning into its own kind of time sink.

Ah!  Add a Spam score column to your Ham and Spam training folders, and sort
on that column.  Every now & again rescore all the messages in them
(SpamBayes -> Filter messages ...).  Look at "the wrong end" then (at
high-scoring training ham and low-scoring training spam).  If you made any
mistakes in classifying (everyone eventually does!), they'll almost always
show up at the wrong end.  The worst thing you can do with this kind of
system is train a message into the wrong category -- it has no predefined
notions of what "ham" and "spam" are, and believes whatever you tell it to
believe.  That's mostly a great strength, but it's also a great weakness
when you make a mistake.
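
The "wrong end" check can be sketched in Python as follows.  This is only an
illustration, not how the Outlook add-in works: the function name, the
(subject, label, score) tuples, the 0.0 to 1.0 score scale, and the 0.5
cutoff are all assumptions made for the example.

```python
def wrong_end(scored, cutoff=0.5):
    """Return training messages whose score disagrees with their label.

    scored: iterable of (subject, label, score) tuples, where label is
    "ham" or "spam" and score is assumed to be a fraction in 0.0..1.0.
    The most suspicious messages come first.
    """
    suspects = []
    for subject, label, score in scored:
        if label == "ham" and score >= cutoff:
            suspects.append((subject, label, score))
        elif label == "spam" and score < cutoff:
            suspects.append((subject, label, score))

    def wrongness(item):
        # Distance from the end of the scale the message should sit at.
        _, label, score = item
        return score if label == "ham" else 1.0 - score

    return sorted(suspects, key=wrongness, reverse=True)


training = [
    ("lunch on friday", "ham", 0.02),
    ("flashy newsletter", "ham", 0.91),
    ("cheap pills", "spam", 0.99),
    ("odd one-liner", "spam", 0.10),
]
for subject, label, score in wrong_end(training):
    print(f"{score:.2f}  trained as {label}: {subject}")
```

Anything this prints deserves a second look: either the message was trained
into the wrong category, or it is an unusual example of its kind.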

Caution:  Don't get too complacent too soon.  Many kinds of commercial email
you want (whether ordering products online, or getting a flashy company
newsletter) have a lot in common with most people's idea of what spam is,
because they're all trying to sell you something.  The language of
advertising is distinctive, and commercial email you want is quite likely to
score as a high Unsure, or even as Spam, the first time or two you get one
from a given company.  Just keep your eyes open and train accordingly.

Also remember that it doesn't matter what anyone *else* insists "spam" is.
For example, I get some commercial email from companies that have a good
case for sending it to me (for example, I was a customer, or even signed up
for a newsletter once), but that I don't want anymore.  I just tell my
classifier it's spam, and I'm not bothered with it anymore.  There are also
some kinds of dubious messages I enjoy (spam *can* be highly entertaining,
in a train-wreck sort of way), and I tell my classifier those are ham.
That's OK with SpamBayes -- it's happy to classify any way you tell it to.