[Spambayes] Any prospect of spambayes working with qmail?

Fri Feb 14 10:16:03 EST 2003

From: Neale Pickett [mailto:neale at woozle.org]
> You can probably set up a pretty good wordlist by training on a week's
> worth of collected ham and spam--less if you're a bigger site. But unless
> you constantly retrain it, your accuracy will gradually degrade. You have
> to keep retraining the classifier as your spam and ham change in nature.

<alert>Rambling philosophical post ahead</alert>

This is still my biggest worry with ongoing use of Spambayes, from an
end-user point of view. I have two setups: one at work using Outlook with
the plugin, and one at home using pop3proxy.

Both work fine at the moment, but I'm starting to suspect there's an
increase in the number of unsures I'm getting (still no significant FPs or
FNs, so to some extent this is all in the noise...)

[Feature request: Would it be possible or useful for pop3proxy to maintain
stats on numbers of unsures/ham/spam per day or whatever? Could be useful
for ongoing review...]
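
As a rough illustration of the kind of per-day tally meant here, a minimal
sketch follows. It is not part of pop3proxy: the (date, label) record format
and the function names daily_tallies and unsure_rate are hypothetical, chosen
only for this example.

```python
from collections import Counter, defaultdict
from datetime import date


def daily_tallies(records):
    """Count classifications per day.

    `records` is an iterable of (date, label) pairs, where label is one of
    "ham", "spam" or "unsure". This record format is an assumption made for
    the sketch, not something pop3proxy produces.
    """
    tallies = defaultdict(Counter)
    for day, label in records:
        tallies[day][label] += 1
    return dict(tallies)


def unsure_rate(counter):
    """Fraction of one day's messages that were classified as unsure."""
    total = sum(counter.values())
    return counter["unsure"] / total if total else 0.0


# Made-up sample data, for illustration only.
sample = [
    (date(2003, 2, 13), "spam"),
    (date(2003, 2, 13), "spam"),
    (date(2003, 2, 13), "unsure"),
    (date(2003, 2, 13), "ham"),
    (date(2003, 2, 14), "spam"),
]
tallies = daily_tallies(sample)
```

Watching unsure_rate over a few weeks would show whether the number of
unsures is really creeping up or is just noise.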

My ham:spam ratio at work is probably around 1:1, so that's not too bad.
At home, though, ham:spam is something ridiculous like 1:25 (I filter out
mailing list traffic into a local newsserver before spambayes gets a look
in).

At work, I train using the natural Outlook plugin approach, which is
basically training on unsures only. My DB at work has about 8000 ham
and 6000 spam. At home, I train on everything: I regularly go through
the pop3proxy web interface and train on all the outstanding messages
(I never mark anything as "discard"). I don't have the DB size figures
from home with me, but I think the training DB is very spam-heavy.

Both databases have Tim's "experimental spam/ham imbalance flag" set to the
default (true, I believe). I don't know whether that's going to matter, but
I worry it might start devaluing spam clues at home, where I have so few
ham to compare with.

I dunno, this really isn't much more than rambling. I have no stats to
prove anything, and no real complaints. I'm just spoiled - things are so
much better nowadays that reviewing 3 or 4 unsures is a great chore...

I know there have been some experiments in the past done on training
methods, and they were basically inconclusive (IIRC). I guess what I'm
wondering is whether there's anything new to say on the matter now that
people have been running spambayes "for real" for a decent time. One
possibility I'd thought of is to do intermittent training - start with an
empty database (or maybe one preseeded with a small representative message
base), then train for a week or two (which will tune the DB a bit). Then
stop training for a while (a couple of months) and then train on everything
for a week. Repeat the stop/train cycle. The idea being that this would
catch new spam techniques, without needing too much ongoing training. The
downside is that I can see no way of testing this approach.
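
The stop/train cycle could be sketched roughly as below. This only decides
which days are training days; it does not touch Spambayes at all. The
function name in_training_window and the default lengths (14 days of initial
training, 60 days off, 7 days on) are assumptions picked to match "a week or
two", "a couple of months" and "a week".

```python
def in_training_window(day, initial_days=14, rest_days=60, train_days=7):
    """Return True if `day` (0-based, counted from the first day of use)
    falls in a training period of the intermittent schedule.

    All three lengths are illustrative guesses, not tested values.
    """
    if day < 0:
        raise ValueError("day must be non-negative")
    # Initial tuning period: train on everything.
    if day < initial_days:
        return True
    # Afterwards, repeat: rest for rest_days, then train for train_days.
    offset = (day - initial_days) % (rest_days + train_days)
    return offset >= rest_days


# Days on which training would happen during the first 100 days.
training_days = [d for d in range(100) if in_training_window(d)]
```

With these defaults, days 0-13 are training days, days 14-73 are rest,
days 74-80 are training again, and the cycle restarts on day 81.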

Hey - if I wrote up a small document on the various possible training
methods (there aren't that many that I can think of) would that be of
any use for the documentation?

Any thoughts?
Paul.