OT: spam filtering idea

Tue Jan 14 21:47:18 EST 2003

[Paul Rubin]
> ...
> Spambayes is already working better than spamassassin?  Wow.

It depends on what you use it for.  It was intended to be used by a single
person on their own email, and it quickly learns so much about a single
person's quirks that even very early versions of the spambayes code did at
least as well as a well-maintained SpamAssassin.

For group use your mileage will vary.  General single-classifier tests on
all the email traffic going thru python.org tried to exclude personal email
accounts, leaving just the Python and Zope mailing list traffic, and a
number of small, private, special-interest mailing lists.  We know how well
spambayes did in those tests, but aren't so sure about how well SA did; I do
know spambayes caught a lot of spam that got beyond SA.  OTOH, python.org rejects
many msgs before SA ever sees them, so we have no idea how either system
would do on those.

The tests turned up one class of msg where SA had a real advantage:  very
brief administrivia requests to *-request addresses.  The ones that caused
trouble were typically a single-word msg like "unsubscribe" (itself a word
with high spamprob!), followed by a forward of a spam or off-topic
"conference announcement" that had leaked thru on the mailing list, *and*/or
a dozen kilobytes of employer-generated HTML disclaimers ("whirlygigs.com
is a regulated investment company, and is not responsible for the etc etc
etc").  An appreciable fraction of a percent of administrivia msgs look like
that.  SA did better on those because python.org's SA installation is tuned
to give a huge "ham boost" to any email sent to a *-request address.
spambayes has no gimmicks like that.
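
To make the "ham boost" idea concrete, here is a minimal sketch of a
hand-coded adjustment layered on top of a rule-based score.  It is not
SpamAssassin's rule syntax and not python.org's actual configuration:  the
names rule_score and is_spam, the boost of -10.0, and the threshold of 5.0
are all assumptions made up for illustration.

```python
# Hypothetical illustration of a hand-coded "ham boost"; not SA's real rules.
REQUEST_HAM_BOOST = -10.0  # assumed value: a large negative (ham-ward) adjustment


def rule_score(base_score, recipient):
    """Return base_score, lowered if the recipient is a *-request address."""
    local_part = recipient.split("@", 1)[0].lower()
    if local_part.endswith("-request"):
        return base_score + REQUEST_HAM_BOOST
    return base_score


def is_spam(base_score, recipient, threshold=5.0):
    """Call a msg spam when its adjusted score reaches the assumed threshold."""
    return rule_score(base_score, recipient) >= threshold
```

With these made-up numbers, a msg whose other rules add up to 8.0 is flagged
when sent to python-list@python.org but passes when sent to
python-list-request@python.org.  A purely statistical classifier has no such
special case unless someone adds one.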

Some of the personal email that snuck thru was also troublesome.  Everyone
signs up for *some* HTML newsletters that most other people would consider
to be spam.  Train a single classifier to accept the financial newsletters I
want to see, and the classifier becomes weaker at weeding out "similar"
stuff for other people.  Or if you happen to be resigned to the size of your
trouser snake and would rather not be reminded of it, training a shared
classifier to reject penis-enlargement spam stops Barry from getting the
help he so desperately needs.

> I guess I'll look into switching.  It's seemed to me up til now that
> it really takes a mixture of dynamic (Bayesian)

There's really nothing Bayesian about the spambayes code, except for a
Bayesian adjustment to the estimates of individual words' spam
probabilities.  The probability combining scheme isn't Bayesian at all.  An
article by Gary Robinson about the math behind spambayes will be published
in Linux Journal soon, followed the next month with an article by Richie
Hindle about the more practical aspects of the system.
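
For the curious, here is a rough sketch of the two pieces mentioned above,
written from memory of Gary Robinson's published ideas rather than copied
from the spambayes source.  The function names and the default values of s
and x are assumptions for illustration.  adjusted_prob is the Bayesian
adjustment:  it pulls a word's raw spam probability p toward a prior guess x
when the word has been seen only n times.  chi2_combine is a chi-squared
combining of the per-word probabilities, which is not a Bayesian calculation.

```python
import math


def adjusted_prob(p, n, s=1.0, x=0.5):
    """Shrink raw probability p toward prior x; s is the prior's strength."""
    return (s * x + n * p) / (s + n)


def chi2Q(x2, v):
    """Probability that a chi-squared variate with even v d.o.f. is >= x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def chi2_combine(probs):
    """Combine per-word spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)  # evidence for spam
    H = 1.0 - chi2Q(-2.0 * ln_ham, 2 * n)   # evidence for ham
    return (S - H + 1.0) / 2.0
```

A msg full of strongly spammy words scores near 1, one full of strongly hammy
words scores near 0, and a msg whose clues cancel out lands near 0.5.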

> and hand-coded (SA) filtering

For use by a group of unrelated individuals (say, an ISP, or corporate email
server), I expect that's true.

> I've heard the next version of SA will incorporate Bayesian filtering
> in addition to what it already does.

SpamAssassin's Matt Sergeant hung out on the spambayes mailing list for
quite a while, and picked up some number of the techniques for SA's use.
More power to 'em, although I no longer have a "spam problem" so stopped
paying attention <wink -- but I still get about 100 spam a day, and it all
ends up in my spam folder now>.