[Spambayes] An alternate use

Tim Peters tim.one@comcast.net
Sat Nov 2 05:32:58 2002


[T. Alexander Popiel]
> A couple things have been kicking around in my head, and they've
> managed to come together in an interesting configuration and stick,
> so I'm going to make a quiet little proposal and see how much
> thunder it generates.
>
>
> First off, the observations:
>
> 1. Based on recent reports, spambayes works better when given full
>    data about everything that comes through, not just the mistakes.
>    This is predicted by the theory, too.

I'd say "representative data" more than "full data".  A random slice of real
life, consistently applied, should be enough.

> 2. spambayes is extremely sensitive to changes in the nature of
>    ham, and is moderately likely to classify any new topics/venues
>    as spam.

Almost certainly true for a classifier trained mostly by mistakes, ignoring
the correctly classified msgs.  The latter are needed to transform spamprobs
from serendipitous hapaxes into robust indicators.
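
To put a number on that, here's a rough sketch of the Robinson-style
adjustment the classifier applies to a word's spamprob; the 0.45 strength and
0.5 unknown-word prob are just the usual defaults, and the counts below are
made up for illustration:

    def adjusted_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
        # Shrink the raw spam/ham ratio toward the unknown-word prob x;
        # the shrinkage fades as the word shows up in more trained msgs (n).
        spamratio = spamcount / float(nspam)
        hamratio = hamcount / float(nham)
        prob = spamratio / (spamratio + hamratio)
        n = spamcount + hamcount
        return (s * x + n * prob) / (s + n)

    # A hapax seen once, in spam only, with 200 spam and 200 ham trained:
    print(adjusted_spamprob(1, 0, 200, 200))     # ~0.84 -- suggestive at best
    # The same word after it's appeared in 20 trained spam:
    print(adjusted_spamprob(20, 0, 200, 200))    # ~0.99 -- a robust indicator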

In my own classifier, I trained on *no* msgs from the Spambayes list at
first.  I left them out on purpose.  Recall that I reported on what happened
after I had a pretty decent classifier and scored more than 1,000 backed-up
spambayes msgs:  they were almost all scored as ham, despite not training on
the topic at all.  I expect this is more rule than exception for a properly
trained classifier.

What it *is* extremely sensitive to is advertising you sign up for.  I've
been at this thru a full billing cycle now, and marketing msgs from vendors
I want to do business with still score as Unsure until I've trained on several
msgs from a specific vendor.  Spam that uses the same words can keep
knocking them back into Unsure territory too.

> 3. spambayes is still a techie toy (though perhaps not for much
>    longer).  People with a little know-how are going to have a
>    much easier time training it than the average Joe.

Absolutely.

> 4. We want a large penetration into the mail-reading populace,
>    to better force the spammers to change tactics.

Heh.  It's still an irony of this project that I've never particularly
minded getting 100 spam per day <wink>.

> 5. Many people read mailing lists.  In fact, for high-volume
>    mail users, mailing lists probably make up the majority of
>    their incoming mail (or at least their incoming ham).

True here.

> 6. A noticeable amount of spam gets relayed through mailing lists,
>    and most personal filters are notoriously bad about passing
>    it through because it comes from a whitelisted intermediary.

Indeed, that's why I still ignore most of the header lines.  python.org and
Mailman put so many "I touched this!" clues in the headers, and do such a
good job of stopping spam already, that if I pay attention to those clues
then almost none of the spam they let pass gets caught.
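
A minimal sketch of what "ignore most of the header lines" amounts to in
practice:  tokenize the body plus a tiny whitelist of headers, and let
Received, List-*, X-Mailman-* and friends fall on the floor.  The whitelist
here is illustrative, not what the real tokenizer uses:

    import email, re

    SAFE_HEADERS = ('subject',)     # illustrative, not the real list

    def tokenize(raw):
        msg = email.message_from_string(raw)
        # A few headers are worth mining; tag their words so they don't
        # collide with body words.
        for name in SAFE_HEADERS:
            for value in msg.get_all(name, []):
                for word in re.findall(r'\S+', value):
                    yield '%s:%s' % (name, word)
        # Everything the list manager stamped on the msg is simply skipped.
        body = msg.get_payload()
        if isinstance(body, str):
            for word in re.findall(r'\S+', body):
                yield word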

> 7. Most mailing lists keep archives of everything sent over the
>    list.

Yup.

> 8. Most mailing lists are single-topic, and anything off-topic
>    is unwanted.

Eh -- probably.  I started with the mailing-list version of
comp.lang.python, and there's a huge amount of traffic there that never
mentions Python.  The variety of ham on that group is quite amazing.  But it
contains almost no advertising beyond conference announcements, and I still
expect that accounts for the breathtaking results I get on my c.l.py tests
(2 mistakes out of 34,000 msgs, where one "mistake" is saying that a quote
of a full Nigerian-scam spam is itself spam).

> So, what I propose is that we specifically target mailing list
> managers (mailman and ecartis being the two obvious first
> targets) for spambayes integration.  I see two main modes for
> this: just adding headers for the less intrusive, and actually
> rejecting or forcing moderation for the heavily policed.

That's actually what started this project:  Barry Warsaw is GNU Mailman's
author, and he asked me to look into adapting Graham's scheme for
incorporation into Mailman.  Barry has been pretty much missing in action
here since then, but I expect him to take it up again.

> Training is easily accomplished by taking the list archives
> as a ham corpus and one of the spam collections floating
> around as a spam corpus.

That's exactly what I did, and it was anything but easy.  Mixed-source
corpora create a world of problems, and Mailman archives in particular preserve
*all* the distortions Mailman introduced into the headers.  Even on the more
general "python.org email" test I've been doing behind the scenes lately,
the headers are polluted by judgments from SpamAssassin, and goofy little
things like python.org's MTA inventing Message-Id lines out of thin air when
one doesn't come across on the wire.  There are lots and lots of traps here.
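
For what it's worth, any scheme that trains straight off list archives needs a
scrubbing pass first, roughly along these lines; the header names are just
examples of the kind of pollution to kill, and the real list is longer and
uglier:

    import email

    # Headers invented by the list manager, the MTA, or upstream filters --
    # exactly the "I touched this!" clues a classifier must not learn from.
    POLLUTION = ('x-spam-status', 'x-spam-level', 'x-mailman-version',
                 'x-beenthere', 'list-id', 'list-post', 'list-archive')

    def scrub(raw):
        msg = email.message_from_string(raw)
        for name in POLLUTION:
            del msg[name]       # deletes every occurrence of that header
        return msg.as_string()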

> Run the classifier over the training data to kick out all the false
> positives and false negatives for possible resorting, then retrain.
> Only the list owner has to be techie to do this, and list owners are
> more likely to be techie than not (they set up a mailing list, after
> all).  Periodic retraining can be handled in the same way.
>
> In the case of adding headers, we'll want to avoid collisions
> with personal use of spambayes, too.  I suggest tagging the
> X-Spambayes-Disposition header (or whatever we call it) with
> some identifier for which classifier generated the rating,
> so that multiple X-Spambayes-Disposition lines are distinguishable.
> Something like:
>
>   X-Spambayes-Disposition: Spam by spambayes@python.org
>   X-Spambayes-Disposition: Unsure by pennmush@pennmush.org
>
> Personal classifiers could leave off the 'by' section.
>
> Heck, make it so that X-Spambayes-Disposition lines are turned
> into words similar to the mailer lines, and then personal
> classifiers can use the judgements of list classifiers as clues.

Easy to spoof, and I'm sure spammers would pick up on that quickly.
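
For concreteness, here's roughly what the list-side tagging could look like;
the header name, the "by" identifier, and the 0.20/0.90 cutoffs are all just
the proposal above plus placeholder numbers, nothing settled:

    import email

    def tag(raw, score, classifier_id='spambayes@python.org',
            ham_cutoff=0.20, spam_cutoff=0.90):
        msg = email.message_from_string(raw)
        if score >= spam_cutoff:
            verdict = 'Spam'
        elif score <= ham_cutoff:
            verdict = 'Ham'
        else:
            verdict = 'Unsure'
        # Adds a header; it doesn't replace one already added by a personal
        # classifier, so multiple dispositions can coexist.
        msg['X-Spambayes-Disposition'] = '%s by %s' % (verdict, classifier_id)
        return msg.as_string()

Of course a spammer can add the very same line themselves, which is exactly
the spoofing worry.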

> Doing this sort of integration into mailing list managers takes
> advantage of some 'weaknesses' of spambayes, and could be of
> great benefit to many people beyond just those with the
> wherewithal to train and run the filter.

That was Barry's idea, yes <wink>.  I'll leave it to him to resume this
battle.  One idea we kicked around was to add a

    If this looks like spam, click here:  http://yadda.yadda.yorg/abc?=etc

line at the bottom of each mailing-list msg.  An automated system on the
server would collect and organize votes.  There's no intention that users
get to vote on what *is* spam; the real point is more devious:  a msg that
*nobody* claims is spam almost certainly isn't spam, so it's really most
valuable as a way to identify ham.  That is, if nobody claims msg X is spam
within a few days, it's almost certainly the case that X is safe to add to
the ham training.  That seems so certain that it could be automated.  Msgs
that got "weveral" spam votes would be brought to the list admin's
attention, for human judgment about whether to classify them as errors.
Automating *that* part gets too close to censorship-by-vocal-minority for my
tastes, so if Barry implemented that part I'd kill him <wink>.
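
A back-of-the-envelope sketch of the server side, under the same assumptions;
the three-day grace period, the "several" threshold, and the in-memory storage
are all made up for illustration:

    import time

    GRACE_PERIOD = 3 * 24 * 3600    # "a few days", in seconds
    ADMIN_THRESHOLD = 3             # "several" spam votes

    class VoteTracker:
        def __init__(self):
            self.posted = {}        # msgid -> time the msg went out
            self.votes = {}         # msgid -> number of "this is spam" clicks

        def record_post(self, msgid):
            self.posted[msgid] = time.time()
            self.votes[msgid] = 0

        def record_vote(self, msgid):
            if msgid in self.votes:
                self.votes[msgid] += 1

        def sweep(self, now=None):
            """Return (train_as_ham, refer_to_admin) lists of msgids."""
            if now is None:
                now = time.time()
            ham, admin = [], []
            for msgid, when in list(self.posted.items()):
                if self.votes[msgid] >= ADMIN_THRESHOLD:
                    # Never auto-train as spam; a human decides.
                    admin.append(msgid)
                    del self.posted[msgid]
                elif self.votes[msgid] == 0 and now - when > GRACE_PERIOD:
                    # Nobody called it spam for a few days: safe to train as ham.
                    ham.append(msgid)
                    del self.posted[msgid]
            return ham, admin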