[Spambayes] Tokenising clues

Tim Peters tim.one@comcast.net
Tue, 01 Oct 2002 16:07:53 -0400


[Matt Sergeant]
> ...
> We have a much more robust mailman detector already. And that's my point
> - a spammer can get around your naive "mailman detector" with a bunch of
> underscores anywhere in his message, but he has to work a lot harder to
> get around a more robust detection system (it's not invincible, but it
> would probably require him modifying his software).

Matt, we don't have *any* "mailman detector", and that's a key point.  We
generate "skip" tokens for every string longer than 12 chars, and that it
happened to catch a Mailman clue is pure luck.  It's not trying to *do*
anything specific.  We catch so many "Mailman clues", in fact, that I dare
not look at most of the header lines in my mixed-source data -- the Mailman
clues it would then pick up purely by luck are too strong.
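
To make "skip" tokens concrete, here's a rough sketch in Python of the idea --
not the project's actual tokenizer code, and the exact spelling of the token
is illustrative:

    def tokenize_word(word, maxlen=12):
        # Short words are tokens in their own right.
        if len(word) <= maxlen:
            yield word.lower()
        else:
            # Long strings aren't tokenized on their content; they collapse
            # into a generic "skip" token recording only the first character
            # and a rough length bucket, e.g. 40 underscores -> "skip:_ 40".
            yield "skip:%c %d" % (word[0], len(word) // 10 * 10)

So a Mailman footer separator and any other long run of punctuation produce
the same kind of token; nothing in there knows what Mailman is.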

As to a spammer trying to exploit it, not a problem.  No single word can
determine the outcome, and if spammers take to putting '-'*40 in their spam,
the system will learn to disregard it.
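
Why a copied clue stops working: a token's strength comes from how lopsided
its ham/spam counts are, so once spammers start using it too, it drifts back
toward neutral.  A hedged sketch -- the counts and the exact formula here are
illustrative, not the project's actual scoring:

    def spamprob(spam_count, ham_count, nspam, nham):
        # Compare the token's relative frequency in each corpus.
        spam_ratio = spam_count / float(nspam)
        ham_ratio = ham_count / float(nham)
        return spam_ratio / (spam_ratio + ham_ratio)

    # Today '-'*40 shows up almost only in ham, so it's a strong ham clue:
    print spamprob(5, 800, 1000, 1000)      # ~0.006
    # If spammers copy it, the counts even out and the token drifts toward
    # 0.5, where it carries essentially no weight in the combined score:
    print spamprob(700, 800, 1000, 1000)    # ~0.47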

I've done this experiment:  I ran my fat test, looked at the list of the top
50 discriminators, and purged them all from the database.  Then I ran my fat
test again.  The performance wasn't significantly worse.  If one set of
clues becomes worthless, it finds another set.  So long as spam is trying to
sell you something, it's going to *be* different, and the classifier will find
some way to notice.
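
For the curious, the purge itself is simple; a sketch (names like token_probs
are made up for illustration and aren't the project's actual database layout):

    def purge_top_discriminators(token_probs, n=50):
        # The strongest discriminators are the tokens whose spam probability
        # sits farthest from the neutral 0.5.
        strongest = sorted(token_probs,
                           key=lambda tok: abs(token_probs[tok] - 0.5),
                           reverse=True)[:n]
        for tok in strongest:
            del token_probs[tok]
        return strongest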

> So give the dog (spambayes) a bone. Let it eat all the information
> you can give it.

This is fine, provided it doesn't bloat the database size, or increase
classification time, without a compensating measurable improvement in
results.  Part of the tokenizer is as finicky as it is because I'm aiming to
keep size and time requirements in bounds too (so, e.g., I deliberately
don't tokenize Content-Transfer-Encoding, and note the presence or absence
of an Organization line but without tokenizing its value:  experiments
showed that what I *do* do in these cases helped, but that the parts I left
out did not help).
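
A sketch of that "presence only" idea, using the Organization header as the
example (the token spellings are illustrative, not necessarily what the
tokenizer actually emits):

    def organization_tokens(msg):
        # msg is an email Message object; we emit one cheap token per
        # message instead of tokenizing whatever free text the header holds.
        if msg.get('Organization') is not None:
            yield 'organization:present'
        else:
            yield 'organization:absent'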

> None of it is going to hurt, or if it does you can chuck that out like
> you have been doing for a few weeks already with other tokenising ideas!

As a general rule, I add things that help, rather than adding lots of ideas at
once and then throwing out whatever doesn't help.  Our results have been moving
steadily in the right direction, so I'm going to stick with what works.