[Spambayes] To think like a spammer...

Sat, 28 Sep 2002 20:34:21 -0700

* Guido van Rossum <guido@python.org> [2002-09-28 22:33:13 -0400]:
> > The spambayes scheme (and others like it that I've seen) can be defeated
> > easily, with something like this...
> > 
> > THIS  IS  A   F A N T A S T I C   O P P O R T U N I T Y ! !
> 
> It's an arms race.  I expect that the classification scheme can be
> kept relatively constant (the math doesn't change) but the tokenizing
> (better called feature extraction) scheme can and should be adapted
> occasionally, to deal with new ways of hiding spam.  This particular
> style can easily be recognized[*] *if* it becomes popular among
> spammers; for anything you can come up with there's a tokenizer that
> recognizes it.

I agree that this will eventually become a problem of lexical analysis.

> But spammers will only start worrying if their return rates go down,
> and that will only happen once almost everybody is using anti-spam
> technology.  We've got a long way to go before that's the case.  So
> let's not be stymied by worries about what the spammers can do.
>
> [*] I wouldn't even bother trying to recover the words FANTASTIC
> OPPORTUNITY.  This style is so completely unseen in ham that simply
> looking for many consecutive one-letter words and inserting a token
> representing such a presence would most likely be enough.

True enough that this isn't seen in ham at all.  However, a single
token that flags it can still be overpowered by the combined evidence
of all the other tokens in the message.
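
To make both points concrete, here is a minimal sketch in Python.  It is
not taken from the spambayes code.  The token name, the run length of
four letters, and every probability below are made-up assumptions, and
the combining rule is the plain Graham-style naive Bayes product, used
only for illustration.  It may not match what spambayes actually does.

```python
import re
from math import prod

# Hypothetical synthetic token name; not from the spambayes source.
SPACED_TOKEN = "synthetic:spaced-out-letters"

# Four or more single letters in a row, separated by whitespace.
# The threshold of four is an arbitrary assumption.
SPACED_RUN = re.compile(r"(?<!\S)(?:[A-Za-z]\s+){3,}[A-Za-z](?!\S)")


def tokenize(text):
    """Split on whitespace and add one marker token if the text
    contains a run of consecutive one-letter words."""
    tokens = [word.lower() for word in text.split()]
    if SPACED_RUN.search(text):
        tokens.append(SPACED_TOKEN)
    return tokens


def combine(probs):
    """Graham-style combination of per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


subject = "THIS  IS  A   F A N T A S T I C   O P P O R T U N I T Y ! !"
print(SPACED_TOKEN in tokenize(subject))

# Invented numbers: the marker token looks very spammy (0.99), but the
# message also carries five mildly hammy tokens (0.2 each).
marker_alone = combine([0.99])
marker_plus_ham = combine([0.99] + [0.2] * 5)
print(marker_alone > 0.5, marker_plus_ham > 0.5)
```

With these invented numbers the marker token alone scores well above
0.5, but five mildly hammy tokens pull the combined score back below
it.  That is the sense in which one token can be overpowered.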

I had to laugh when Tim posted a spam that had slipped through all
previous tests in <LNBBLJKPBEHFEDALKOLCGEHEBHAB.tim.one@comcast.net>.
That is exactly the kind of message I'm talking about.

Regards,

-- 
Mark M. Hoffman
mhoffman@lightlink.com