spam classification breaker

Tim Peters tim.one at comcast.net
Thu Feb 5 18:43:30 EST 2004


[Robin Becker]
> .... are you asserting that spammers don't have access to the pdf that
> users are filtering?

Sorry, I couldn't make sense of that question.

> Each filter may be unique, but they can be biassed. --

It doesn't matter, because these classifiers learn.  In the early days of
the Spambayes project, we experimented with throwing "the best" N clues
(both hammy and spammy) out of the database, where "the best" was a measure
of how often and how strongly a feature contributed to a correct
classification.  Through several iterations of that, overall performance
remained just as good -- the classifier learned to look for other things.
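
A toy sketch of that kind of experiment, not the Spambayes code: the training
messages, the helper names (train, score, drop_strongest), and the use of
"distance from 0.5" as the measure of a clue's strength are all assumptions
made for illustration.  The real experiment ranked clues by how often and how
strongly they contributed to correct classifications, and retrained between
iterations; this sketch only drops the strongest clues once and rescores.

```python
import math
from collections import Counter

def train(messages):
    """Return a smoothed spam probability for each token seen in training."""
    spam_counts, ham_counts = Counter(), Counter()
    nspam = nham = 0
    for text, is_spam in messages:
        tokens = set(text.lower().split())  # count each token once per message
        if is_spam:
            spam_counts.update(tokens)
            nspam += 1
        else:
            ham_counts.update(tokens)
            nham += 1
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        s = (spam_counts[tok] + 1) / (nspam + 2)
        h = (ham_counts[tok] + 1) / (nham + 2)
        probs[tok] = s / (s + h)
    return probs

def score(probs, text):
    """Sum of per-token log-odds; positive leans spam, negative leans ham."""
    tokens = set(text.lower().split())
    return sum(math.log(probs[t] / (1 - probs[t])) for t in tokens if t in probs)

def drop_strongest(probs, n):
    """Discard the n clues farthest from neutral (0.5); an assumed ranking."""
    ranked = sorted(probs, key=lambda t: abs(probs[t] - 0.5), reverse=True)
    return {t: probs[t] for t in ranked[n:]}

# Hypothetical training data, invented for this sketch.
messages = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("cheap loans apply now", True),
    ("python meeting notes attached", False),
    ("lunch meeting tomorrow", False),
    ("python patch review notes", False),
]

probs = train(messages)
pruned = drop_strongest(probs, 2)  # removes the two strongest clues

for text in ("buy cheap pills now", "python meeting notes"):
    print(text, "| full:", score(probs, text) > 0, "| pruned:", score(pruned, text) > 0)
```

With this tiny data set the two strongest clues are "cheap" and "now"; once
they are gone, the remaining weaker clues still push the sample spam message
to the spam side and the sample ham message to the ham side.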

If even the strongest features can be thrown away without harm, there's not
much use in trying to exploit small statistical bias.  It's not even clear
that any particular individual bias is widespread.  For example, "Nancy" is
a hammy word in my training data, but "Cecil" is spammy.  Is that universal?
Seems unlikely.  "Python" is very hammy for me, but is probably at best
neutral for most people; it may even be strongly spammy for them,
thanks to <http://www.python.com>'s advertising.  Etc.  The details of your
personal email life may be as unique as a fingerprint.




