spam classification breaker

John Graham-Cumming google at jgc.org
Thu Feb 5 17:16:00 EST 2004


Skip Montanaro <skip at pobox.com> wrote in message news:<mailman.1251.1076002061.12720.python-list at python.org>...
> Mr. Graham-Cumming could have avoided the overhead of sending himself 10,000
> mails by simply selecting words from his archived public presence on the
> net: web pages, Usenet posts or archived mailing list posts associated with
> his email address.  I suspect his genetic algorithm would have been all but
> unnecessary.  (Google for "John Graham-Cumming" for example.)
> 
> This doesn't have to be a tedious process either.  In the course of normal
> scumbag email harvesting, all the crawler has to do is select a few
> non-trivial words from the harvested page and associate them with the email
> address(es) on that page.  After seeing the same email address a few times
> they would have a decent collection of hammy words for use in the "random
> words" block of later spam.

Yes, I've tested this, and it's possible to find hammy words this
way too, although it wasn't as effective as the technique I pointed
out. Nevertheless it is practical: in my experiments I looked at
the uncommon words found in the locus of my email address, and around
40% were pure ham!
 
Another way would be to spider the web site associated with the domain
in the email address. For example, to attack my address, spider
www.jgc.org.
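
As a rough sketch of the harvesting step (not anything I have seen a
spammer actually run), the following assumes the pages have already been
fetched as plain text. The names harvest, COMMON_WORDS and the sample
pages are hypothetical, and the stop list is far too small for real use.

```python
import re
from collections import defaultdict

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WORD_RE = re.compile(r"[A-Za-z]{5,}")  # crude "non-trivial word" test: 5+ letters

# Hypothetical stop list (assumption); a real one would be much larger.
COMMON_WORDS = {"about", "there", "which", "their", "would", "could", "other"}

def harvest(pages):
    """Map each email address to the uncommon words on pages that mention it."""
    words_by_address = defaultdict(set)
    for text in pages:
        addresses = set(EMAIL_RE.findall(text))
        # Blank out the addresses so their parts are not counted as words.
        body = EMAIL_RE.sub(" ", text)
        candidates = {w.lower() for w in WORD_RE.findall(body)} - COMMON_WORDS
        for address in addresses:
            words_by_address[address.lower()] |= candidates
    return dict(words_by_address)

# Made-up sample pages for illustration only.
pages = [
    "Contact alice@example.org about the sprocket calibration schedule.",
    "alice@example.org wrote: the calibration firmware needs testing.",
]
result = harvest(pages)
print(sorted(result["alice@example.org"]))
```

Words that recur across several pages for the same address (here
"calibration") are the likeliest candidates for a "random words" block.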

All of this indicates that it should be possible to attack Bayesian
filters with a variety of techniques that rely on the fact that they
are naive (i.e. they'll accept a hammy word no matter where it
appears).
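
To illustrate what "naive" means here, this is a simplified sketch of a
naive Bayes combination, not the scoring rule of any particular filter.
The per-word probabilities in WORD_SPAM_PROB are made-up numbers and
spam_score is a hypothetical helper. Because the words are treated as
independent and their order is ignored, hammy padding pulls the score
down wherever it is placed.

```python
import math

# Made-up P(spam | word) values for illustration (assumption).
WORD_SPAM_PROB = {
    "viagra": 0.99,
    "free": 0.90,
    "python": 0.02,
    "mailman": 0.03,
    "tokenizer": 0.05,
}
UNKNOWN = 0.5  # neutral probability for words never seen in training

def spam_score(message):
    """Combine per-word probabilities by summing log-odds; word order is ignored."""
    log_odds = 0.0
    for word in message.lower().split():
        p = WORD_SPAM_PROB.get(word, UNKNOWN)
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))

spam = "free viagra"
padded_front = "python mailman tokenizer free viagra"
padded_back = "free viagra python mailman tokenizer"

for text in (spam, padded_front, padded_back):
    print(f"{spam_score(text):.4f}  {text}")
```

With these numbers the unpadded message scores close to 1, while both
padded versions fall below 0.5 and score the same as each other apart
from floating-point rounding.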

John.


