OT: spam filtering idea

Paul Wright -$P-W$- at verence.demon.co.uk
Mon Jan 13 15:07:45 EST 2003


In article <7xr8bhhy0n.fsf at ruckus.brouhaha.com>,
Paul Rubin  <phr-n2002b at NOSPAMnightsong.com> wrote:
>The next step is to collect the frequency statistics at various
>honeypots around the net, automatically combining them and transferring
>them to public databases.  Your filter can then retrieve new
>statistics over the net every few hours.  Any spam you receive will
>probably also hit a honeypot at about the same time that you get it.
>So since the statistics you've retrieved are weighted for the latest
>and freshest spam, you should be able to kill it very effectively.

I thought the point of Bayesian filtering was that it learns from your
own spam and your own legitimate email, so statistics built from what
other people consider spam wouldn't be as effective. I'm no expert on this, though.
I expect Tim Peters will be along in a minute :-)
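
To make the per-user point concrete, here is a minimal sketch of a
token-based naive Bayes scorer. It is an illustration only: the
TokenFilter class, the tokenizer and the add-one smoothing are
assumptions made for the example, not the method of any filter
mentioned in this thread. Two users train on their own mail and then
score the same message very differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TokenFilter:
    """Toy naive Bayes scorer trained on one user's own mail (hypothetical)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each token at most once per message.
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, text):
        # Start from the smoothed prior odds, then add per-token log-odds.
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for tok in set(tokenize(text)):
            p_spam = (self.counts["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# A mortgage broker sees "mortgage" in legitimate mail...
broker = TokenFilter()
broker.train("mortgage rates for the client meeting", "ham")
broker.train("refinance mortgage application attached", "ham")
broker.train("cheap pills online now", "spam")

# ...while a student only ever sees it in spam.
student = TokenFilter()
student.train("lecture notes for the meeting", "ham")
student.train("python homework attached", "ham")
student.train("refinance your mortgage now", "spam")

msg = "mortgage application"
print(broker.spam_probability(msg), student.spam_probability(msg))
```

With this tiny training set the broker's filter scores the message
well below 0.5 and the student's scores it above 0.5, which is the
sense in which statistics pooled from other people's mail could
mislead either of them.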

<http://www.jerf.org/irights/2002/11/18.html> argues that human malice
can and will defeat Bayesian filters, and that widespread adoption of
them will end up making spam harder to recognize by hand. I'm a little
concerned that the author of the article overestimates the intelligence
of spammers, but I suppose there's a selection pressure on them to get
more cunning as time goes on. The people who successfully spam my
Hotmail spam trap these days are certainly getting cleverer, presumably
in response to Brightmail filtering.

A system which works by collecting mail at honeypots would be better off
reporting hashes of message bodies to something like Vipul's Razor or
the Distributed Checksum Clearinghouse. That said, the obvious spammer
response when people do that is to make each recipient's copy more and
more dissimilar, which is another case where human malice can probably
defeat automated attempts to find similar messages. The DCC's
creator has said that he thinks that it will eventually be most useful
against "mainsleaze", that is, spam from big businesses who will not
want to use the sort of filter-evading tactics which are popular with
the "enlarge your naked cheerleaders"[1] crowd.
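
The checksum idea, and the way per-recipient variation undermines it,
can be sketched as follows. This is a deliberately naive illustration:
it uses a plain SHA-256 of the whitespace-normalized body, and the
Clearinghouse class is a hypothetical in-memory stand-in for a shared
database. It does not reproduce the actual algorithms or protocols of
Vipul's Razor or the DCC.

```python
import hashlib
from collections import Counter


def body_checksum(body):
    """Hash a message body after collapsing whitespace and case."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Clearinghouse:
    """Hypothetical in-memory stand-in for a shared checksum database."""

    def __init__(self, threshold=2):
        self.reports = Counter()
        self.threshold = threshold

    def report(self, body):
        # A honeypot or user reports one sighting of this body.
        self.reports[body_checksum(body)] += 1

    def is_bulk(self, body):
        # A body counts as bulk once enough identical copies were reported.
        return self.reports[body_checksum(body)] >= self.threshold


db = Clearinghouse(threshold=2)
db.report("Buy   cheap pills NOW")
db.report("buy cheap pills now")

print(db.is_bulk("Buy cheap pills now"))         # same body after normalization
print(db.is_bulk("Buy cheap pills now, Alice"))  # per-recipient variation
```

The first lookup matches because the copies differ only in case and
spacing; the second does not, because a single extra word changes the
hash completely. That is the weakness a spammer exploits by making each
copy different.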

[1] How many boneheaded keyword filters will now bounce this post when
it goes out as mail on the python list, I wonder? There's an awful lot
of snake oil out there being sold as spam filters.

-- 
Paul Wright | http://pobox.com/~pw201 |



