Fighting Spam with Python

Fri Aug 26 10:02:40 EDT 2005

On Thu, 25 Aug 2005 13:22:53 -0400, François Pinard wrote:
>[David MacQuigg]
>
>> The key new features needed in a spam filter are the ability to
>> extract the sender's identity (not that of the latest forwarder), and
>> to factor into the spam score the reputation of that identity.
>
>This will only work if your system is immune to forgeries, while being
>largely widespread.

Stopping forgery is what the new authentication methods are all about.
Getting these methods widely and effectively used is our big
challenge, and one that I hope to help meet with my efforts.  There
are a bunch of pieces that need to work together more smoothly.
That's where Python comes in.  There are some challenging constraints,
such as the requirement that the system work without government
regulation.  I've got a first draft of a website for open-mail.org,
temporarily at <http://purl.net/macquigg/email/registry>.
Suggestions are welcome.

>> In the flow we envision, the spam filter is the final process, used
>> only on the 5% that is hard to classify.  80% will get an immediate
>> reject.  15% will get an immediate accept without filtering, because
>> the sender is authenticated and has a good reputation.  Eventually,
>> all reputable senders will join the 15%, and the 5% will shrink to
>> where we can ignore it.
>
>It's fun to read statistics about a vision! :-)

The 80% is real; see <http://messagelabs.com/emailthreats>.  As to how
the remaining 20% will split, that's a guess, but one that I think is
realistic.  See <http://www.spamhaus.org/effective_filtering.html> for
comparable numbers using only IP blacklists and spam filtering.

The 5% still needing filtering will be mail from senders that don't
offer any authentication, or that authenticate with an identity that
has not yet acquired a reputation.
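
To make the reject / accept / filter flow concrete, here is a minimal
Python sketch.  Everything in it is an assumption made for
illustration: the names (Verdict, triage), the three authentication
outcomes ("pass", "fail", "none"), the 0.0 to 1.0 reputation scale, the
two thresholds, and the choice to treat a failed authentication as a
forgery and reject it.  It is not code from any existing filter or
registry.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    REJECT = "reject"   # refuse immediately
    ACCEPT = "accept"   # deliver without content filtering
    FILTER = "filter"   # hand over to the statistical spam filter


# Hypothetical thresholds on a 0.0 (worst) to 1.0 (best) reputation scale.
BAD_REPUTATION = 0.2
GOOD_REPUTATION = 0.8


def triage(auth_result: str, reputation: Optional[float]) -> Verdict:
    """Decide what to do with a message before any content filtering.

    auth_result: "pass", "fail", or "none" (sender offered no authentication).
    reputation:  score for the authenticated identity, or None if that
                 identity has not yet acquired a reputation.
    """
    if auth_result == "fail":
        # Assumption: a failed check means a forged identity.
        return Verdict.REJECT
    if auth_result != "pass" or reputation is None:
        # No authentication, or an identity with no history yet.
        return Verdict.FILTER
    if reputation <= BAD_REPUTATION:
        return Verdict.REJECT
    if reputation >= GOOD_REPUTATION:
        return Verdict.ACCEPT
    # Authenticated, but the reputation is neither clearly good nor bad.
    return Verdict.FILTER
```

With these assumed thresholds, triage("pass", 0.95) gives
Verdict.ACCEPT, while triage("none", 0.95) falls through to
Verdict.FILTER because a reputation means nothing without an
authenticated identity.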

>> >You might find www.spambayes.org of interest, in several ways.
>
>Spambayes is surprisingly good as it already stands.

I haven't used Spambayes, but my experience with Spamnix (an offshoot
of SpamAssassin) is that statistical filters always have a few false
rejects.  In my case, that's about two per week.

The solution to this problem is a reliable system allowing receivers
to determine the identity and reputation of an unknown sender.  Then
we can safely ignore the spam.
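
For mail that still reaches the statistical filter, one possible way
to factor an identity's reputation into the spam score is a simple
weighted blend.  This is only a sketch: the function name, the 0.0 to
1.0 scales, and the default weight are all hypothetical, and a real
filter might combine the two signals quite differently.

```python
def adjusted_spam_score(filter_score: float,
                        reputation: float,
                        weight: float = 0.5) -> float:
    """Blend a content filter's spam probability with sender reputation.

    filter_score: 0.0 (clearly legitimate) to 1.0 (clearly spam).
    reputation:   0.0 (bad) to 1.0 (good) for the authenticated sender.
    weight:       share of the result taken from reputation (assumed 0.5).
    """
    for name, value in (("filter_score", filter_score),
                        ("reputation", reputation),
                        ("weight", weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    # A good reputation counts as low "spamminess", and vice versa.
    reputation_spamminess = 1.0 - reputation
    return (1.0 - weight) * filter_score + weight * reputation_spamminess
```

Under these assumptions, a message the filter scores at 0.8 drops to
about 0.4 when the sender's reputation is 1.0, which is the kind of
correction that could rescue an occasional false reject.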

-- Dave