Graham's spam filter (was Lisp to Python translation criticism?)

Christopher Browne cbbrowne at acm.org
Tue Aug 20 20:14:32 EDT 2002


In the last exciting episode, "David LeBlanc" <whisper at oz.net> wrote:
> Looking it over, I wonder if some optimizations aren't possible or
> desirable. One that came to mind is to retain URLs/URNs as distinct
> tokens.

I'd suggest going a step further and prefixing each token with the
message header it came from, so that you might get, out of:

  Subject: Re: Graham's spam filter (was Lisp to Python translation criticism?)

the set of tokens:
subject::re
subject::graham's
subject::spam
subject::filter
subject::was
subject::lisp
subject::to
subject::python
subject::translation
subject::criticism

Then do something similar with .signature material:

signature::a 
signature::ago 
signature::been
signature::bug
signature::by 
signature::fixed
signature::for
signature::from 
signature::guidelines 
signature::hasn't 
signature::in
signature::independently 
signature::it 
signature::long 
signature::mail
signature::out
signature::pointing 
signature::released 
signature::report 
signature::sending 
signature::symbolics 
signature::system 
signature::that 
signature::that 
signature::the
signature::trivialize 
signature::user's 
signature::was 
signature::yet 
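
A rough sketch of that tagging in Python, in case it helps. The function
name tagged_tokens, the token regex, and the choice to split the
signature at the conventional "-- " line are all my own assumptions, not
anything taken from Graham's filter or the translation under discussion,
and it only handles single-part plain-text messages:

```python
import re
from email import message_from_string

# Assumed token shape: starts with a letter, digit or "$", and may
# continue with apostrophes and dashes (so "graham's" stays whole).
TOKEN_RE = re.compile(r"[A-Za-z0-9$][A-Za-z0-9$'-]*")


def words(text):
    """Lower-cased tokens found in text, duplicates preserved."""
    return [w.lower() for w in TOKEN_RE.findall(text)]


def tagged_tokens(raw_message):
    """Tokenize a message, prefixing header and signature tokens.

    Header tokens become "<header>::<word>", signature tokens become
    "signature::<word>", and body tokens are left untagged.
    """
    msg = message_from_string(raw_message)
    tokens = []

    # Tag every header word with the lower-cased header name.
    for name, value in msg.items():
        prefix = name.lower()
        tokens.extend(f"{prefix}::{w}" for w in words(value))

    # Simplification: multipart messages are skipped entirely.
    body = "" if msg.is_multipart() else msg.get_payload()

    # Everything after the first "-- " line is treated as signature.
    text, _sep, signature = body.partition("\n-- \n")
    tokens.extend(words(text))
    tokens.extend(f"signature::{w}" for w in words(signature))
    return tokens
```

Duplicates are kept on purpose, on the assumption that the filter
counts occurrences rather than just presence.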

>> One obvious and immediate issue is that for an industrial-strength
>> filter, the database gets _huge_ (Graham's basic setup involved
>> 4000 messages each in the spam and nonspam corpora), and reading
>> and writing the database (even with cPickle) each time a spam
>> message comes through starts to become intensive.

> I am going to build a version to use Metakit. Should be good for up
> to about 10Mb of messages if I read the Metakit site right.

> One thing I don't see how to do is to add a corpus containing a new
> message (good or bad) to the database - i.e. update the
> database. Maybe Database.addGood() and Database.addBad()?

The approach works a whopping lot better if there are a whopping lot
more than just two categories...
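
Something like the following is what I have in mind, instead of a fixed
addGood()/addBad() pair. This is only a sketch: the class name
TokenDatabase and its methods are hypothetical, it keeps everything in
memory, and it says nothing about how you'd persist the counts (Metakit,
cPickle, or otherwise):

```python
from collections import Counter


class TokenDatabase:
    """Per-category token counts that can be updated one message at a time."""

    def __init__(self):
        self.counts = {}           # category name -> Counter of tokens
        self.messages = Counter()  # category name -> messages seen

    def add(self, category, tokens):
        """Fold one message's tokens into the named category."""
        self.counts.setdefault(category, Counter()).update(tokens)
        self.messages[category] += 1

    def count(self, category, token):
        """Occurrences of token seen so far in category (0 if unseen)."""
        return self.counts.get(category, Counter())[token]


db = TokenDatabase()
db.add("spam", ["subject::free", "free", "free"])
db.add("python-list", ["subject::python", "filter"])
db.add("lisp", ["subject::lisp", "filter"])
```

Adding a new message is then just another add() call with whatever
category name you like, and no existing corpus has to be re-read.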
-- 
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/unix.html
Trivialize   a user's bug report  by  pointing out that   it was fixed
independently long ago in a system that hasn't been released yet.
-- from the Symbolics Guidelines for Sending Mail


