Graham's spam filter (was Lisp to Python translation criticism?)
Christopher Browne
cbbrowne at acm.org
Tue Aug 20 20:14:32 EDT 2002
In the last exciting episode, "David LeBlanc" <whisper at oz.net> wrote:
> Looking it over, I wonder if some optimizations aren't possible or
> desirable. One that came to mind is to retain url's/urn's as distinct
> tokens.
I'd suggest tagging each token with the message header it came from,
so that you might get, out of:
Subject: Re: Graham's spam filter (was Lisp to Python translation criticism?)
the set of tokens:
subject::re
subject::graham's
subject::spam
subject::filter
subject::was
subject::lisp
subject::to
subject::python
subject::translation
subject::criticism
Then do something similar with .signature material:
signature::a
signature::ago
signature::been
signature::bug
signature::by
signature::fixed
signature::for
signature::from
signature::guidelines
signature::hasn't
signature::in
signature::independently
signature::it
signature::long
signature::mail
signature::out
signature::pointing
signature::released
signature::report
signature::sending
signature::symbolics
signature::system
signature::that
signature::that
signature::the
signature::trivialize
signature::user's
signature::was
signature::yet
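A minimal sketch of that tagging in Python, for illustration only. The
function name prefixed_tokens and the tokenizing regex are assumptions
made here, not part of Graham's filter or of any posted code; the only
thing taken from the lists above is the "section::word" token shape.

```python
import re

# Assumed tokenizer: runs of letters/digits, allowing inner apostrophes
# so that "graham's" and "hasn't" survive as single tokens.
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def prefixed_tokens(section, text):
    """Return the set of lowercased words in text, tagged as section::word."""
    return {"%s::%s" % (section, word) for word in WORD_RE.findall(text.lower())}


subject = "Re: Graham's spam filter (was Lisp to Python translation criticism?)"
signature = (
    "Trivialize a user's bug report by pointing out that it was fixed\n"
    "independently long ago in a system that hasn't been released yet.\n"
    "-- from the Symbolics Guidelines for Sending Mail"
)

subject_tokens = prefixed_tokens("subject", subject)
signature_tokens = prefixed_tokens("signature", signature)
```

Because the result is a set, a word that occurs twice in one section
(such as "that" in the signature) shows up only once. Keeping the
section prefix means "subject::python" and "signature::python" are
counted as different tokens by whatever does the statistics.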
>> One obvious and immediate issue is that for an industrial-strength
>> filter, the database gets _huge_ (Graham's basic setup involved
>> 4000 messages each in the spam and nonspam corpora), and reading
>> and writing the database (even with cPickle) each time a spam
>> message comes through starts to become intensive.
> I am going to build a version to use Metakit. Should be good for up
> to about 10Mb of messages if I read the Metakit site right.
> One thing I don't see how to do is to add a corpus containing a new
> message (good or bad) to the database - i.e. update the
> database. Maybe Database.addGood() and Database.addBad()?
It works a whopping lot better if there's a whopping lot more than
just two categories...
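One way the update question might be handled with more than two
categories is sketched below. This is only an assumption about how such
a database could look: the class, the add() method, and the category
names are hypothetical, and addGood()/addBad() from the quoted proposal
appear merely as thin wrappers. Persistence (cPickle, Metakit) and the
scoring itself are left out.

```python
from collections import Counter, defaultdict


class Database:
    """Per-category token counts; categories are created on first use."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # category -> token -> occurrences
        self.messages = Counter()           # category -> messages seen

    def add(self, category, tokens):
        """Fold one message's tokens into the named category."""
        self.counts[category].update(tokens)
        self.messages[category] += 1

    # The two-category interface from the quoted proposal, kept as wrappers.
    def addGood(self, tokens):
        self.add("good", tokens)

    def addBad(self, tokens):
        self.add("bad", tokens)


db = Database()
db.add("python-list", ["subject::python", "subject::re"])  # hypothetical category
db.add("spam", ["subject::re", "subject::re"])
db.addGood(["subject::lisp"])
```

Adding a message is then an in-memory update of one category's counts,
and a new category needs no schema change.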
--
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/unix.html
Trivialize a user's bug report by pointing out that it was fixed
independently long ago in a system that hasn't been released yet.
-- from the Symbolics Guidelines for Sending Mail