Graham's spam filter

Christopher Browne cbbrowne at acm.org
Fri Aug 23 00:42:12 EDT 2002


In an attempt to throw the authorities off his trail, Oren Tirosh <oren-py-l at hishome.net> transmitted:
> On Thu, Aug 22, 2002 at 10:29:36AM -0600, Joseph A. Knapka wrote:
>> The analyzer takes about two minutes to go through my corpus of
>> about 2000 messages. The filter starts and loads the probability
>> dictionary in under five seconds. Doesn't seem like a non-starter
>> to me :-) 
>
> For a lot of standard mail components the easiest and most robust way 
> to interface to them is running an executable separately for each message. 
> In this case five seconds startup time may be a bit too much for sites 
> with high load.
>
>> (Of course, the user should never have to deal with
>> either program, except to configure it. The filter reads from
>> a POP3 or IMAP mailbox and writes the spam-free messages
>> either to a file or to another "sanitized" SMTP mailbox,
>> which is the one the user checks.)
>
> In this model the program is started once for multiple messages so a 
> somewhat slower startup is not an issue.

This model is in effect like a "database server" model.  You start up
a DBMS process once, and it loads in a bunch of data.  Once in memory,
access is quick, much more so than if you have to keep reading the data
in over and over again.

Caching is not a meaningful objection to that: even when the file is
already cached, part of the cost of loading the data is parsing what's
on disk.  Not parsing the data a bunch of times is The Win.
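
To make the parse-once point concrete, here is a rough sketch.  Every
name in it (ProbabilityTable, run_per_message, run_as_server) and the
JSON format are made up for illustration; this is not Joseph's filter.
It just counts how often the on-disk form gets parsed under each model.

```python
import io
import json


class ProbabilityTable:
    """Hypothetical stand-in for the filter's word-probability dictionary."""

    parse_count = 0  # how many times the serialized form has been parsed

    def __init__(self, stream):
        ProbabilityTable.parse_count += 1
        self.probs = json.load(stream)  # the parsing cost is paid here

    def score(self, message):
        # Toy score: mean probability of the known words, 0.5 if none are known.
        known = [self.probs[w] for w in message.lower().split() if w in self.probs]
        return sum(known) / len(known) if known else 0.5


def run_per_message(raw, messages):
    # "One executable per message": the table is re-parsed for every message.
    return [ProbabilityTable(io.StringIO(raw)).score(m) for m in messages]


def run_as_server(raw, messages):
    # "Database server" model: parse once, then answer everything from memory.
    table = ProbabilityTable(io.StringIO(raw))
    return [table.score(m) for m in messages]


# Made-up data, standing in for the real dictionary and mailbox.
RAW = json.dumps({"viagra": 0.99, "python": 0.01, "free": 0.9})
MESSAGES = ["free viagra", "python list digest", "hello there"]
```

Both functions return the same scores; the first parses the table once
per message, the second parses it once in total.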
-- 
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://cbbrowne.com/info/nonrdbms.html
"I withdraw  my claim  that rpm is  proprietary -- my  objections were
based on the documentation for the  version of rpm (2.2.6) that I used
as a  documentation source when  writing makepkg and xrpm."   
-- david parsons


