Graham's spam filter
Christopher Browne
cbbrowne at acm.org
Fri Aug 23 00:42:12 EDT 2002
In an attempt to throw the authorities off his trail, Oren Tirosh <oren-py-l at hishome.net> transmitted:
> On Thu, Aug 22, 2002 at 10:29:36AM -0600, Joseph A. Knapka wrote:
>> The analyzer takes about two minutes to go through my corpus of
>> about 2000 messages. The filter starts and loads the probability
>> dictionary in under five seconds. Doesn't seem like a non-starter
>> to me :-)
>
> For a lot of standard mail components the easiest and most robust way
> to interface to them is running an executable separately for each message.
> In this case five seconds startup time may be a bit too much for sites
> with high load.
>
>> (Of course, the user should never have to deal with
>> either program, except to configure it. The filter reads from
>> a POP3 or IMAP mailbox and writes the spam-free messages
>> either to a file or to another "sanitized" SMTP mailbox,
>> which is the one the user checks.)
>
> In this model the program is started once for multiple messages so a
> somewhat slower startup is not an issue.

This model is in effect like a "database server" model. You start up
a DBMS process once, and it loads in a bunch of data. Once in memory,
access is quick, much more so than if you have to keep reading the data
in over and over again.

Caching is not a meaningful objection to that: even if the file is
already cached, part of the cost of loading the data is in parsing
what's on disk. Not parsing the data a bunch of times is The Win.
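
To make the "load once, serve many" idea concrete, here is a rough
Python sketch. It is not Joseph's filter: the "token probability" file
format, the names parse_probabilities and FilterServer, the 0.4 default
for unknown words, and the plain averaging used for scoring are all
assumptions made up for illustration.

```python
import io


def parse_probabilities(stream):
    """Parse 'token probability' lines into a dict.

    The one-pair-per-line format is a hypothetical stand-in for whatever
    the real probability dictionary looks like on disk.
    """
    table = {}
    for line in stream:
        line = line.strip()
        if not line:
            continue
        token, prob = line.rsplit(None, 1)
        table[token] = float(prob)
    return table


class FilterServer:
    """Pays the parsing cost once, then scores any number of messages."""

    def __init__(self, stream):
        # The expensive step: done at startup only.
        self.table = parse_probabilities(stream)

    def score(self, message, unknown=0.4):
        # Simplified scoring (a plain mean of per-word probabilities),
        # used here only so the example runs; it is an assumption, not
        # the combining rule of any particular filter.
        probs = [self.table.get(word, unknown)
                 for word in message.lower().split()]
        return sum(probs) / len(probs) if probs else unknown


if __name__ == "__main__":
    server = FilterServer(io.StringIO("viagra 0.99\nhello 0.1\n"))
    for text in ("hello there", "viagra viagra hello"):
        print(text, "->", server.score(text))
```

Run as an executable per message, the parse_probabilities step would be
repeated for every single message; held in a long-lived process, it
happens once and each later score() call is only dictionary lookups.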
--
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://cbbrowne.com/info/nonrdbms.html
"I withdraw my claim that rpm is proprietary -- my objections were
based on the documentation for the version of rpm (2.2.6) that I used
as a documentation source when writing makepkg and xrpm."
-- david parsons