Graham's spam filter (was Lisp to Python translation criticism?)

Christopher Browne cbbrowne at acm.org
Sat Aug 17 18:04:09 EDT 2002


Paul Rubin <phr-n2002b at NOSPAMnightsong.com> wrote:
> Erik Max Francis <max at alcyone.com> writes:
>> One obvious and immediate issue is that for an industrial-strength
>> filter, the database gets _huge_ (Graham's basic setup involved 4000
>> messages each in the spam and nonspam corpora), and reading and writing
>> the database (even with cPickle) each time a spam message comes through
>> starts to become intensive.
>
> Why not use dbhash?  I think there's also a Python cdb wrapper somewhere.

cdb should be _really_ good for it.
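
For what it's worth, here is a rough sketch of the dbhash idea, written against the standard `dbm` package (which fills the role of the old dbhash module in current Python). The point is that each incoming message only touches the keys for its own tokens, rather than unpickling and re-pickling the whole table. The helper names (`add_tokens`, `token_count`) and the file name `spam_counts` are made up for illustration; this is not Graham's code or Ifile's format, just one way it might be done.

```python
import dbm
import os
import tempfile


def add_tokens(db, tokens):
    """Increment the on-disk count for each token (hypothetical helper)."""
    for tok in tokens:
        key = tok.encode("utf-8")
        # dbm stores bytes, so counts are kept as ASCII digits.
        count = int(db[key]) if key in db else 0
        db[key] = str(count + 1).encode("ascii")


def token_count(db, token):
    """Look up a single token's count without loading the whole table."""
    key = token.encode("utf-8")
    return int(db[key]) if key in db else 0


# Assumed location: a scratch directory, so the sketch leaves nothing behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "spam_counts")

# "c" opens the database read/write, creating it if it does not exist.
with dbm.open(path, "c") as db:
    add_tokens(db, ["viagra", "free", "free"])

# Later lookups can reopen the same file read-only.
with dbm.open(path, "r") as db:
    free_count = token_count(db, "free")
    missing_count = token_count(db, "python")
```

A cdb-style store would suit the lookup side the same way, with the difference that a cdb file is built once and then replaced wholesale rather than updated key by key.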

By the way, _my_ setup, with Ifile, involves a corpus of tens of
thousands of messages, which probably exceeds 500MB in total.

Ifile distills that down to a "corpus file" of about 7.5MB.
-- 
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://cbbrowne.com/info/spreadsheets.html
"What's wrong with 3rd party tools? Especially if they are free?  What
the **** do you think Unix is  anyway? It's a big honkin' party of 3rd
party free tools." -- Bob Cassidy (rmcassid at uci.edu)


