Graham's spam filter (was Lisp to Python translation criticism?)

Sat Aug 17 18:04:09 EDT 2002

Centuries ago, Nostradamus foresaw when Erik Max Francis <max at alcyone.com> would write:
> One obvious and immediate issue is that for an industrial-strength
> filter, the database gets _huge_ (Graham's basic setup involved 4000
> messages each in the spam and nonspam corpora), and reading and writing
> the database (even with cPickle) each time a spam message comes through
> starts to become intensive.

May I make a suggestion from the Ifile implementation?

It collects the "corpus" into a file that looks like the following:

folder1 folder2 folder3 folder4 folder5 [and so forth]
17251 14271 13710 2378 37248 [... the number of words in each folder]
42 44 35 11 92 [... the number of messages in each folder]
sex 1:2 5:37 [The word "sex" occurs in folder #1 2 times, and in #5 37
              times]
from 1:75 2:49 3:65 4:17 5:175 ...

This is a nicely compact way of representing the data as a
text file.
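
A rough sketch of reading that layout in Python, assuming the file is
exactly as described above minus the bracketed annotations, and that
folder numbers are 1-based.  The function name parse_corpus and the
returned structures are my own invention, not anything from Ifile:

```python
def parse_corpus(text):
    """Parse the corpus layout described above into plain Python structures."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    folders = lines[0].split()                          # folder names
    word_totals = [int(n) for n in lines[1].split()]    # words per folder
    message_totals = [int(n) for n in lines[2].split()] # messages per folder
    counts = {}
    for line in lines[3:]:
        word, *pairs = line.split()
        # Each pair is "folder_number:occurrences"; numbering assumed 1-based.
        counts[word] = {int(f): int(c) for f, c in (p.split(":") for p in pairs)}
    return folders, word_totals, message_totals, counts


sample = """\
folder1 folder2 folder3 folder4 folder5
17251 14271 13710 2378 37248
42 44 35 11 92
sex 1:2 5:37
from 1:75 2:49 3:65 4:17 5:175
"""
folders, word_totals, message_totals, counts = parse_corpus(sample)
```

Folders in which a word never occurs simply don't appear in its
dictionary, which mirrors the sparseness of the text format.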

The thing I did to Ifile to make it a _lot_ more efficient in
classifying messages was to start by reading the message and getting
the message's list of words.  When reading the corpus, I could then
skip doing _any_ parsing for those words that weren't in the message.
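
In Python terms, the trick looks roughly like this.  It is only a
sketch: relevant_counts is a made-up name, and it assumes the three
header lines have already been consumed so that every line starts
with a word followed by folder:count pairs:

```python
def relevant_counts(corpus_lines, message_words):
    """Parse counts only for corpus words that occur in the message."""
    wanted = set(message_words)
    found = {}
    for line in corpus_lines:
        # Cheap split: peel off the word, leave the counts unparsed.
        word, _, rest = line.partition(" ")
        if word not in wanted:
            continue  # not in the message, so skip the counts entirely
        found[word] = {int(f): int(c)
                       for f, c in (p.split(":") for p in rest.split())}
    return found


lines = ["sex 1:2 5:37", "from 1:75 2:49", "viagra 5:12"]
hits = relevant_counts(lines, ["from", "hello"])
```

The expensive part (splitting and converting the pairs) only happens
for the handful of words the message actually contains.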

What I'd do, if storing the corpus in a DBM file, would be to store
stuff like

corpus["sex"] = '1:2 5:37'
corpus["from"] = '1:75 2:49 3:65 4:17 5:175 ...'

Thus, if there are 400 words in the message, you read 400 word entries
from the corpus, parse the contents, and add stats for each folder.
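
Something along these lines, as a sketch only.  The name
folder_totals is hypothetical, "stats" here is nothing fancier than
summing raw counts per folder, and each distinct message word is
looked up once; a real classifier would do more with the numbers:

```python
from collections import Counter


def folder_totals(corpus, message_words):
    """Look up each message word in the corpus and sum counts per folder."""
    totals = Counter()
    for word in set(message_words):
        try:
            entry = corpus[word]
        except KeyError:
            continue  # word not in the corpus at all
        if isinstance(entry, bytes):
            entry = entry.decode("utf-8")  # DBM-style stores hand back bytes
        for pair in entry.split():
            folder, count = pair.split(":")
            totals[int(folder)] += int(count)
    return totals


corpus = {"sex": "1:2 5:37", "from": b"1:75 2:49 3:65 4:17 5:175"}
totals = folder_totals(corpus, ["sex", "from", "unknown"])
```

A plain dict stands in for the database here; an object from the
standard dbm module's dbm.open() could be passed instead, since it
supports the same corpus[word] lookup and returns values as bytes.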

This is about as sparse a data representation as you're going to get,
and it should be quite efficient to parse through.

I would urge considering using, as a "serial" format, the _very same
one_ as used by Ifile, as it is quite good, interoperability is almost
always a good thing, and to at least _think_ about interoperability is
a sensible thing to do.

What I'd have liked to do with Ifile, that the current form of the C
code makes challenging, is to parse email messages more intelligently
and give an explicit indication of what part of the message you're in.

Thus, if it's in the header, it would take the line:
  From: renewal at acm.org

and generate a corpus "word" that combines the header line with the
word, like:

From:renewal at acm.org 2:1 3:7

and
Subject: Your ACM membership is about to expire! Renew online today!

wouldn't simply add to the respective words, but would rather add "1"
to each of:

  Subject:your, Subject:acm, Subject:membership, Subject:is,
  Subject:about, Subject:to, Subject:expire, Subject:renew,
  Subject:online, Subject:today

Similarly, anything after the final "-- " line would be marked as
signature, so that part of the .sig on this message would get
"corpused" as the set of 'extended words':

  sig::if sig::two sig::people sig::love sig::each sig::other
  sig::there sig::can sig::be sig::no sig::happy sig::end sig::to
  sig::it sig::hemingway
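
A minimal Python sketch of that tagging, under several assumptions of
mine: the function name extended_words, the tokenizing regex, keeping
the From: value whole while splitting other headers into words,
ignoring folded header lines, and accepting "--" with its trailing
space stripped as the signature marker:

```python
import re

WORD_RE = re.compile(r"[a-z0-9']+")
WHOLE_VALUE_HEADERS = {"from"}  # assumption: keep these header values unsplit


def extended_words(message):
    """Return words tagged by the part of the message they came from."""
    head, _, body = message.partition("\n\n")
    words = []
    for line in head.splitlines():
        if line[:1].isspace():
            continue  # folded continuation lines are ignored in this sketch
        name, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if name.lower() in WHOLE_VALUE_HEADERS:
            words.append(f"{name}:{value}")
        else:
            words.extend(f"{name}:{w}" for w in WORD_RE.findall(value.lower()))
    body_lines = body.splitlines()
    sig_start = None
    for i, line in enumerate(body_lines):
        if line.rstrip() == "--":  # remember the final signature marker
            sig_start = i
    if sig_start is None:
        text_lines, sig_lines = body_lines, []
    else:
        text_lines, sig_lines = body_lines[:sig_start], body_lines[sig_start + 1:]
    words.extend(WORD_RE.findall("\n".join(text_lines).lower()))
    words.extend("sig::" + w for w in WORD_RE.findall("\n".join(sig_lines).lower()))
    return words


msg = (
    "From: renewal at acm.org\n"
    "Subject: Your ACM membership is about to expire! Renew online today!\n"
    "\n"
    "Hello there\n"
    "-- \n"
    "If two people love each other, there can be no happy end to it.\n"
    "-- Hemingway\n"
)
tagged = extended_words(msg)
```

Each tagged string is then just another "word" as far as the corpus
is concerned, so nothing else in the storage format has to change.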

The point of all this is twofold:

 1) To make people think about the fact that adding words to the
    corpus is not at all a bad thing: if you're only going to be
    reading 300-400 of them to process one message, it doesn't much
    MATTER if the corpus has 30,000 words, or 300,000 words;

 2) Giving extra "grist" for discrimination is _ALWAYS_ a good thing
    for this scheme.  If you can express that "this word was in the
    header" or that "this word was in the .signature," that is pretty
    much guaranteed to be helpful.

More folders --> Good thing.
More "sections" of message --> Good thing.
More words in corpus --> Good thing.
-- 
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://cbbrowne.com/info/ifilter.html
If two people love each other, there can be no happy end to it.
-- Hemingway