Graham's spam filter (was Lisp to Python translation criticism?)

Tue Aug 20 23:09:45 EDT 2002

Oops! "David LeBlanc" <whisper at oz.net> was seen spray-painting on a wall:
>> -----Original Message-----
>> From: python-list-admin at python.org
>> [mailto:python-list-admin at python.org]On Behalf Of Christopher Browne
>> Sent: Tuesday, August 20, 2002 17:15
>> To: python-list at python.org
>> Subject: Re: Graham's spam filter (was Lisp to Python translation
>> criticism?)
>>
>>
> <snip>
>> I'd suggest the thought of doing message header associations as
>> tokens, so that you might get, out of:
>>
>>   Subject: Re: Graham's spam filter (was Lisp to Python
>> translation criticism?)
>>
>> the set of tokens:
>> subject::re
>> subject::graham's
> <snip>
>> subject::Python
>>
>> Then do something similar with .signature material:
>>
>> signature::a
>> signature::ago
>> signature::been
> <snip>
>
> What's the advantage of this?

The advantage is that it discriminates between words in the header,
words in the body, and words in the .signature.

The whole point of the exercise is to do discrimination; the more
useful criteria there are, the better.
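
A minimal sketch of that tagging, in Python.  The function names, the
token regex, the lowercasing, and the "body" prefix are my own
assumptions for illustration; the only convention borrowed from
elsewhere is that a signature follows a line consisting of "-- ".
Folded (multi-line) headers are not handled here.

```python
import re

def prefixed_tokens(section, text):
    """Tag each word in text with the part of the message it came from."""
    # Assumed token pattern: letters, digits, "$", apostrophes and hyphens.
    words = re.findall(r"[A-Za-z0-9$'-]+", text)
    return [f"{section}::{word.lower()}" for word in words]

def tokenize_message(raw):
    """Return section-prefixed tokens for headers, body and signature."""
    # Headers end at the first blank line.
    head, _, rest = raw.partition("\n\n")
    # The signature, if any, follows a line containing only "-- ".
    body, _, signature = rest.partition("\n-- \n")
    tokens = []
    for line in head.splitlines():
        name, _, value = line.partition(":")
        tokens += prefixed_tokens(name.strip().lower(), value)
    tokens += prefixed_tokens("body", body)
    tokens += prefixed_tokens("signature", signature)
    return tokens

raw = "Subject: Re: Graham's spam filter\n\nHello there\n-- \nA sig\n"
tokens = tokenize_message(raw)
```

With that, "spam" in a Subject: line and "spam" in a .signature become
distinct tokens, each of which gets its own statistics.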

> <snip>
>
>> > One thing I don't see how to do is to add a corpus containing a new
>> > message (good or bad) to the database - i.e. update the
>> > database. Maybe Database.addGood() and Database.addBad()?
>>
>> It works a whopping lot better if there's a whopping lot more than
>> just two categories...
>
> I agree that a complete mail program should have the ability to sort
> mail into many categories and this phase of the operation is not
> where to do it.  This is a pass/fail filtration step, not a sort
> step.

Then you are essentially asking your system to work from just two
parameters:

  -> What does the "average good email" look like, and
  -> What does the "average bad email" look like.

Both of those characterize large "clouds" of entries.  For
instance:

  -> "Good" email includes notes from friends, notes from technical
      associates, and such, which have varying characteristics;

  -> "Bad" email includes messages full of "Nigerian Scam" words,
      and others that talk a lot about casinos, breast enlargement,
      alternatives to Viagra, where to buy mailing lists, and such.

If you merge the categories together, what you get is a cloudy sort of
"average."

Suppose a projection of relevance values onto the vector space of
messages looks something like:

+------------------------------------------------------------------+
|                                   Mail from           Python     +
|                                     Mom                Lists     +
|     Nigerian    Snakeoil                                         +
|      Scams                                                       +
|                                                                  +
|             + Spam Centroid                                      +
|     Casinos                                                      +
|                              School                              +
|               Credit         Alumni          + Good Mail         +
|                                                Centroid          +
|                                                                  +
|                                                                  +
|                               Brothers                           +
|                                                                  +
|                                                                  +
|                                                  DBMS Discussion |
+------------------------------------------------------------------+

(I'm pretending it makes sense to project this onto two dimensions.
In a sense, there's a dimension for each word that is considered, so
that if there are 30000 words in your dictionary, there's a _PILE_ of
dimensions!)

If everything gets "averaged," then what you have are two categories,
"good" and "bad," and whether something's "good" or "bad" depends on
which of the two centroids it lies closer to.  (Both centroids are
labeled in the diagram.)

If you have a whole whack of categories, it means you're not merely
comparing nearness to two "centroids," but rather looking for the
nearest of many centroids.  Note that the "cloud" around the 'Good
Mail Centroid' is rather large.  In fact, in this diagram, mail from
schoolmates may wind up looking as if it should be categorized as
spam.

I arbitrarily chose that; the point is that the simple "good versus
bad" split is something of an oversimplification.  You've got a lot
of statistics, and you're not using them all.
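
To make the geometry concrete, here is a rough sketch in Python.  The
2-D coordinates are made up to loosely echo the diagram, the folder
names and helper functions are hypothetical, and nearest-centroid
matching is only a stand-in for the picture above, not the Bayesian
scoring Graham's filter actually does.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Label of the centroid closest to point."""
    return min(centroids,
               key=lambda label: squared_distance(point, centroids[label]))

# Invented sample messages, two per folder.
folders = {
    "nigerian scams": [(1, 9), (2, 8)],
    "casinos": [(1, 4), (2, 5)],
    "mail from mom": [(9, 9), (10, 10)],
    "school alumni": [(5, 4), (6, 5)],
    "dbms discussion": [(13, 1), (14, 0)],
}
spam_folders = {"nigerian scams", "casinos"}

# Two categories: pool every folder into "spam" or "good", then average.
pooled = {"spam": [], "good": []}
for name, points in folders.items():
    pooled["spam" if name in spam_folders else "good"].extend(points)
two_way = {label: centroid(pts) for label, pts in pooled.items()}

# Many categories: keep one centroid per folder.
many_way = {name: centroid(pts) for name, pts in folders.items()}

message = (5, 5)  # a note from a schoolmate
two_way_label = nearest(message, two_way)
many_way_label = nearest(message, many_way)
verdict = "spam" if many_way_label in spam_folders else "good"
```

With these particular numbers, the pooled "good" centroid sits far
enough from the alumni cluster that the schoolmate's note lands
nearer the spam centroid, while the per-folder centroids put it in
"school alumni" and hence in the good pile.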

I would _definitely_ argue that having several spam folders to choose
from should be helpful, as it allows taking advantage of the fact that
(for instance) African Financial Scams have _really_ similar
characteristics, and you can be _really_ confident that you've got a
Nigerian Pyramid Scam.  That gives _greater_ certainty of appropriate
message classifications.
-- 
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/sgml.html
"The Amiga  is proof that  if you build  a better mousetrap,  the rats
will gang up on you."  -- Bill Roberts bill.roberts at ensco.com