Graham's spam filter (was Lisp to Python translation criticism?)

Sat Aug 17 16:48:07 EDT 2002

"John E. Barham" wrote:

> Nice of the spammers to be giving us so much data to work with!

Here's my implementation of Graham's statistical filter in Python.  It's
based on a Corpus class (a specialized dictionary) that processes data
(each call of the .process method should be the entire concatenated text
of a distinct message).  One builds up two corpora [had to look that one
up!] -- good and bad -- and then hands them to a Database instance,
which computes the appropriate probability table.  When you want to test
a new message, create a Corpus for it and then pass it to the database's
.scan method, which will return the computed probability of the message
being spam.

I've fiddled around with the code briefly and it doesn't look obviously
wrong, though I still lack the definitive representative samples of spam
and nonspam.

TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
FREQUENCY_THRESHHOLD = 5
GOOD_BIAS = 2.0
BAD_BIAS = 1.0
GOOD_PROB = 0.01
BAD_PROB = 0.99
UNKNOWN_PROB = 0.2

class Corpus(dict):
    def __init__(self, data=None):
        dict.__init__(self)
        self.count = 0
        if data is not None:
            self.process(data)

    def process(self, data):
        tokens = TOKEN_RE.findall(data)
        for token in tokens:
            if self.has_key(token):
                self[token] += 1
            else:
                self[token] = 1
        self.count += 1

class Database(dict):
    def __init__(self, good, bad):
        dict.__init__(self)
        self.build(good, bad)

    def build(self, good, bad):
        ngood = good.count
        nbad = bad.count
        for token in good.keys() + bad.keys(): # doubles up, but works
            if not self.has_key(token):
                g = GOOD_BIAS*good.get(token, 0)
                b = BAD_BIAS*bad.get(token, 0)
                if g + b >= FREQUENCY_THRESHHOLD:
                    goodMetric = min(1.0, g/ngood)
                    badMetric = min(1.0, b/nbad)
                    total = goodMetric + badMetric
                    prob = max(GOOD_PROB, min(BAD_PROB,badMetric/total))
                    self[token] = prob

    def scan(self, corpus):
        pairs = [(token, self.get(token, UNKNOWN_PROB)) \
                 for token in corpus.keys()]
        pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5), abs(x[1] - 0.5)))
        significant = pairs[:15]
        inverseProduct = product = 1.0
        for token, prob in significant:
            product *= prob
            inverseProduct *= 1.0 - prob
        return product/(product + inverseProduct)

One obvious and immediate issue is that for an industrial-strength
filter, the database gets _huge_ (Graham's basic setup involved 4000
messages each in the spam and nonspam corpora), and reading and writing
the database (even with cPickle) each time a spam message comes through
starts to become intensive.

-- 
 Erik Max Francis / max at alcyone.com / http://www.alcyone.com/max/
 __ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/  \ There is nothing so subject to the inconstancy of fortune as war.
\__/ Miguel de Cervantes
    Church / http://www.alcyone.com/pyos/church/
 A lambda calculus explorer in Python.