Graham's spam filter (was Lisp to Python translation criticism?)
David LeBlanc
whisper at oz.net
Tue Aug 20 18:48:05 EDT 2002
> "Erik Max Francis" wrote:
>
> Here's my implementation of Graham's statistical filter in Python. It's
> based on a Corpus class (a specialized dictionary) that processes data
> (each call of the .process method should be the entire concatenated text
> of a distinct message). One builds up two corpora [had to look that one
> up!] -- good and bad -- and then hands them to a Database instance,
> which computes the appropriate probability table. When you want to test
> a new message, create a Corpus for it and then pass it to the database's
> .scan method, which will return the computed probability of the message
> being spam.
>
> I've fiddled around with the code briefly and it doesn't look obviously
> wrong, though I still lack the definitive representative samples of spam
> and nonspam.
>
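To make the quoted description concrete: a Corpus is essentially a token-frequency dictionary. Here's a self-contained sketch of just the counting step (the sample text is invented, and `count_tokens` is a stand-in for `Corpus.process`):

```python
import re

# Same token pattern the Corpus class uses below.
TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")

def count_tokens(text):
    # What Corpus.process does: tally occurrences of each token.
    counts = {}
    for token in TOKEN_RE.findall(text):
        counts[token] = counts.get(token, 0) + 1
    return counts

counts = count_tokens("free money free offer")
# counts == {'free': 2, 'money': 1, 'offer': 1}
```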
Looking it over, I wonder whether some optimizations are possible or
desirable. One that came to mind is retaining URLs/URNs as distinct
tokens.
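For the URL idea, one approach would be to try a URL pattern before the plain word pattern in the tokenizer. The regex below is a deliberately crude sketch of the idea, not something tested against real mail:

```python
import re

# Match a whole URL first, falling back to the ordinary word pattern,
# so "http://www.example.com/buy-now" survives as one token instead of
# being split into words. The URL regex is intentionally simplistic.
URL_RE = r"(?:https?|ftp)://[^\s\"'<>]+"
WORD_RE = r"[a-zA-Z0-9'$_-]+"
TOKEN_RE = re.compile(URL_RE + "|" + WORD_RE)

tokens = TOKEN_RE.findall("Visit http://www.example.com/buy-now for free money")
# tokens == ['Visit', 'http://www.example.com/buy-now', 'for', 'free', 'money']
```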
<snip>
>
> One obvious and immediate issue is that for an industrial-strength
> filter, the database gets _huge_ (Graham's basic setup involved 4000
> messages each in the spam and nonspam corpora), and reading and writing
> the database (even with cPickle) each time a spam message comes through
> starts to become intensive.
I am going to build a version that uses Metakit, which should be good for up
to about 10 MB of messages if I read the Metakit site right.
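In the meantime, for persistence without a third-party package, the standard library's shelve module stores the table in a dbm file keyed by token, so a lookup doesn't mean unpickling the whole database. A rough sketch (the path and the token probabilities are made up):

```python
import os
import shelve
import tempfile

# Sketch: persist the probability table token-by-token with shelve, so
# a lookup doesn't require reading the whole pickled database each time.
path = os.path.join(tempfile.mkdtemp(), "probs")

store = shelve.open(path)
store["free"] = 0.99      # hypothetical spam-leaning token
store["agenda"] = 0.01    # hypothetical ham-leaning token
store.close()

store = shelve.open(path)
p = store.get("free", 0.2)   # 0.2 plays the role of UNKNOWN_PROB
store.close()
# p == 0.99
```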
One thing I don't see how to do is add a corpus containing a new message
(good or bad) to the database, i.e. update it incrementally. Maybe
Database.addGood() and Database.addBad()?
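Pending a real answer, something along these lines seems plausible: keep the raw token counts around, fold a new message's counts into the right corpus, and recompute the probability entries for the tokens that changed. A sketch with plain dicts standing in for Corpus (`merge_counts` and the sample counts are hypothetical):

```python
def merge_counts(corpus, message_counts):
    # Fold one message's {token: count} tallies into a stored corpus;
    # a Database.addGood()/addBad() would do this and then rebuild the
    # probability entries for just the affected tokens.
    for token, n in message_counts.items():
        corpus[token] = corpus.get(token, 0) + n
    return corpus

good = {"meeting": 3, "agenda": 1}
good = merge_counts(good, {"meeting": 1, "lunch": 2})
# good == {"meeting": 4, "agenda": 1, "lunch": 2}
```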
> --
> Erik Max Francis / max at alcyone.com / http://www.alcyone.com/max/
> __ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
> / \ There is nothing so subject to the inconstancy of fortune as war.
> \__/ Miguel de Cervantes
> Church / http://www.alcyone.com/pyos/church/
> A lambda calculus explorer in Python.
> --
Here's a version of your code with the minor enhancement of a primitive
driver.
import re

TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
FREQUENCY_THRESHOLD = 5
GOOD_BIAS = 2.0
BAD_BIAS = 1.0
GOOD_PROB = 0.01
BAD_PROB = 0.99
UNKNOWN_PROB = 0.2

class Corpus(dict):
    def __init__(self, data=None):
        dict.__init__(self)
        self.count = 0
        if data is not None:
            self.process(data)

    def process(self, data):
        tokens = TOKEN_RE.findall(data)
        for token in tokens:
            if self.has_key(token):
                self[token] += 1
            else:
                self[token] = 1
            self.count += 1

class Database(dict):
    def __init__(self, good, bad):
        dict.__init__(self)
        self.build(good, bad)

    def build(self, good, bad):
        ngood = good.count
        nbad = bad.count
        for token in good.keys() + bad.keys(): # doubles up, but works
            if not self.has_key(token):
                g = GOOD_BIAS*good.get(token, 0)
                b = BAD_BIAS*bad.get(token, 0)
                if g + b >= FREQUENCY_THRESHOLD:
                    goodMetric = min(1.0, g/ngood)
                    badMetric = min(1.0, b/nbad)
                    total = goodMetric + badMetric
                    prob = max(GOOD_PROB, min(BAD_PROB, badMetric/total))
                    self[token] = prob

    def scan(self, corpus):
        pairs = [(token, self.get(token, UNKNOWN_PROB))
                 for token in corpus.keys()]
        pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5), abs(x[1] - 0.5)))
        significant = pairs[:15]
        inverseProduct = product = 1.0
        for token, prob in significant:
            product *= prob
            inverseProduct *= 1.0 - prob
        return product/(product + inverseProduct)

if __name__ == '__main__':
    import os

    # Build reference corpora
    bad = Corpus()
    good = Corpus()

    # Put your own path to spam/bad files here.
    # 1 message per .txt file is assumed. Messages include headers.
    os.chdir("l:/apps/python/email/grafilt/spam")
    names = os.listdir(os.curdir)
    for name in names:
        print "bad file: ", name
        fn = open(name)
        txt = fn.readlines()
        fn.close()
        bad.process("".join(txt))

    # Put your own path to good files here.
    # 1 message per .txt file is assumed. Messages include headers.
    os.chdir("l:/apps/python/email/grafilt/good")
    names = os.listdir(os.curdir)
    for name in names:
        print "good file: ", name
        fn = open(name)
        txt = fn.readlines()
        fn.close()
        good.process("".join(txt))

    # Reference database
    db = Database(good, bad)
    for key in db.keys():
        #print "key: %s value: %s" % (key, db[key])
        pass

    # Test message
    fn = open("l:/apps/python/email/grafilt/good/"
              "ZootForum Re Read only Mode.txt")
    txt = fn.readlines()
    fn.close()
    test = Corpus("".join(txt))
    print "%.3f" % db.scan(test)
---------------------------------------------------------
With a known good message, I keep getting 0.0000... from Database.scan()
and I don't know if that's correct. With a known spam file I get 1.0.
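For what it's worth, those extremes look like the expected behavior of the combining rule rather than a bug: scan() multiplies the 15 most significant probabilities, and a product of 15 skewed numbers saturates fast. A quick check with 15 tokens at p = 0.1 each:

```python
# The combining rule from scan(): with 15 tokens each leaning "good"
# (p = 0.1), the product term is 1e-15 while the inverse term is
# 0.9**15, about 0.206, so the result is pinned near zero.
product = inverseProduct = 1.0
for prob in [0.1] * 15:
    product *= prob
    inverseProduct *= 1.0 - prob
combined = product / (product + inverseProduct)
# combined is about 4.9e-15 -- prints as 0.000 at three decimals
```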
Regards,
Dave LeBlanc
Seattle, WA USA