Graham's spam filter (was Lisp to Python translation criticism?)

Erik Max Francis max at alcyone.com
Tue Aug 20 19:41:00 EDT 2002


David LeBlanc wrote:

> Looking it over, I wonder if some optimizations aren't possible or
> desirable. One that came to mind is to retain url's/urn's as distinct
> tokens.

Yeah, that occurred to me as well.  I wrote the Graham filter code I
posted and did some basic checking to make sure it wasn't obviously
wrong, but I haven't put it into practice.  I already have a rule-based filter
(in Python) which is serving me pretty well; building up the corpora to
do the statistical filtering would be somewhat inconvenient at present.
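
For what it's worth, keeping URLs/URNs as distinct tokens could be as
simple as trying a URL pattern before the ordinary word pattern.  A
minimal sketch; the tokenize function, the TOKEN_RE name, and the list
of schemes are my assumptions and are not part of the code I posted.
The word pattern follows the token characters Graham's essay describes
(alphanumerics, dashes, apostrophes, dollar signs):

```python
import re

# Hypothetical tokenizer, not the one from the posted filter code.
# The URL alternative comes first so that a URL or URN is kept whole
# instead of being split at ':', '/', '.', '?' and so on.
TOKEN_RE = re.compile(
    r"\b(?:https?|ftp|mailto|urn):[^\s<>\"']+"  # assumed scheme list
    r"|[A-Za-z0-9$'-]+"                         # Graham-style word token
)

def tokenize(text):
    """Return the tokens of text, with URLs/URNs as single tokens."""
    return [match.group(0) for match in TOKEN_RE.finditer(text)]

tokens = tokenize("Click http://example.com/buy?id=1 now!")
```

Note that this simple pattern will swallow trailing punctuation that is
glued to a URL, so a real tokenizer would probably want to trim that.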

> One thing I don't see how to do is to add a corpus containing a new
> message (good or bad) to the database - i.e. update the database. Maybe
> Database.addGood() and Database.addBad()?

Ah, yeah, good point.  Really, the call to the .build method in
Database's constructor was just a test driver; in practice you'd keep
the good and bad corpora in attributes and run .build manually.
Then you could just add data to the corpora as needed and rebuild.
Something like [untested]:

	class Database:
	    def __init__(self, good, bad):
	        self.good = good
	        self.bad = bad
	
	    def build(self): # no arguments
	        ngood = self.good.count
	        # everything here changed to self.good and self.bad
	        ...

Then to add something to the good corpus, for example, you'd just do

	database.good.process(...)
	database.build()
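
To make that a bit more concrete, here is a self-contained sketch of
the whole update cycle.  The Corpus class below is a hypothetical
stand-in (whitespace tokenizing, a count attribute, a tokens counter),
not the class from the code I posted, and the probs attribute is my
naming.  The per-token formula is the one from Graham's essay: good
counts doubled, tokens seen fewer than five times skipped, results
clamped to [0.01, 0.99].

```python
from collections import Counter

class Corpus:
    """Hypothetical stand-in for the corpus class in the posted code."""
    def __init__(self):
        self.count = 0           # number of messages processed
        self.tokens = Counter()  # token -> number of occurrences

    def process(self, text):
        self.count += 1
        self.tokens.update(text.split())  # naive whitespace tokenizer

class Database:
    def __init__(self, good, bad):
        self.good = good
        self.bad = bad
        self.probs = {}          # token -> spam probability

    def build(self):  # no arguments; reads self.good and self.bad
        ngood = max(self.good.count, 1)
        nbad = max(self.bad.count, 1)
        self.probs = {}
        for token in set(self.good.tokens) | set(self.bad.tokens):
            g = 2 * self.good.tokens[token]  # good counts are doubled
            b = self.bad.tokens[token]
            if g + b < 5:
                continue                     # too rare to be meaningful
            gf = min(1.0, g / ngood)
            bf = min(1.0, b / nbad)
            self.probs[token] = max(0.01, min(0.99, bf / (gf + bf)))

good, bad = Corpus(), Corpus()
for _ in range(3):
    good.process("meeting agenda")
for _ in range(5):
    bad.process("free money")
database = Database(good, bad)
database.build()

# Later: add new good messages, then rebuild to pick them up.
for _ in range(3):
    database.good.process("lunch meeting")
stale = "lunch" in database.probs  # still False until build() runs
database.build()
```

The point of the design is that the probabilities are only a cache of
what is in the two corpora, so nothing changes until build is run again.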

> With a known good message, I keep getting 0.0000... from the
> Database.scan() and I don't know if that's correct. With a known spam
> file I get 1.0.

Yes, that's right.  If you scan a message that is already in the good or
bad corpus, its tokens all have probabilities clamped near 0.01 or 0.99,
and these reinforce each other strongly when combined, so the calculated
probability comes out very near zero or one.
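
You can see the effect with the combining formula from Graham's essay,
prod(p) / (prod(p) + prod(1 - p)), taken over the most interesting
tokens.  A small sketch; the combine function is my name for it, and
the inputs are made-up probabilities, not results from a real corpus:

```python
from math import prod  # Python 3.8+

def combine(probs):
    """Combine per-token spam probabilities as in Graham's essay.

    Assumes each probability is already clamped to [0.01, 0.99],
    so neither product can be exactly zero.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

known_good = combine([0.01] * 5)  # five tokens seen only in good mail
known_spam = combine([0.99] * 5)  # five tokens seen only in spam
mixed = combine([0.5, 0.5])       # uninformative tokens stay at 0.5
```

With only five such tokens the result is already within about 1e-10 of
zero or one, which prints as 0.0000 or 1.0000 at four decimal places.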

-- 
 Erik Max Francis / max at alcyone.com / http://www.alcyone.com/max/
 __ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/  \ There is nothing so subject to the inconstancy of fortune as war.
\__/ Miguel de Cervantes
    Church / http://www.alcyone.com/pyos/church/
 A lambda calculus explorer in Python.


