Graham's spam filter
David Mertz, Ph.D.
mertz at gnosis.cx
Sun Aug 18 01:31:57 EDT 2002
Erik Max Francis <max at alcyone.com> writes:
> One obvious and immediate issue is that for an industrial-strength
> filter, the database gets _huge_, and reading and writing
> the database (even with cPickle) each time a spam message comes through
> starts to become intensive.
"John E. Barham" <jbarham at jbarham.com> wrote previously:
|But I don't think that a pickled dictionary/database would be unmanageably
|huge, even w/ a large set of input messages, since the rate of growth of the
|"vocabulary" (i.e., set of tokens) would slow as more messages were input.
I wrote an application called indexer.py that is part of my
Gnosis_Utils. The utilities are at:
http://gnosis.cx/download/Gnosis_Utils-current.tar.gz
A blurb on them is at:
http://gnosis.cx/download/Gnosis_Utils.ANNOUNCE
An article on the design is at:
http://gnosis.cx/publish/programming/charming_python_15.html
There is actually a striking similarity between creating a word index
for searching and creating Graham's Bayesian model.
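To make the parallel concrete, here is a rough sketch of how the same
per-message word lists an indexer produces could feed a Graham-style
probability table. This is my paraphrase of the formulas in "A Plan for
Spam", not code from indexer.py, and the names (good_counts, bad_counts,
ngood, nbad) are just illustrative:

    #-- Sketch: word counts -> Graham-style token probabilities (my paraphrase)
    def graham_table(good_counts, bad_counts, ngood, nbad):
        """Map each token to an estimated spam probability.

        good_counts/bad_counts map token -> occurrence count in the good/spam
        corpora; ngood/nbad are the number of good/spam messages."""
        seen = {}
        seen.update(good_counts)
        seen.update(bad_counts)
        probs = {}
        for token in seen.keys():
            g = 2 * good_counts.get(token, 0)   # good occurrences count double
            b = bad_counts.get(token, 0)
            if g + b < 5:
                continue                        # too rare to be meaningful
            p = min(1.0, b / float(nbad)) / \
                (min(1.0, g / float(ngood)) + min(1.0, b / float(nbad)))
            probs[token] = max(0.01, min(0.99, p))   # clamp to (.01, .99)
        return probs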
My experience in testing the indexer is perhaps surprising. Contrary to
Barham's assumption--and my initial belief--the number of distinct words
does NOT reach an asymptote... or at least not at any scale short of
huge. The problem is that a whole lot of things look a bit like words to
a naive text splitter.
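For example (a made-up snippet, not from indexer.py), a single line of
base64/UU-style data contributes several "words" that will almost never
recur, so a naive splitter's vocabulary keeps growing roughly with the
size of the corpus:

    #-- Hypothetical illustration of vocabulary growth with a naive splitter
    import string

    lines = ["Please find the attached report",
             "R0lGODlhAQABAIAAAAAAAP kq93z+Zpp0 7Hx0//a1"]   # fake encoded junk
    vocab = {}
    for line in lines:
        for word in string.split(line):
            vocab[word] = vocab.get(word, 0) + 1
    print len(vocab), "distinct 'words'"   # the junk tokens inflate the count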
I found I was able to reduce the number of identified words quite a bit
with a few heuristics. The easiest of these is to normalize case, which
seems consistent with the principle of Graham's approach. But past that,
a number of little tweaks and kludges helped.
Feel free to download my library, and pull out whatever parts might be
useful. But for those who don't want the whole thing, this is the
relevant class (depending on your code, this could easily become a
function instead):
#-- "Split plain text into words" utility function
import string

class TextSplitter:
    def initSplitter(self):
        prenum  = string.join(map(chr, range(0, 48)), '')
        num2cap = string.join(map(chr, range(58, 65)), '')
        cap2low = string.join(map(chr, range(91, 97)), '')
        postlow = string.join(map(chr, range(123, 256)), '')
        nonword = prenum + num2cap + cap2low + postlow
        self.word_only = string.maketrans(nonword, " " * len(nonword))
        self.nondigits = string.join(map(chr, range(0, 48)) +
                                     map(chr, range(58, 255)), '')
        self.alpha = string.join(map(chr, range(65, 91)) +
                                 map(chr, range(97, 123)), '')
        self.ident = string.join(map(chr, range(256)), '')
        self.init = 1

    def splitter(self, text, ftype):
        "Split the contents of a text string into a list of 'words'"
        if ftype == 'text/plain':
            words = self.text_splitter(text, self.casesensitive)
        else:
            raise NotImplementedError
        return words

    def text_splitter(self, text, casesensitive=0):
        """Split text/plain string into a list of words

        In version 0.20 this function is still fairly weak at
        identifying "real" words, and excluding gibberish
        strings.  As long as the indexer looks at "real" text
        files, it does pretty well; but if indexing of binary
        data is attempted, a lot of gibberish gets indexed.
        Suggestions on improving this are GREATLY APPRECIATED.
        """
        # Initialize some constants
        if not hasattr(self, 'init'): self.initSplitter()
        # Speedup trick: attributes into local scope
        word_only = self.word_only
        ident = self.ident
        alpha = self.alpha
        nondigits = self.nondigits
        translate = string.translate
        # Let's adjust case if not case-sensitive
        if not casesensitive: text = string.upper(text)
        # Split the raw text
        allwords = string.split(text)
        # Finally, let's skip some words not worth indexing
        words = []
        for word in allwords:
            if len(word) > 25: continue         # too long (probably gibberish)
            # Identify common patterns in non-word data (binary, UU/MIME, etc)
            num_nonalpha = len(word.translate(ident, alpha))
            numdigits    = len(word.translate(ident, nondigits))
            # 1.52: num_nonalpha = len(translate(word, ident, alpha))
            # 1.52: numdigits = len(translate(word, ident, nondigits))
            if numdigits > len(word)-2:         # almost all digits
                if numdigits > 5:               # too many digits is gibberish
                    continue                    # a moderate number is year/zipcode/etc
            elif num_nonalpha*3 > len(word):    # too much scattered nonalpha = gibberish
                continue
            word = word.translate(word_only)    # Let's strip funny byte values
            # 1.52: word = translate(word, word_only)
            subwords = word.split()             # maybe embedded non-alphanumeric
            # 1.52: subwords = string.split(word)
            for subword in subwords:            # ...so we might have subwords
                if len(subword) <= 2: continue  # too short a subword
                words.append(subword)
        return words
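And if you would rather have a plain function than the class, a thin
wrapper along these lines would do. This is my sketch, not part of
Gnosis_Utils; note that the casesensitive attribute is normally set by
the full indexer classes, so I default it here:

    #-- Sketch of a standalone-function wrapper (not part of Gnosis_Utils)
    _splitter = TextSplitter()
    _splitter.casesensitive = 0     # normally set by the full indexer classes

    def split_words(text):
        "Return the list of indexable 'words' in a text/plain string."
        return _splitter.splitter(text, 'text/plain')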
--
---[ to our friends at TLAs (spread the word) ]--------------------------
Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego
White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad
---[ Postmodern Enterprises <mertz at gnosis.cx> ]--------------------------