Graham's spam filter

David Mertz, Ph.D. mertz at gnosis.cx
Sun Aug 18 01:31:57 EDT 2002


Erik Max Francis <max at alcyone.com> writes:
> One obvious and immediate issue is that for an industrial-strength
> filter, the database gets _huge_, and reading and writing
> the database (even with cPickle) each time a spam message comes through
> starts to become intensive.

"John E. Barham" <jbarham at jbarham.com> wrote previously:
|But I don't think that a pickled dictionary/database would be unmanageably
|huge, even w/ a large set of input messages, since the rate of growth of the
|"vocabulary" (i.e., set of tokens) would slow as more messages were input.

I wrote an application called indexer.py that is part of my
Gnosis_Utils.  The utilities are at:

    http://gnosis.cx/download/Gnosis_Utils-current.tar.gz

A blurb on them is at:

    http://gnosis.cx/download/Gnosis_Utils.ANNOUNCE

An article on the design is at:

    http://gnosis.cx/publish/programming/charming_python_15.html

There is actually a striking similarity between creating a word index
for searching and creating Graham's Bayesian model.
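
To make the similarity concrete, here is a rough sketch (modern Python,
paraphrasing the per-word formula in Graham's "A Plan for Spam" as I read
it, not code from his article) of how the same token counts an indexer
keeps can drive his model.  `good` and `bad` map each token to its
occurrence count in the ham and spam corpora; `ngood` and `nbad` are the
number of messages in each:

    def token_spam_prob(token, good, bad, ngood, nbad):
        "Per-token spam probability, roughly a la Graham (assumes nonempty corpora)"
        g = 2 * good.get(token, 0)        # Graham doubles the 'good' counts
        b = bad.get(token, 0)
        if g + b < 5:
            return None                   # too rare to be informative
        bad_ratio  = min(1.0, b / float(nbad))
        good_ratio = min(1.0, g / float(ngood))
        # Clamp to (.01, .99) so no single token is ever conclusive
        return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))

Everything before that last bit of arithmetic is just word counting, which
is exactly what an indexer already does.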

My experience in testing the indexer is perhaps surprising.  Contrary to
Barham's assumption--and my initial belief--the number of distinct words
DOES NOT reach an asymptote, or at least not at any scale short of huge.
The problem is that a whole lot of things look a bit like words to a
naive text splitter.
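
A tiny illustration (modern Python, with made-up sample messages) of the
effect: under naive whitespace splitting the vocabulary keeps growing,
because MIME boundaries, base64 lines, message-ids, and the like are
effectively unique to each message:

    messages = [
        "Meeting moved to Tuesday at noon",
        "------=_NextPart_000_01C24A7B.9D2E4F60 Content-Type: text/plain",
        "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
        "Meeting moved back to Monday at noon",
    ]
    vocab = set()
    for msg in messages:
        vocab.update(msg.lower().split())
        print(len(vocab))    # climbs with nearly every message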

I found I was able to reduce the number of identified words quite a bit
with a few heuristics.  The easiest of these is to normalize case, which
seems consistent with the principle of Graham's approach.  But past
that, a number of little tweaks and kludges helped.

Feel free to download my library, and pull out whatever parts might be
useful.  But for those who don't want the whole thing, this is the
relevant class (depending on your code, this could easily become a
function instead):

    #-- "Split plain text into words" utility function
    class TextSplitter:
        def initSplitter(self):
            # Non-alphanumeric bytes, in the four runs surrounding 0-9, A-Z, a-z
            prenum  = string.join(map(chr, range(0,48)), '')
            num2cap = string.join(map(chr, range(58,65)), '')
            cap2low = string.join(map(chr, range(91,97)), '')
            postlow = string.join(map(chr, range(123,256)), '')
            nonword = prenum + num2cap + cap2low + postlow
            # Table mapping every non-alphanumeric byte to a space
            self.word_only = string.maketrans(nonword, " "*len(nonword))
            # Deletion sets: everything except digits, and everything except letters
            self.nondigits = string.join(map(chr, range(0,48)) + map(chr, range(58,256)), '')
            self.alpha = string.join(map(chr, range(65,91)) + map(chr, range(97,123)), '')
            # Identity table (str.translate() needs a table even when only deleting)
            self.ident = string.join(map(chr, range(256)), '')
            self.init = 1

        def splitter(self, text, ftype):
            "Split the contents of a text string into a list of 'words'"
            if ftype == 'text/plain':
                words = self.text_splitter(text, self.casesensitive)
            else:
                raise NotImplementedError
            return words

        def text_splitter(self, text, casesensitive=0):
            """Split text/plain string into a list of words

            In version 0.20 this function is still fairly weak at
            identifying "real" words, and excluding gibberish
            strings.  As long as the indexer looks at "real" text
            files, it does pretty well; but if indexing of binary
            data is attempted, a lot of gibberish gets indexed.
            Suggestions on improving this are GREATLY APPRECIATED.
            """
            # Initialize some constants
            if not hasattr(self,'init'): self.initSplitter()

            # Speedup trick: attributes into local scope
            word_only = self.word_only
            ident = self.ident
            alpha = self.alpha
            nondigits = self.nondigits
            translate = string.translate

            # Let's adjust case if not case-sensitive
            if not casesensitive: text = string.upper(text)

            # Split the raw text
            allwords = string.split(text)

            # Finally, let's skip some words not worth indexing
            words = []
            for word in allwords:
                if len(word) > 25: continue         # too long (probably gibberish)

                # Identify common patterns in non-word data (binary, UU/MIME, etc)
                num_nonalpha = len(word.translate(ident, alpha))
                numdigits    = len(word.translate(ident, nondigits))
                # 1.52: num_nonalpha = len(translate(word, ident, alpha))
                # 1.52: numdigits    = len(translate(word, ident, nondigits))
                if numdigits > len(word)-2:         # almost all digits
                    if numdigits > 5:               # too many digits is gibberish
                        continue                    # a moderate number is year/zipcode/etc
                elif num_nonalpha*3 > len(word):    # too much scattered nonalpha = gibberish
                    continue

                word = word.translate(word_only)    # Let's strip funny byte values
                # 1.52: word = translate(word, word_only)
                subwords = word.split()             # maybe embedded non-alphanumeric
                # 1.52: subwords = string.split(word)
                for subword in subwords:            # ...so we might have subwords
                    if len(subword) <= 2: continue  # too short a subword
                    words.append(subword)
            return words
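
If you just want to see what it does, a standalone run looks roughly like
this (Python 2, since the class leans on the old string module; it relies
on the casesensitive default in the class above):

    if __name__ == '__main__':
        splitter = TextSplitter()
        sample = ("Graham's plan -- spam Spam SPAM 2002 20020818 "
                  "x9f$kq03@@zz http://www.paulgraham.com/spam.html")
        # Case is folded; the URL (too long), the digit run, and the line-noise
        # token are all rejected; the year survives as a "moderate" number:
        print(splitter.splitter(sample, 'text/plain'))
        # ['GRAHAM', 'PLAN', 'SPAM', 'SPAM', 'SPAM', '2002']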


--
---[ to our friends at TLAs (spread the word) ]--------------------------
Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego
White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad
---[ Postmodern Enterprises <mertz at gnosis.cx> ]--------------------------




