patch to make GBayes work with maildir and anydbm

Neale Pickett neale at woozle.org
Sat Aug 24 02:33:39 EDT 2002


Aloha.

I've been playing around with GBayes.py and classifier.py from Python
CVS.  Since I can't release the code I've already written to do this at
work, I've started over with Barry's code.  Thanks, Barry!  (And thanks,
Tim (one), too, it seems.)

Since I use Gnus to read mail, all my messages are stored one per file
in a directory.  I added a quick hack to GBayes.py to have it stat the
"good" and "spam" mailboxes; if they're directories, it assumes they're
maildir or MH (or nnmail) folders.  Diff below.
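
Distilled, the hack is just a stat on the path; something like this (my
own illustration, not the diff itself):

  from __future__ import generators
  import os
  import stat

  def rawmsgs(path):
      # Hypothetical helper: yield raw message text whether path is a
      # one-message-per-file folder (maildir/MH/nnmail) or a Unix mbox.
      if stat.S_ISDIR(os.stat(path)[stat.ST_MODE]):
          for name in os.listdir(path):
              try:
                  yield open(os.path.join(path, name)).read()
              except IOError:       # e.g. maildir's cur/new/tmp subdirs
                  continue
      else:
          import mailbox, email
          fp = open(path)
          for msg in mailbox.PortableUnixMailbox(fp, email.message_from_file):
              yield str(msg)
          fp.close()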

Additionally, I've coded up a wrapper around GBayes and classifier that
writes out to a database.  It seems to be working; I have it chugging
away on my mailboxes.  Writing is SLLOOWWWW, though: much slower than
GBayes.py, and the file it writes is about four times larger.  Here's
what I get running on my K6/333 with Linux:

  gwydion:~/src/import/spambayes$ time ./hammie.py -p poo.db -g /home/neale/Mail/inbox -s /home/neale/Mail/spam
  training with the known good messages
  done training 1641 messages
  training with the known spam messages
  done training 1066 messages

  real    25m11.140s
  user    19m12.530s
  sys     2m11.450s

Actually, I got a traceback after training because I hadn't defined
itervalues on my dictionary class (that's fixed in the version below).
But the point is, it's slow.  Here's GBayes.py on the same data:

  gwydion:~/src/import/spambayes$ time ./GBayes.py -p poo -g /home/neale/Mail/inbox -s /home/neale/Mail/spam
  training with the known good messages
  done training 1641 messages
  training with the known spam messages
  done training 1066 messages

  real    4m43.060s
  user    4m19.690s
  sys     0m4.850s


That difference is OUTRAGEOUS!  However, GBayes.py used up over 76M of
RAM while it was running, so right now I don't think it's very practical
for a multi-user system.  If it started swapping, I bet the db version
would beat it hands down.
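
For reference, the storage scheme is dead simple: anydbm can only store
strings, so each word record gets pickled on the way in and unpickled on
the way out.  Roughly this (names made up; the real code is in hammie.py
below):

  import anydbm
  import cPickle as pickle

  db = anydbm.open("/tmp/words.db", "c")     # dbhash, on my box
  record = {"hamcount": 3, "spamcount": 1}   # stand-in for a WordInfo
  db["free"] = pickle.dumps(record, 1)       # store: pickle to a string
  print pickle.loads(db["free"])             # fetch: unpickle it again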

On the other hand, reading from the dbhash is pretty quick, compared to
GBayes' gigantic pickle:

  gwydion:~/src/import/spambayes$ time ./hammie.py -p poo.db -u /home/neale/Mail/lists.mbox
  classifying the unknown
  ...
  Num messages = 52
  Good count = 52
  Spam count = 0
  Hard to tell = 0

  real    0m13.185s
  user    0m11.160s
  sys     0m0.920s
  

  gwydion:~/src/import/spambayes$ time ./GBayes.py -p poo -u /home/neale/Mail/lists.mbox
  classifying the unknown
  ...
  Num messages = 52
  Good count = 52
  Spam count = 0
  Hard to tell = 0

  real    1m36.108s
  user    1m33.240s
  sys     0m1.520s

Most of GBayes' time was spent reading in and writing out that huge
pickle.  Once the pickle was loaded, though, it looked at messages much
faster than the dbhash version, obviously because it never had to go to
disk for anything.
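
That's inherent in GBayes' persistence model: the whole classifier is
one big cPickle, slurped at startup and rewritten at exit.
Schematically (a sketch, not the actual GBayes code):

  import cPickle as pickle

  class Classifier:      # stand-in for GrahamBayes
      pass
  bayes = Classifier()

  fp = open("/tmp/poo", "wb")    # exit: rewrite everything, changed or not
  pickle.dump(bayes, fp, 1)
  fp.close()

  fp = open("/tmp/poo", "rb")    # startup: every word record back into RAM
  bayes = pickle.load(fp)
  fp.close()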

I made one other teensy optimization: a word's record is only written
back if its spamprob actually changed, and update_probabilities() is
skipped entirely when there are no new good or spam messages to look at
(see the classifier.py hunk below).  That lopped off a few seconds of
runtime.
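
(In pattern form, the classifier.py hunk boils down to this; with a
db-backed mapping every store costs a pickle plus a disk write, so skip
it when nothing moved:)

  wordinfo = {"free": 0.99}       # stand-in for the dbdict
  prob = 0.99
  if prob != wordinfo["free"]:
      wordinfo["free"] = prob     # skipped here: the value didn't change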

The size of both of these databases could be reduced drastically if MIME
were decoded before tokenizing, which is what I plan to do next; see the
sketch below.  In the meantime, here's the diff and the source to my
hammie.py.
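
For the decoding, I'm picturing something along these lines (untested;
textparts is my name, not anything in GBayes, and it leans on the email
package's walk() and get_payload(decode=1)):

  from __future__ import generators
  import email

  def textparts(msg):
      # Hypothetical helper: yield the decoded body of each text/* part,
      # so base64- and quoted-printable-encoded spam can't hide from the
      # tokenizer.  A missing Content-Type defaults to text.
      for part in msg.walk():
          if part.get_main_type("text") == "text":
              payload = part.get_payload(decode=1)   # undoes base64/QP
              if payload:
                  yield payload

  msg = email.message_from_string("Content-Type: text/plain\n\nhello")
  print "\n".join(textparts(msg))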

---8<---
Index: GBayes.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/GBayes.py,v
retrieving revision 1.12
diff -u -r1.12 GBayes.py
--- GBayes.py	23 Aug 2002 15:42:48 -0000	1.12
+++ GBayes.py	24 Aug 2002 06:28:19 -0000
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python2.2
 
 # A driver for the classifier module.  Barry Warsaw is the primary author.
 
@@ -32,13 +32,15 @@
         describe all available tokenizing functions, then exit
 When called without any options or arguments, a short self-test is run.
 """
-
+from __future__ import generators
 import sys
 import getopt
 import cPickle as pickle
 import mailbox
 import email
 import errno
+import stat
+import os
 
 from classifier import GrahamBayes
 
@@ -431,41 +433,68 @@
     # Assume Unix mailbox format
     if good:
         print 'training with the known good messages'
-        fp = open(good)
-        mbox = mailbox.PortableUnixMailbox(fp, _factory)
         i = 0
-        for msg in mbox:
-            # For now we'll take an extremely naive view of messages; we won't
-            # decode them at all, just to see what happens.  Later, we might
-            # want to uu- or base64-decode, or do other pre-processing on the
-            # message.
-            bayes.learn(tokenize(str(msg)), False, False)
-            i += 1
-            if count is not None and i > count:
-                break
-        fp.close()
+        if stat.S_ISDIR(os.stat(good)[stat.ST_MODE]):
+            mbox = os.listdir(good)
+            for msg in mbox:
+                try:
+                    bayes.learn(tokenize(open(good + "/" + msg).read()),
+                                False, False)
+                except IOError:
+                    continue
+                i += 1
+                if count is not None and i > count:
+                    break
+        else:
+            fp = open(good)
+            mbox = mailbox.PortableUnixMailbox(fp, _factory)
+            for msg in mbox:
+                # For now we'll take an extremely naive view of
+                # messages; we won't decode them at all, just to see
+                # what happens.  Later, we might want to uu- or
+                # base64-decode, or do other pre-processing on the
+                # message.
+                bayes.learn(tokenize(str(msg)), False, False)
+                i += 1
+                if count is not None and i > count:
+                    break
+            fp.close()
         save = True
         print 'done training', i, 'messages'
 
     if spam:
         print 'training with the known spam messages'
-        fp = open(spam)
-        mbox = mailbox.PortableUnixMailbox(fp, _factory)
         i = 0
-        for msg in mbox:
-            # For now we'll take an extremely naive view of messages; we won't
-            # decode them at all, just to see what happens.  Later, we might
-            # want to uu- or base64-decode, or do other pre-processing on the
-            # message.
-            bayes.learn(tokenize(str(msg)), True, False)
-            i += 1
-            if count is not None and i > count:
-                break
-        fp.close()
+        if stat.S_ISDIR(os.stat(spam)[stat.ST_MODE]):
+            mbox = os.listdir(spam)
+            for msg in mbox:
+                try:
+                    bayes.learn(tokenize(open(spam + "/" + msg).read()),
+                                True, False)
+                except IOError:
+                    continue
+                i += 1
+                if count is not None and i > count:
+                    break
+        else:
+            fp = open(spam)
+            mbox = mailbox.PortableUnixMailbox(fp, _factory)
+            for msg in mbox:
+                # For now we'll take an extremely naive view of
+                # messages; we won't decode them at all, just to see
+                # what happens.  Later, we might want to uu- or
+                # base64-decode, or do other pre-processing on the
+                # message.
+                bayes.learn(tokenize(str(msg)), True, False)
+                i += 1
+                if count is not None and i > count:
+                    break
+            fp.close()
         save = True
         print 'done training', i, 'messages'
 
-    bayes.update_probabilities()
+    if good or spam:
+        bayes.update_probabilities()
 
     if pck and save:
         fp = open(pck, 'wb')
Index: classifier.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/classifier.py,v
retrieving revision 1.1
diff -u -r1.1 classifier.py
--- classifier.py	23 Aug 2002 15:42:48 -0000	1.1
+++ classifier.py	24 Aug 2002 06:28:20 -0000
@@ -194,7 +194,8 @@
 
         nham = float(self.nham or 1)
         nspam = float(self.nspam or 1)
-        for record in self.wordinfo.itervalues():
+        for k in self.wordinfo.iteritems():
+            key, record = k
             # Compute prob(msg is spam | msg contains word).
             hamcount = HAMBIAS * record.hamcount
             spamcount = SPAMBIAS * record.spamcount
@@ -210,7 +211,9 @@
                 elif prob > MAX_SPAMPROB:
                     prob = MAX_SPAMPROB
 
-            record.spamprob = prob
+            if prob != record.spamprob:
+                record.spamprob = prob
+                self.wordinfo[key] = record
 
         if self.DEBUG:
             print 'New probabilities:'
---8<---

hammie.py:
---8<---
#! /usr/bin/python2.2

from __future__ import generators
import os
import stat
import sys
import getopt
import anydbm
import mailbox
import email
import cPickle as pickle
import classifier
# GBayes supplies the tokenizers, usage(), _factory, etc.
from GBayes import *

class dbdict:
    """A dict-ish wrapper around an anydbm file that pickles its values.

    Keys named in iterskip are hidden from iteration, so bookkeeping
    records don't get mistaken for word records.
    """

    def __init__(self, dbname, iterskip=()):
        self.hash = anydbm.open(dbname, 'c')
        self.iterskip = iterskip

    def __getitem__(self, key):
        if self.hash.has_key(key):
            return pickle.loads(self.hash[key])
        else:
            raise KeyError(key)

    def __setitem__(self, key, val):
        # anydbm values must be strings, so pickle (binary format 1)
        self.hash[key] = pickle.dumps(val, 1)

    def __delitem__(self, key):
        del self.hash[key]

    def __iter__(self, fn=None):
        # Walk the db with the bsddb-style first()/next() cursor; fn, if
        # given, maps each (key, value) pair before it's yielded.
        try:
            k = self.hash.first()
        except KeyError:
            return
        while k is not None:
            key = k[0]
            val = pickle.loads(k[1])
            if key not in self.iterskip:
                if fn:
                    yield fn((key, val))
                else:
                    yield (key, val)
            try:
                k = self.hash.next()
            except KeyError:
                break

    def __contains__(self, name):
        return self.has_key(name)

    def __getattr__(self, name):
        # Pass the buck
        return getattr(self.hash, name)

    def get(self, key, dfl=None):
        if self.has_key(key):
            return self[key]
        else:
            return dfl

    def iteritems(self):
        return self.__iter__()

    def iterkeys(self):
        return self.__iter__(lambda k: k[0])

    def itervalues(self):
        return self.__iter__(lambda k: k[1])

    
class HashingGrahamBayes(classifier.GrahamBayes):
    """A database-bound GrahamBayes classifier

    This is just like classifier.GrahamBayes, except that the dictionary
    is a database, which makes it WAY FASTER to load for classification
    (the file on disk is bigger, though).

    You can treat instantiations of this class as persistent.  On
    destruction, they write out their state to a special key.  When you
    instantiate a new one, it will attempt to read these values out of
    that key again, so you can pick up where you left off.

    """

    def __init__(self, dbname):
        classifier.GrahamBayes.__init__(self)
        # nham/nspam live in the database too, under a key that no
        # tokenizer will ever produce
        self.counterkey = "!!counters!!"
        self.wordinfo = dbdict(dbname, (self.counterkey,))
        if self.wordinfo.has_key(self.counterkey):
            self.nham, self.nspam = self.wordinfo[self.counterkey]

    def __del__(self):
        #super.__del__(self)
        # flush the counters so the next instantiation picks them up
        self.wordinfo[self.counterkey] = (self.nham, self.nspam)

def main():
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hHg:s:u:p:c:m:o:t:')
    except getopt.error, msg:
        usage(1, msg)
 
    threshold = count = good = spam = unknown = pck = mark = output = None
    tokenize = tokenize_words_foldcase
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-H':
            describe_tokenizers(tokenize)
        elif opt == '-g':
            good = arg
        elif opt == '-s':
            spam = arg
        elif opt == '-u':
            unknown = arg
        elif opt == '-t':
            tokenize = tokenizers.get(arg)
            if tokenize is None:
                usage(1, "Unrecognized tokenize function: %s" % arg)
        elif opt == '-p':
            pck = arg
        elif opt == '-c':
            count = int(arg)
        elif opt == '-m':
            threshold = float(arg)
        elif opt == '-o':
            output = arg

    if pck:
        bayes = HashingGrahamBayes(pck)
    else:
        import tempfile

        fname = tempfile.mktemp()
        bayes = HashingGrahamBayes(fname)
        # XXX: I hear tell this trick don't work under Windows.
        os.remove(fname)

    if good:
        print 'training with the known good messages'
        i = 0
        if stat.S_ISDIR(os.stat(good)[stat.ST_MODE]):
            mbox = os.listdir(good)
            for msg in mbox:
                try:
                    bayes.learn(tokenize(open(good + "/" + msg).read()),
                                False, False)
                except IOError:
                    continue
                i += 1
                if count is not None and i > count:
                    break
        else:
            fp = open(good)
            mbox = mailbox.PortableUnixMailbox(fp, _factory)
            for msg in mbox:
                # For now we'll take an extremely naive view of
                # messages; we won't decode them at all, just to see
                # what happens.  Later, we might want to uu- or
                # base64-decode, or do other pre-processing on the
                # message.
                bayes.learn(tokenize(str(msg)), False, False)
                i += 1
                if count is not None and i > count:
                    break
            fp.close()
        save = True
        print 'done training', i, 'messages'

    if spam:
        print 'training with the known spam messages'
        i = 0
        if stat.S_ISDIR(os.stat(spam)[stat.ST_MODE]):
            mbox = os.listdir(spam)
            for msg in mbox:
                try:
                    bayes.learn(tokenize(open(spam + "/" + msg).read()),
                                True, False)
                except IOError:
                    continue
                i += 1
                if count is not None and i > count:
                    break
        else:
            fp = open(spam)
            mbox = mailbox.PortableUnixMailbox(fp, _factory)
            for msg in mbox:
                # For now we'll take an extremely naive view of
                # messages; we won't decode them at all, just to see
                # what happens.  Later, we might want to uu- or
                # base64-decode, or do other pre-processing on the
                # message.
                bayes.learn(tokenize(str(msg)), True, False)
                i += 1
                if count is not None and i > count:
                    break
            fp.close()
        save = True
        print 'done training', i, 'messages'

    if good or spam:
        bayes.update_probabilities()

    if unknown:
        if output:
            output = open(output, 'w')
        print 'classifying the unknown'
        fp = open(unknown)
        mbox = mailbox.PortableUnixMailbox(fp, email.message_from_file)
        pos = 0
        allcnt = 0
        spamcnt = goodcnt = 0
        for msg in mbox:
            msgid = msg.get('message-id', '<file offset %d>' % pos)
            pos = fp.tell()
            # For now we'll take an extremely naive view of messages; we won't
            # decode them at all, just to see what happens.  Later, we might
            # want to uu- or base64-decode, or do other pre-processing on the
            # message.
            try:
                prob = bayes.spamprob(tokenize(str(msg)))
            except ValueError:
                # Sigh, bad Content-Type
                continue
            if threshold is not None and prob > threshold:
                msg['X-Bayes-Score'] = str(prob)
            print 'P(%s) =' % msgid, prob
            if output:
                print >> output, msg
            # XXX hardcode
            if prob > 0.90:
                spamcnt += 1
            if prob < 0.09:
                goodcnt += 1
            allcnt += 1
        if output:
            output.close()
        fp.close()
        print 'Num messages =', allcnt
        print 'Good count =', goodcnt
        print 'Spam count =', spamcnt
        print 'Hard to tell =', allcnt - (goodcnt + spamcnt)

if __name__ == "__main__":
    main()

---8<---
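
One nice side effect: since all the state lives in the database, a new
instantiation really does pick up where the last one left off.  Quick
sanity check (assuming the poo.db from the training run above):

  from hammie import HashingGrahamBayes

  bayes = HashingGrahamBayes("poo.db")   # counters restored from !!counters!!
  print bayes.nham, bayes.nspam          # 1641 1066, after the run above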



