patch to make GBayes work with maildir and anydbm
Neale Pickett
neale at woozle.org
Sat Aug 24 02:33:39 EDT 2002
Aloha.
I've been playing around with GBayes.py and classifier.py from python
CVS. Since I can't release the code I've already written to do this at
work, I've started over with Barry's code. Thanks, Barry! (And thanks
Tim (one) too, it seems.)
Since I use gnus to read mail, all my messages are one per file in a
directory. I added a quick hack to GBayes.py to have it stat the "good"
and "spam" mailboxes; if they're directories, then it assumes they're
maildir or mh (or nnmail) folders. Diff below.
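The detection is nothing fancier than an os.stat() on the path. A modernized Python 3 sketch of the same idea (the helper name is mine, not from the patch):

```python
import mailbox
import os
import stat

def message_sources(path):
    """Yield raw message text from either a Unix mbox file or a
    maildir/MH-style one-file-per-message directory."""
    if stat.S_ISDIR(os.stat(path)[stat.ST_MODE]):
        # Directory: treat every readable file inside as one message.
        for name in os.listdir(path):
            try:
                with open(os.path.join(path, name)) as fp:
                    yield fp.read()
            except OSError:
                continue
    else:
        # Plain file: assume Unix mbox format.
        for msg in mailbox.mbox(path):
            yield str(msg)
```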
Additionally, I've coded up a wrapper around GBayes and classifier that
writes out to a database. It seems to be working; I have it chugging
away on my mailboxes. It is SLLOOWWWW at writing. Much slower than
GBayes.py. The file it writes is about four times larger, too. Here's
what I get running on my K6/333 with Linux:
gwydion:~/src/import/spambayes$ time ./hammie.py -p poo.db -g /home/neale/Mail/inbox -s /home/neale/Mail/spam
training with the known good messages
done training 1641 messages
training with the known spam messages
done training 1066 messages
real 25m11.140s
user 19m12.530s
sys 2m11.450s
Actually, I got a traceback after training because I hadn't defined
itervalues on my dictionary class. But the point is, it's slow. Here's
GBayes.py on the same data:
gwydion:~/src/import/spambayes$ time ./GBayes.py -p poo -g /home/neale/Mail/inbox -s /home/neale/Mail/spam
training with the known good messages
done training 1641 messages
training with the known spam messages
done training 1066 messages
real 4m43.060s
user 4m19.690s
sys 0m4.850s
That difference is OUTRAGEOUS! However, GBayes.py used up over 76M of
RAM while it was running, so I don't think right now it's very practical
for a multi-user system. If it started swapping, I bet the db version
would beat it hands down.
On the other hand, reading from the dbhash is pretty quick, compared to
GBayes' gigantic pickle:
gwydion:~/src/import/spambayes$ time ./hammie.py -p poo.db -u /home/neale/Mail/lists.mbox
classifying the unknown
...
Num messages = 52
Good count = 52
Spam count = 0
Hard to tell = 0
real 0m13.185s
user 0m11.160s
sys 0m0.920s
gwydion:~/src/import/spambayes$ time ./GBayes.py -p poo -u /home/neale/Mail/lists.mbox
classifying the unknown
...
Num messages = 52
Good count = 52
Spam count = 0
Hard to tell = 0
real 1m36.108s
user 1m33.240s
sys 0m1.520s
Most of the time for GBayes was spent reading in and writing out that
huge pickle. When classifying messages, though, it was actually much
faster than the dbhash version, since it never had to go to disk for
anything.
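The trick in hammie.py's dbdict class below is just to pickle each word record on its way into the database. A minimal sketch of that idea, modernized to Python 3's dbm and pickle modules (class name is illustrative):

```python
import dbm
import pickle

class PicklingDB:
    """Dictionary-ish wrapper that pickles values into a dbm file,
    so arbitrary record objects can be stored per word."""
    def __init__(self, dbname):
        self.db = dbm.open(dbname, 'c')

    def __setitem__(self, key, val):
        # Serialize the value; dbm itself only stores bytes/str.
        self.db[key] = pickle.dumps(val)

    def __getitem__(self, key):
        return pickle.loads(self.db[key])

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default
```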
I made one other teensy optimization: the probabilities are only
refreshed if there were new good or spam messages to look at. That
lopped off a few seconds of runtime.
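Relatedly, the classifier.py diff below only writes a word record back when its probability actually changed, which skips a pickle-and-store per unchanged word. The idea, sketched with illustrative names rather than the real classifier internals:

```python
def update_record(db, key, new_prob):
    """Store new_prob for key, but only touch the database if the
    value actually changed. Returns True if a write happened."""
    record = db[key]
    if new_prob != record["spamprob"]:
        record["spamprob"] = new_prob
        db[key] = record   # write back only on change
        return True
    return False           # skipped the expensive write
```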
The size of both of these databases could be reduced drastically if MIME
were decoded, which is what I plan to do next. In the meantime, here's
a diff and the source to my hammie.py.
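For the curious, the MIME decoding I have in mind would look roughly like this (not in the patch below; this is a Python 3 sketch of walking the parts and tokenizing decoded text instead of raw base64/quoted-printable noise):

```python
import email

def decoded_text(raw):
    """Return the concatenated, decoded text/* parts of a raw message."""
    msg = email.message_from_string(raw)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != 'text':
            continue  # skip images, applications, multipart containers
        payload = part.get_payload(decode=True)  # undoes base64/qp
        if payload is not None:
            chunks.append(payload.decode('latin-1', 'replace'))
    return "\n".join(chunks)
```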
---8<---
Index: GBayes.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/GBayes.py,v
retrieving revision 1.12
diff -u -r1.12 GBayes.py
--- GBayes.py 23 Aug 2002 15:42:48 -0000 1.12
+++ GBayes.py 24 Aug 2002 06:28:19 -0000
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python2.2
# A driver for the classifier module. Barry Warsaw is the primary author.
@@ -32,13 +32,15 @@
describe all available tokenizing functions, then exit
When called without any options or arguments, a short self-test is run.
"""
-
+from __future__ import generators
import sys
import getopt
import cPickle as pickle
import mailbox
import email
import errno
+import stat
+import os
from classifier import GrahamBayes
@@ -431,41 +433,68 @@
# Assume Unix mailbox format
if good:
print 'training with the known good messages'
- fp = open(good)
- mbox = mailbox.PortableUnixMailbox(fp, _factory)
i = 0
- for msg in mbox:
- # For now we'll take an extremely naive view of messages; we won't
- # decode them at all, just to see what happens. Later, we might
- # want to uu- or base64-decode, or do other pre-processing on the
- # message.
- bayes.learn(tokenize(str(msg)), False, False)
- i += 1
- if count is not None and i > count:
- break
- fp.close()
+ if stat.S_ISDIR(os.stat(good)[stat.ST_MODE]):
+ mbox = os.listdir(good)
+ for msg in mbox:
+ try:
+ bayes.learn(tokenize(open(good + "/" + msg).read()),
+ False, False)
+ except IOError:
+ continue
+ i += 1
+ if count is not None and i > count:
+ break
+ else:
+ fp = open(good)
+ mbox = mailbox.PortableUnixMailbox(fp, _factory)
+ for msg in mbox:
+ # For now we'll take an extremely naive view of
+ # messages; we won't decode them at all, just to see
+ # what happens. Later, we might want to uu- or
+ # base64-decode, or do other pre-processing on the
+ # message.
+ bayes.learn(tokenize(str(msg)), False, False)
+ i += 1
+ if count is not None and i > count:
+ break
+ fp.close()
save = True
print 'done training', i, 'messages'
if spam:
print 'training with the known spam messages'
- fp = open(spam)
- mbox = mailbox.PortableUnixMailbox(fp, _factory)
i = 0
- for msg in mbox:
- # For now we'll take an extremely naive view of messages; we won't
- # decode them at all, just to see what happens. Later, we might
- # want to uu- or base64-decode, or do other pre-processing on the
- # message.
- bayes.learn(tokenize(str(msg)), True, False)
- i += 1
- if count is not None and i > count:
- break
- fp.close()
+ if stat.S_ISDIR(os.stat(spam)[stat.ST_MODE]):
+ mbox = os.listdir(spam)
+ for msg in mbox:
+ try:
+                    bayes.learn(tokenize(open(spam + "/" + msg).read()),
+                                True, False)
+ except IOError:
+ continue
+ i += 1
+ if count is not None and i > count:
+ break
+ else:
+ fp = open(spam)
+ mbox = mailbox.PortableUnixMailbox(fp, _factory)
+ for msg in mbox:
+ # For now we'll take an extremely naive view of
+ # messages; we won't decode them at all, just to see
+ # what happens. Later, we might want to uu- or
+ # base64-decode, or do other pre-processing on the
+ # message.
+ bayes.learn(tokenize(str(msg)), True, False)
+ i += 1
+ if count is not None and i > count:
+ break
+ fp.close()
save = True
print 'done training', i, 'messages'
- bayes.update_probabilities()
+ if good or spam:
+ bayes.update_probabilities()
if pck and save:
fp = open(pck, 'wb')
Index: classifier.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/classifier.py,v
retrieving revision 1.1
diff -u -r1.1 classifier.py
--- classifier.py 23 Aug 2002 15:42:48 -0000 1.1
+++ classifier.py 24 Aug 2002 06:28:20 -0000
@@ -194,7 +194,8 @@
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
- for record in self.wordinfo.itervalues():
+ for k in self.wordinfo.iteritems():
+ key, record = k
# Compute prob(msg is spam | msg contains word).
hamcount = HAMBIAS * record.hamcount
spamcount = SPAMBIAS * record.spamcount
@@ -210,7 +211,9 @@
elif prob > MAX_SPAMPROB:
prob = MAX_SPAMPROB
- record.spamprob = prob
+ if prob != record.spamprob:
+ record.spamprob = prob
+ self.wordinfo[key] = record
if self.DEBUG:
print 'New probabilities:'
---8<---
hammie.py:
---8<---
#! /usr/bin/python2.2
from __future__ import generators
import classifier
import anydbm
import cPickle as pickle
import sys
import getopt
from GBayes import *
class dbdict:
def __init__(self, dbname, iterskip=()):
self.hash = anydbm.open(dbname, 'c')
self.iterskip = iterskip
def __getitem__(self, key):
if self.hash.has_key(key):
return pickle.loads(self.hash[key])
else:
raise KeyError(key)
def __setitem__(self, key, val):
v = pickle.dumps(val, 1)
self.hash[key] = v
    def __delitem__(self, key):
        del self.hash[key]
def __iter__(self, fn=None):
k = self.hash.first()
        while k is not None:
key = k[0]
val = pickle.loads(k[1])
if key not in self.iterskip:
if fn:
yield fn((key, val))
else:
yield (key, val)
try:
k = self.hash.next()
except KeyError:
break
def __contains__(self, name):
return self.has_key(name)
def __getattr__(self, name):
# Pass the buck
return getattr(self.hash, name)
def get(self, key, dfl=None):
if self.has_key(key):
return self[key]
else:
return dfl
def iteritems(self):
return self.__iter__()
def iterkeys(self):
return self.__iter__(lambda k: k[0])
def itervalues(self):
return self.__iter__(lambda k: k[1])
class HashingGrahamBayes(classifier.GrahamBayes):
"""A database-bound GrahamBayes classifier
This is just like classifier.GrahamBayes, except that the dictionary
is a database. It is WAY FASTER like this. Smaller, too.
You can treat instantiations of this class as persistent. On
destruction, they write out their state to a special key. When you
instantiate a new one, it will attempt to read these values out of
that key again, so you can pick up where you left off.
"""
def __init__(self, dbname):
classifier.GrahamBayes.__init__(self)
self.counterkey = "!!counters!!"
self.wordinfo = dbdict(dbname, (self.counterkey,))
if self.wordinfo.has_key(self.counterkey):
self.nham, self.nspam = self.wordinfo[self.counterkey]
def __del__(self):
#super.__del__(self)
self.wordinfo[self.counterkey] = (self.nham, self.nspam)
def main():
try:
opts, args = getopt.getopt(sys.argv[1:], 'hHg:s:u:p:c:m:o:t:')
except getopt.error, msg:
usage(1, msg)
threshold = count = good = spam = unknown = pck = mark = output = None
tokenize = tokenize_words_foldcase
for opt, arg in opts:
if opt == '-h':
usage(0)
elif opt == '-H':
describe_tokenizers(tokenize)
elif opt == '-g':
good = arg
elif opt == '-s':
spam = arg
elif opt == '-u':
unknown = arg
elif opt == '-t':
tokenize = tokenizers.get(arg)
if tokenize is None:
usage(1, "Unrecognized tokenize function: %s" % arg)
elif opt == '-p':
pck = arg
elif opt == '-c':
count = int(arg)
elif opt == '-m':
threshold = float(arg)
elif opt == '-o':
output = arg
if pck:
bayes = HashingGrahamBayes(pck)
else:
import tempfile
fname = tempfile.mktemp()
bayes = HashingGrahamBayes(fname)
# XXX: I hear tell this trick don't work under Windows.
os.remove(fname)
if good:
print 'training with the known good messages'
i = 0
if stat.S_ISDIR(os.stat(good)[stat.ST_MODE]):
mbox = os.listdir(good)
for msg in mbox:
try:
bayes.learn(tokenize(open(good + "/" + msg).read()),
False, False)
except IOError:
continue
i += 1
if count is not None and i > count:
break
else:
fp = open(good)
mbox = mailbox.PortableUnixMailbox(fp, _factory)
for msg in mbox:
# For now we'll take an extremely naive view of
# messages; we won't decode them at all, just to see
# what happens. Later, we might want to uu- or
# base64-decode, or do other pre-processing on the
# message.
bayes.learn(tokenize(str(msg)), False, False)
i += 1
if count is not None and i > count:
break
fp.close()
save = True
print 'done training', i, 'messages'
if spam:
print 'training with the known spam messages'
i = 0
if stat.S_ISDIR(os.stat(spam)[stat.ST_MODE]):
mbox = os.listdir(spam)
for msg in mbox:
try:
                    bayes.learn(tokenize(open(spam + "/" + msg).read()),
                                True, False)
except IOError:
continue
i += 1
if count is not None and i > count:
break
else:
fp = open(spam)
mbox = mailbox.PortableUnixMailbox(fp, _factory)
for msg in mbox:
# For now we'll take an extremely naive view of
# messages; we won't decode them at all, just to see
# what happens. Later, we might want to uu- or
# base64-decode, or do other pre-processing on the
# message.
bayes.learn(tokenize(str(msg)), True, False)
i += 1
if count is not None and i > count:
break
fp.close()
save = True
print 'done training', i, 'messages'
if good or spam:
bayes.update_probabilities()
if unknown:
if output:
output = open(output, 'w')
print 'classifying the unknown'
fp = open(unknown)
mbox = mailbox.PortableUnixMailbox(fp, email.message_from_file)
pos = 0
allcnt = 0
spamcnt = goodcnt = 0
for msg in mbox:
msgid = msg.get('message-id', '<file offset %d>' % pos)
pos = fp.tell()
# For now we'll take an extremely naive view of messages; we won't
# decode them at all, just to see what happens. Later, we might
# want to uu- or base64-decode, or do other pre-processing on the
# message.
try:
prob = bayes.spamprob(tokenize(str(msg)))
except ValueError:
# Sigh, bad Content-Type
continue
if threshold is not None and prob > threshold:
msg['X-Bayes-Score'] = str(prob)
print 'P(%s) =' % msgid, prob
if output:
print >> output, msg
# XXX hardcode
if prob > 0.90:
spamcnt += 1
if prob < 0.09:
goodcnt += 1
allcnt += 1
if output:
output.close()
fp.close()
print 'Num messages =', allcnt
print 'Good count =', goodcnt
print 'Spam count =', spamcnt
print 'Hard to tell =', allcnt - (goodcnt + spamcnt)
if __name__ == "__main__":
main()
---8<---
More information about the Python-list mailing list