[Spambayes-checkins] spambayes/spambayes Corpus.py,NONE,1.1.2.1
CostCounter.py,NONE,1.1.2.1 FileCorpus.py,NONE,1.1.2.1
Histogram.py,NONE,1.1.2.1 Options.py,NONE,1.1.2.1
TestDriver.py,NONE,1.1.2.1 Tester.py,NONE,1.1.2.1
__init__.py,NONE,1.1.2.1 cdb.py,NONE,1.1.2.1 chi2.py,NONE,1.1.2.1
classifier.py,NONE,1.1.2.1 dbmstorage.py,NONE,1.1.2.1
hammie.py,NONE,1.1.2.1 hammiebulk.py,NONE,1.1.2.1
mboxutils.py,NONE,1.1.2.1 msgs.py,NONE,1.1.2.1
optimize.py,NONE,1.1.2.1 storage.py,NONE,1.1.2.1 tokenizer.py,NONE,1.1.2.1
Anthony Baxter
anthonybaxter at users.sourceforge.net
Fri Jan 10 02:41:10 EST 2003
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv9389/spambayes
Added Files:
Tag: reorg-branch
Corpus.py CostCounter.py FileCorpus.py Histogram.py Options.py
TestDriver.py Tester.py __init__.py cdb.py chi2.py
classifier.py dbmstorage.py hammie.py hammiebulk.py
mboxutils.py msgs.py optimize.py storage.py tokenizer.py
Log Message:
Checkpointing before I head home.
Still to do:
- distutils magic to make sure that the 22compat modules are
installed when needed.
- Walking through testtools and utilities and fixing imports.
- Documentation.
hammie works; everything else that people use in day-to-day operation
should work - please give it a go.
--- NEW FILE: Corpus.py ---
#! /usr/bin/env python
'''Corpus.py - Spambayes corpus management framework.
Classes:
Corpus - a collection of Messages
ExpiryCorpus - a "young" Corpus
Message - a subject of Spambayes training
MessageFactory - creates a Message
Abstract:
A corpus is defined as a set of messages that share some common
characteristic relative to spamness. Examples might be spam, ham,
unsure, untrained, or "bayes rating between .4 and .6". Corpus
is a dictionary that
is keyed by the keys of the messages within it. It is iterable,
and observable. Observers are notified when a message is added
to or removed from the corpus.
Corpus is designed to cache message objects. By default, it will
only engage in lazy creation of message objects, keeping those
objects in memory until the corpus instance itself is destroyed.
In large corpora, this could consume a large amount of memory. A
cacheSize argument on the constructor limits the *number* of
messages concurrently loaded into memory.
The instance variable that implements this cache is
Corpus.Corpus.msgs, a dictionary. Access to this variable should
be through keys(), [key], or using an iterator. Direct access
should not be used, as subclasses that manage their cache may use
this variable very differently.
Iterating Corpus objects is potentially very expensive, as each
message in the corpus will be brought into memory. For large
corpora, this could consume a lot of system resources.
ExpiryCorpus is designed to keep a corpus of file messages that
are guaranteed to be younger than a given age. The age is
specified on the constructor, as a number of seconds in the past.
If a message file was created before that point in time, the
message is deemed to be "old" and thus ignored. Access to a
message that is deemed to be old will raise KeyError, which should
be handled by the corpus user as appropriate. While iterating,
KeyError is handled by the iterator, and messages that raise
KeyError are ignored.
As messages pass their "expiration date," they are eligible for
removal from the corpus. To remove them properly,
removeExpiredMessages() should be called. As messages are removed,
observers are notified.
ExpiryCorpus functionality is mixed into a concrete Corpus through
multiple inheritance. It must appear in the base-class list before
any class that derives from Corpus. For example:
class RealCorpus(Corpus):
...
class ExpiryRealCorpus(Corpus.ExpiryCorpus, RealCorpus):
...
Messages have substance, which is the textual content of the
message. They also have a key, which uniquely defines them within
the corpus. This framework makes no assumptions about how or if
messages persist.
MessageFactory is a required factory class, because Corpus does
lazy initialization of messages and, as an abstract class, must be
told how to create concrete instances of the correct Message
subclass.
To Do:
o Suggestions?
'''
# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.
from __future__ import generators
__author__ = "Tim Stone <tim at fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, all the spambayes contributors."
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def bool(val):
return not not val
import sys # for output of docstring
import time
import re
from spambayes import tokenizer
from spambayes.Options import options
SPAM = True
HAM = False
class Corpus:
'''An observable dictionary of Messages'''
def __init__(self, factory, cacheSize=-1):
'''Constructor(MessageFactory)'''
self.msgs = {} # dict of all messages in corpus
# value is None if msg not currently loaded
self.keysInMemory = [] # keys of messages currently loaded
# this *could* be derived by iterating msgs
self.cacheSize = cacheSize # max number of messages in memory
self.observers = [] # observers of this corpus
self.factory = factory # factory for the correct Message subclass
def addObserver(self, observer):
'''Register an observer, which must implement
onAddMessage, onRemoveMessage'''
self.observers.append(observer)
def addMessage(self, message):
'''Add a Message to this corpus'''
if options.verbose:
print 'adding message %s to corpus' % (message.key())
self.cacheMessage(message)
for obs in self.observers:
# there is no reason that a Corpus observer MUST be a Trainer
# and so it may very well not be interested in AddMessage events
# even though right now the only observable events are
# training related
try:
obs.onAddMessage(message)
except AttributeError: # ignore if not implemented
pass
def removeMessage(self, message):
'''Remove a Message from this corpus'''
key = message.key()
if options.verbose:
print 'removing message %s from corpus' % (key)
self.unCacheMessage(key)
del self.msgs[key]
for obs in self.observers:
# see comments in event loop in addMessage
try:
obs.onRemoveMessage(message)
except AttributeError:
pass
def cacheMessage(self, message):
'''Add a message to the in-memory cache'''
# This method should probably not be overridden
key = message.key()
if options.verbose:
print 'placing %s in corpus cache' % (key)
self.msgs[key] = message
# Here is where we manage the in-memory cache size...
self.keysInMemory.append(key)
if self.cacheSize > 0: # performance optimization
if len(self.keysInMemory) > self.cacheSize:
keyToFlush = self.keysInMemory[0]
self.unCacheMessage(keyToFlush)
def unCacheMessage(self, key):
'''Remove a message from the in-memory cache'''
# This method should probably not be overridden
if options.verbose:
print 'Flushing %s from corpus cache' % (key)
try:
ki = self.keysInMemory.index(key)
except ValueError:
pass
else:
del self.keysInMemory[ki]
self.msgs[key] = None
def takeMessage(self, key, fromcorpus):
'''Move a Message from another corpus to this corpus'''
# XXX Hack: Calling msg.getSubstance() here ensures that the
# message substance is in memory. If it isn't, when addMessage()
# calls message.store(), which calls message.getSubstance(), that
# will try to load the substance from the as-yet-unwritten new file.
msg = fromcorpus[key]
msg.getSubstance()
fromcorpus.removeMessage(msg)
self.addMessage(msg)
def __getitem__(self, key):
'''Corpus is a dictionary'''
amsg = self.msgs[key]
if not amsg:
amsg = self.makeMessage(key) # lazy init, saves memory
self.cacheMessage(amsg)
return amsg
def keys(self):
'''Message keys in the Corpus'''
return self.msgs.keys()
def __iter__(self):
'''Corpus is iterable'''
for key in self.keys():
try:
yield self[key]
except KeyError:
pass
def __str__(self):
'''Instance as a printable string'''
return self.__repr__()
def __repr__(self):
'''Instance as a representative string'''
raise NotImplementedError
def makeMessage(self, key):
'''Call the factory to make a message'''
# This method will likely be overridden
msg = self.factory.create(key)
return msg
class ExpiryCorpus:
'''Corpus of "young" file system artifacts'''
def __init__(self, expireBefore):
'''Constructor'''
self.expireBefore = expireBefore
def removeExpiredMessages(self):
'''Kill expired messages'''
for msg in self:
if msg.createTimestamp() < time.time() - self.expireBefore:
if options.verbose:
print 'message %s has expired' % (msg.key())
self.removeMessage(msg)
class Message:
'''Abstract Message class'''
def __init__(self):
'''Constructor()'''
# The text of the message headers and body are held in attributes
# called 'hdrtxt' and 'payload', created on demand in __getattr__
# by calling load(), which should in turn call setSubstance().
# This means you don't need to remember to call load() before
# using these attributes.
def __getattr__(self, attributeName):
'''On-demand loading of the message text.'''
if attributeName in ('hdrtxt', 'payload'):
self.load()
return getattr(self, attributeName)
def load(self):
'''Method to load headers and body'''
raise NotImplementedError
def store(self):
'''Method to persist a message'''
raise NotImplementedError
def remove(self):
'''Method to obliterate a message'''
raise NotImplementedError
def __repr__(self):
'''Instance as a representative string'''
raise NotImplementedError
def __str__(self):
'''Instance as a printable string'''
return self.getSubstance()
def name(self):
'''Message may have a unique human readable name'''
return self.__repr__()
def key(self):
'''The key for this instance'''
raise NotImplementedError
def setSubstance(self, sub):
'''set this message substance'''
bodyRE = re.compile(r"\r?\n(\r?\n)(.*)", re.DOTALL | re.MULTILINE)
bmatch = bodyRE.search(sub)
if bmatch:
self.payload = bmatch.group(2)
self.hdrtxt = sub[:bmatch.start(2)]
def getSubstance(self):
'''Return this message substance'''
return self.hdrtxt + self.payload
def setSpamprob(self, prob):
'''Score of the last spamprob calc, may not be persistent'''
self.spamprob = prob
def tokenize(self):
'''Returns substance as tokens'''
return tokenizer.tokenize(self.getSubstance())
def createTimestamp(self):
'''Returns the create time of this message'''
# Should return a timestamp like time.time(); the spelling must
# match the call in ExpiryCorpus.removeExpiredMessages().
raise NotImplementedError
def getFrom(self):
'''Return the message From header content'''
if self.hdrtxt:
match = re.search(r'^From:(.*)$', self.hdrtxt, re.MULTILINE)
if match:
return match.group(1)
return None
def getSubject(self):
'''Return the message Subject header contents'''
if self.hdrtxt:
match = re.search(r'^Subject:(.*)$', self.hdrtxt, re.MULTILINE)
if match:
return match.group(1)
return None
def getDate(self):
'''Return the message Date header contents'''
if self.hdrtxt:
match = re.search(r'^Date:(.*)$', self.hdrtxt, re.MULTILINE)
if match:
return match.group(1)
return None
def getHeadersList(self):
'''Return a list of message header tuples'''
hdrregex = re.compile(r'^([A-Za-z0-9-_]*): ?(.*)$', re.MULTILINE)
# Unfold continuation lines.  Note that re.sub() takes a count, not
# a flags argument, so the fold pattern is compiled with MULTILINE.
foldregex = re.compile(r'\r?\n\r?\s', re.MULTILINE)
data = foldregex.sub(' ', self.hdrtxt)
match = hdrregex.findall(data)
return match
def getHeaders(self):
'''Return message headers as text'''
return self.hdrtxt
def getPayload(self):
'''Return the message body'''
return self.payload
def stripSBDHeader(self):
'''Removes the X-Spambayes-Disposition: header from the message'''
# This is useful for training, where a spammer may be spoofing
# our header, to make sure that our header doesn't become an
# overweight clue to hamminess
raise NotImplementedError
class MessageFactory:
'''Abstract Message Factory'''
def __init__(self):
'''Constructor()'''
pass
def create(self, key):
'''Create a message instance'''
raise NotImplementedError
if __name__ == '__main__':
print >>sys.stderr, __doc__
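The observer registration and bounded cache that the docstring describes can be sketched as a tiny standalone program. The `SimpleCorpus`, `SimpleMessage`, and `CountingObserver` names below are hypothetical, not part of this checkin; the sketch mirrors the contract of `Corpus.addObserver`/`cacheMessage` under modern Python:

```python
# Minimal sketch of the Corpus observer/cache contract (hypothetical
# names; the real classes live in spambayes.Corpus).

class SimpleMessage:
    def __init__(self, key, text):
        self._key = key
        self.text = text
    def key(self):
        return self._key

class SimpleCorpus:
    """Observable dictionary of messages with a bounded FIFO cache."""
    def __init__(self, cacheSize=-1):
        self.msgs = {}           # key -> message, or None if flushed
        self.keysInMemory = []   # keys of messages currently cached
        self.cacheSize = cacheSize
        self.observers = []
    def addObserver(self, observer):
        self.observers.append(observer)
    def addMessage(self, message):
        key = message.key()
        self.msgs[key] = message
        self.keysInMemory.append(key)
        # FIFO eviction: flush the oldest cached message but keep its key,
        # so the corpus can lazily re-create it later.
        if 0 < self.cacheSize < len(self.keysInMemory):
            oldest = self.keysInMemory.pop(0)
            self.msgs[oldest] = None
        for obs in self.observers:
            try:
                obs.onAddMessage(message)
            except AttributeError:
                pass   # observer not interested in this event

class CountingObserver:
    def __init__(self):
        self.seen = []
    def onAddMessage(self, message):
        self.seen.append(message.key())

corpus = SimpleCorpus(cacheSize=2)
watcher = CountingObserver()
corpus.addObserver(watcher)
for i in range(3):
    corpus.addMessage(SimpleMessage('MSG%05d' % i, 'body %d' % i))
```

After the loop, all three adds were observed, but only the two newest messages remain cached; the oldest key survives with a None value, ready for lazy re-creation.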
--- NEW FILE: CostCounter.py ---
from spambayes.Options import options
class CostCounter:
name = "Superclass Cost"
def __init__(self):
self.total = 0
def spam(self, scr):
pass
def ham(self, scr):
pass
def __str__(self):
return "%s: $%.4f" % (self.name, self.total)
class CompositeCostCounter:
def __init__(self,cclist):
self.clients = cclist
def spam(self, scr):
for c in self.clients:
c.spam(scr)
def ham(self, scr):
for c in self.clients:
c.ham(scr)
def __str__(self):
s = []
for c in self.clients:
s.append(str(c))
return '\n'.join(s)
class DelayedCostCounter(CompositeCostCounter):
def __init__(self,cclist):
CompositeCostCounter.__init__(self,cclist)
self.spamscr=[]
self.hamscr=[]
def spam(self, scr):
self.spamscr.append(scr)
def ham(self, scr):
self.hamscr.append(scr)
def __str__(self):
for scr in self.spamscr:
CompositeCostCounter.spam(self,scr)
for scr in self.hamscr:
CompositeCostCounter.ham(self,scr)
s=[]
for line in CompositeCostCounter.__str__(self).split('\n'):
s.append('Delayed-'+line)
return '\n'.join(s)
class CountCostCounter(CostCounter):
def __init__(self):
CostCounter.__init__(self)
self._fp = 0
self._fn = 0
self._unsure = 0
self._unsureham = 0
self._unsurespam = 0
self._spam = 0
self._ham = 0
self._correctham = 0
self._correctspam = 0
self._total = 0
def spam(self, scr):
self._total += 1
self._spam += 1
if scr < options.ham_cutoff:
self._fn += 1
elif scr < options.spam_cutoff:
self._unsure += 1
self._unsurespam += 1
else:
self._correctspam += 1
def ham(self, scr):
self._total += 1
self._ham += 1
if scr > options.spam_cutoff:
self._fp += 1
elif scr > options.ham_cutoff:
self._unsure += 1
self._unsureham += 1
else:
self._correctham += 1
def __str__(self):
return ("Total messages: %d; %d (%.1f%%) ham + %d (%.1f%%) spam\n"%(
self._total,
self._ham, zd(100.*self._ham,self._total),
self._spam, zd(100.*self._spam,self._total))+
"Ham: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fp\n"%(
self._correctham, zd(100.*self._correctham,self._ham),
self._unsureham, zd(100.*self._unsureham,self._ham),
self._fp, zd(100.*self._fp,self._ham))+
"Spam: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fn\n"%(
self._correctspam, zd(100.*self._correctspam,self._spam),
self._unsurespam, zd(100.*self._unsurespam,self._spam),
self._fn, zd(100.*self._fn,self._spam))+
"Score False: %.2f%% Unsure %.2f%%"%(
zd(100.*(self._fp+self._fn),self._total),
zd(100.*self._unsure,self._total)))
def zd(x,y):
if y > 0:
return x / y
else:
return 0
class StdCostCounter(CostCounter):
name = "Standard Cost"
def spam(self, scr):
if scr < options.ham_cutoff:
self.total += options.best_cutoff_fn_weight
elif scr < options.spam_cutoff:
self.total += options.best_cutoff_unsure_weight
def ham(self, scr):
if scr > options.spam_cutoff:
self.total += options.best_cutoff_fp_weight
elif scr > options.ham_cutoff:
self.total += options.best_cutoff_unsure_weight
class FlexCostCounter(CostCounter):
name = "Flex Cost"
def _lambda(self, scr):
if scr < options.ham_cutoff:
return 0
elif scr > options.spam_cutoff:
return 1
else:
return (scr - options.ham_cutoff) / (
options.spam_cutoff - options.ham_cutoff)
def spam(self, scr):
self.total += (1 - self._lambda(scr)) * options.best_cutoff_fn_weight
def ham(self, scr):
self.total += self._lambda(scr) * options.best_cutoff_fp_weight
class Flex2CostCounter(FlexCostCounter):
name = "Flex**2 Cost"
def spam(self, scr):
self.total += (1 - self._lambda(scr))**2 * options.best_cutoff_fn_weight
def ham(self, scr):
self.total += self._lambda(scr)**2 * options.best_cutoff_fp_weight
def default():
return CompositeCostCounter([
CountCostCounter(),
StdCostCounter(),
FlexCostCounter(),
Flex2CostCounter(),
DelayedCostCounter([
CountCostCounter(),
StdCostCounter(),
FlexCostCounter(),
Flex2CostCounter(),
])
])
def nodelay():
return CompositeCostCounter([
CountCostCounter(),
StdCostCounter(),
FlexCostCounter(),
Flex2CostCounter(),
])
if __name__=="__main__":
cc=default()
cc.ham(0)
cc.spam(1)
cc.ham(0.5)
cc.spam(0.5)
options.spam_cutoff=0.7
options.ham_cutoff=0.4
print cc
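As a sanity check on the Flex cost formula above, here is a standalone reimplementation with hard-coded cutoffs and illustrative weights in place of the Options machinery (the constant values are assumptions for the example, not the project's configured defaults):

```python
# Standalone sketch of the Flex cost calculation (hypothetical constants;
# the real counters read cutoffs and weights from spambayes.Options).
HAM_CUTOFF, SPAM_CUTOFF = 0.2, 0.9
FN_WEIGHT, FP_WEIGHT = 1.0, 10.0   # illustrative false-neg/false-pos weights

def flex_lambda(scr):
    """0 below the ham cutoff, 1 above the spam cutoff, linear between."""
    if scr < HAM_CUTOFF:
        return 0.0
    elif scr > SPAM_CUTOFF:
        return 1.0
    return (scr - HAM_CUTOFF) / (SPAM_CUTOFF - HAM_CUTOFF)

def flex_cost(score, is_spam):
    # A spam scored low costs up to the false-negative weight; a ham
    # scored high costs up to the false-positive weight.  Scores inside
    # the unsure band incur a proportional partial cost.
    if is_spam:
        return (1 - flex_lambda(score)) * FN_WEIGHT
    return flex_lambda(score) * FP_WEIGHT
```

A ham scored 0.55 sits halfway through the 0.2-0.9 unsure band, so it costs half the false-positive weight; a confidently scored message costs nothing.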
--- NEW FILE: FileCorpus.py ---
#! /usr/bin/env python
"""FileCorpus.py - Corpus composed of file system artifacts
Classes:
FileCorpus - an observable dictionary of FileMessages
ExpiryFileCorpus - a FileCorpus of young files
FileMessage - a subject of Spambayes training
FileMessageFactory - a factory to create FileMessage objects
GzipFileMessage - A FileMessage zipped for less storage
GzipFileMessageFactory - factory to create GzipFileMessage objects
Abstract:
These classes are concrete implementations of the Corpus framework.
FileCorpus is designed to manage corpora that are directories of
message files.
ExpiryFileCorpus is an ExpiryCorpus of file messages.
FileMessage manages messages that are files in the file system.
FileMessageFactory is responsible for the creation of FileMessages,
in response to requests to a corpus for messages.
GzipFileMessage and GzipFileMessageFactory are used to persist messages
as zipped files. This can save a bit of persistent storage, though the
ability of the compressor to do very much deflation is limited by the
relatively small size of the average textual message. Still, for a large
corpus, this could amount to a significant space savings.
See Corpus.__doc__ for more information.
Test harness:
FileCorpus [options]
options:
-h : show this message
-v : execute in verbose mode, useful for general understanding
and debugging purposes
-g : use GzipFileMessage and GzipFileMessageFactory
-s : setup self test, useful for seeing what is going into the
test
-t : setup and execute a self test.
-c : clean up file system after self test
Please note that running with -s or -t will create file system artifacts
in the current directory. Be sure this doesn't stomp something of
yours... The artifacts created are:
fctestmisc.bayes
fctestclass.bayes
fctestspamcorpus/MSG00001
fctestspamcorpus/MSG00002
fctestunsurecorpus/MSG00003
fctestunsurecorpus/MSG00004
fctestunsurecorpus/MSG00005
fctestunsurecorpus/MSG00006
fctesthamcorpus/
After the test has executed, the following file system artifacts
should exist:
fctestmisc.bayes
fctestclass.bayes
fctestspamcorpus/MSG00001
fctestspamcorpus/MSG00004
fctesthamcorpus/MSG00002
fctesthamcorpus/MSG00005
fctesthamcorpus/MSG00006
fctestunsurecorpus/
To Do:
o Suggestions?
"""
# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.
from __future__ import generators
__author__ = "Tim Stone <tim at fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, all the spambayes contributors."
from spambayes import Corpus
from spambayes import storage
import sys, os, gzip, fnmatch, getopt, errno, time, stat
from spambayes.Options import options
class FileCorpus(Corpus.Corpus):
def __init__(self, factory, directory, filter='*', cacheSize=250):
'''Constructor(FileMessageFactory, corpus directory name, fnmatch
filter)'''
Corpus.Corpus.__init__(self, factory, cacheSize)
self.directory = directory
self.filter = filter
# This assumes that the directory exists. A horrible death occurs
# otherwise. We *could* simply create it, but that will likely only
# mask errors
# This will not pick up any changes to the corpus that are made
# through the file system. The key list is established in __init__,
# and if anybody stores files in the directory, even if they match
# the filter, they won't make it into the key list. The same
# problem exists if anybody removes files. This *could* be a problem.
# If so, we can maybe override the keys() method to account for this,
# but there would be training side-effects... The short of it is that
# corpora that are managed by FileCorpus should *only* be managed by
# FileCorpus (at least for now). External changes that must be made
# to the corpus should for the moment be handled by a complete
# retraining.
for filename in os.listdir(directory):
if fnmatch.fnmatch(filename, filter):
self.msgs[filename] = None
def makeMessage(self, key):
'''Ask our factory to make a Message'''
msg = self.factory.create(key, self.directory)
return msg
def addMessage(self, message):
'''Add a Message to this corpus'''
if not fnmatch.fnmatch(message.key(), self.filter):
raise ValueError
if options.verbose:
print 'adding',message.key(),'to corpus'
message.directory = self.directory
message.store()
# superclass processing *MUST* be done, and done *LAST*
Corpus.Corpus.addMessage(self, message)
def removeMessage(self, message):
'''Remove a Message from this corpus'''
if options.verbose:
print 'removing',message.key(),'from corpus'
message.remove()
# superclass processing *MUST* be done, and done *LAST*
Corpus.Corpus.removeMessage(self, message)
def __repr__(self):
'''Instance as a representative string'''
nummsgs = len(self.msgs)
if nummsgs != 1:
s = 's'
else:
s = ''
if options.verbose and nummsgs > 0:
lst = ', ' + '%s' % (self.keys())
else:
lst = ''
return "<%s object at %8.8x, directory: %s, %s message%s%s>" % \
(self.__class__.__name__, \
id(self), \
self.directory, \
nummsgs, s, lst)
class ExpiryFileCorpus(Corpus.ExpiryCorpus, FileCorpus):
'''FileCorpus of "young" file system artifacts'''
def __init__(self, expireBefore, factory, directory, filter='*', cacheSize=250):
'''Constructor(FileMessageFactory, corpus directory name, fnmatch
filter)'''
Corpus.ExpiryCorpus.__init__(self, expireBefore)
FileCorpus.__init__(self, factory, directory, filter, cacheSize)
class FileMessage(Corpus.Message):
'''Message that persists as a file system artifact.'''
def __init__(self,file_name, directory):
'''Constructor(message file name, corpus directory name)'''
Corpus.Message.__init__(self)
self.file_name = file_name
self.directory = directory
# No calling of self.load() here - that's done on demand by
# Message.__getattr__.
def pathname(self):
'''Derive the pathname of the message file'''
return os.path.join(self.directory, self.file_name)
def load(self):
'''Read the Message substance from the file'''
if options.verbose:
print 'loading', self.file_name
pn = self.pathname()
try:
fp = open(pn, 'rb')
except IOError, e:
if e.errno != errno.ENOENT:
raise
else:
self.setSubstance(fp.read())
fp.close()
def store(self):
'''Write the Message substance to the file'''
if options.verbose:
print 'storing', self.file_name
pn = self.pathname()
fp = open(pn, 'wb')
fp.write(self.getSubstance())
fp.close()
def remove(self):
'''Message hara-kiri'''
if options.verbose:
print 'physically deleting file',self.pathname()
os.unlink(self.pathname())
def name(self):
'''A unique name for the message'''
return self.file_name
def key(self):
'''The key of this message in the msgs dictionary'''
return self.file_name
def __repr__(self):
'''Instance as a representative string'''
sub = self.getSubstance()
if not options.verbose and len(sub) > 40:
# abbreviate: first 20 characters, ellipsis, last 20 characters
sub = sub[:20] + '...' + sub[-20:]
pn = os.path.join(self.directory, self.file_name)
return "<%s object at %8.8x, file: %s, %s>" % \
(self.__class__.__name__, \
id(self), pn, sub)
def __str__(self):
'''Instance as a printable string'''
return self.__repr__()
def createTimestamp(self):
'''Return the create timestamp for the file'''
stats = os.stat(self.pathname())
ctime = stats[stat.ST_CTIME]
return ctime
class FileMessageFactory(Corpus.MessageFactory):
'''MessageFactory for FileMessage objects'''
def create(self, key, directory):
'''Create a message object from a filename in a directory'''
return FileMessage(key, directory)
class GzipFileMessage(FileMessage):
'''Message that persists as a zipped file system artifact.'''
def load(self):
'''Read the Message substance from the file'''
if options.verbose:
print 'loading', self.file_name
pn = self.pathname()
try:
fp = gzip.open(pn, 'rb')
except IOError, e:
if e.errno != errno.ENOENT:
raise
else:
self.setSubstance(fp.read())
fp.close()
def store(self):
'''Write the Message substance to the file'''
if options.verbose:
print 'storing', self.file_name
pn = self.pathname()
gz = gzip.open(pn, 'wb')
gz.write(self.getSubstance())
gz.flush()
gz.close()
class GzipFileMessageFactory(FileMessageFactory):
'''MessageFactory for GzipFileMessage objects'''
def create(self, key, directory):
'''Create a message object from a filename in a directory'''
return GzipFileMessage(key, directory)
def runTest(useGzip):
print 'Executing Test'
if useGzip:
fmFact = GzipFileMessageFactory()
print 'Executing with Gzipped files'
else:
fmFact = FileMessageFactory()
print 'Executing with uncompressed files'
print '\n\nCreating two Classifier databases'
miscbayes = storage.PickledClassifier('fctestmisc.bayes')
classbayes = storage.DBDictClassifier('fctestclass.bayes')
print '\n\nSetting up spam corpus'
spamcorpus = FileCorpus(fmFact, 'fctestspamcorpus')
spamtrainer = storage.SpamTrainer(miscbayes)
spamcorpus.addObserver(spamtrainer)
anotherspamtrainer = storage.SpamTrainer(classbayes, storage.UPDATEPROBS)
spamcorpus.addObserver(anotherspamtrainer)
keys = spamcorpus.keys()
keys.sort()
for key in keys: # iterate the list of keys
msg = spamcorpus[key] # corpus is a dictionary
spamtrainer.train(msg)
anotherspamtrainer.train(msg)
print '\n\nSetting up ham corpus'
hamcorpus = FileCorpus(fmFact, \
'fctesthamcorpus', \
'MSG*')
hamtrainer = storage.HamTrainer(miscbayes)
hamcorpus.addObserver(hamtrainer)
hamtrainer.trainAll(hamcorpus)
print '\n\nA couple of message related tests'
if useGzip:
fmClass = GzipFileMessage
else:
fmClass = FileMessage
m1 = fmClass('XMG00001', 'fctestspamcorpus')
m1.setSubstance(testmsg2())
print '\n\nAdd a message to hamcorpus that does not match the filter'
try:
hamcorpus.addMessage(m1)
except ValueError:
print 'Add failed, test passed'
else:
print 'Add passed, test failed'
print '\n\nThis is the hamcorpus'
print hamcorpus
print '\n\nThis is the spamcorpus'
print spamcorpus
print '\n\nSetting up unsure corpus'
# the unsure corpus is an expiry corpus with five second expiry
# and a cache size of 2 (for testing purposes only...), and
# no trainers, since there's no such thing as 'unsure training'
unsurecorpus = ExpiryFileCorpus(5, fmFact, \
'fctestunsurecorpus', 'MSG*', 2)
unsurecorpus.removeExpiredMessages()
print '\n\nIterate the unsure corpus twice, to make sure cache size \
is managed correctly, and to make sure iteration is repeatable. \
We should not see MSG00003 in this iteration.'
for msg in unsurecorpus:
print msg.key() # don't print msg, too much information
print '...and again'
for msg in unsurecorpus:
print msg.key() # don't print msg, too much information
print '\n\nRemoving expired messages from unsure corpus.'
unsurecorpus.removeExpiredMessages()
print '\n\nTrain with an individual message'
anotherhamtrainer = storage.HamTrainer(classbayes)
anotherhamtrainer.train(unsurecorpus['MSG00005'])
print '\n\nMoving msg00002 from spamcorpus to hamcorpus'
hamcorpus.takeMessage('MSG00002', spamcorpus) # Oops. made a mistake...
print "\n\nLet's test printing a message"
msg = spamcorpus['MSG00001']
print msg
print '\n\nThis is some vital information in the message'
print 'Date header is',msg.getDate()
print 'Subject header is',msg.getSubject()
print 'From header is',msg.getFrom()
print 'Header text is:',msg.getHeaders()
print 'Headers are:',msg.getHeadersList()
print 'Body is:',msg.getPayload()
print '\n\nClassifying messages in unsure corpus'
for msg in unsurecorpus:
prob = classbayes.spamprob(msg.tokenize())
print 'Message %s spam probability is %f' % (msg.key(), prob)
if prob < options.ham_cutoff:
print 'Moving %s from unsurecorpus to hamcorpus, \
based on prob of %f' % (msg.key(), prob)
hamcorpus.takeMessage(msg.key(), unsurecorpus)
elif prob > options.spam_cutoff:
print 'Moving %s from unsurecorpus to spamcorpus, \
based on prob of %f' % (msg.key(), prob)
spamcorpus.takeMessage(msg.key(), unsurecorpus)
print '\n\nThis is the new hamcorpus'
print hamcorpus
print '\n\nThis is the new spamcorpus'
print spamcorpus
print '\n\nThis is the new unsurecorpus'
print unsurecorpus
print 'unsurecorpus cache contains', unsurecorpus.keysInMemory
print 'unsurecorpus msgs dict contains', unsurecorpus.msgs
print '\n\nStoring bayes databases'
miscbayes.store()
classbayes.store()
def cleanupTest():
print 'Cleaning up'
cleanupDirectory('fctestspamcorpus')
cleanupDirectory('fctesthamcorpus')
cleanupDirectory('fctestunsurecorpus')
if not useExistingDB:
try:
os.unlink('fctestmisc.bayes')
except OSError, e:
if e.errno != errno.ENOENT:
raise
try:
os.unlink('fctestclass.bayes')
except OSError, e:
if e.errno != errno.ENOENT:
raise
def cleanupDirectory(dirname):
try:
flist = os.listdir(dirname)
except OSError, e:
if e.errno != errno.ENOENT: # directory does not exist
raise
else:
for filename in os.listdir(dirname):
fn = os.path.join(dirname, filename)
os.unlink(fn)
try:
os.rmdir(dirname)
except OSError, e:
if e.errno != errno.ENOENT:
raise
def setupTest(useGzip):
cleanupTest()
print 'Setting up test'
# no try blocks here, because if any of this dies, the test
# cannot proceed
os.mkdir('fctestspamcorpus')
os.mkdir('fctesthamcorpus')
os.mkdir('fctestunsurecorpus')
tm1 = testmsg1()
tm2 = testmsg2()
if useGzip:
fmClass = GzipFileMessage
else:
fmClass = FileMessage
m1 = fmClass('MSG00001', 'fctestspamcorpus')
m1.setSubstance(tm1)
m1.store()
m2 = fmClass('MSG00002', 'fctestspamcorpus')
m2.setSubstance(tm2)
m2.store()
m3 = fmClass('MSG00003', 'fctestunsurecorpus')
m3.setSubstance(tm1)
m3.store()
for x in range(11):
time.sleep(1) # make sure MSG00003 has expired
if 10-x == 1:
s = ''
else:
s = 's'
print 'wait',10-x,'more second%s' % (s)
m4 = fmClass('MSG00004', 'fctestunsurecorpus')
m4.setSubstance(tm1)
m4.store()
m5 = fmClass('MSG00005', 'fctestunsurecorpus')
m5.setSubstance(tm2)
m5.store()
m6 = fmClass('MSG00006', 'fctestunsurecorpus')
m6.setSubstance(tm2)
m6.store()
def testmsg1():
return """
X-Hd:skip at pobox.com Mon Nov 04 10:50:49 2002
Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
1998/05/13) Mon Nov 4 08:50:58 2002)
X-From_:skip at mojam.com Mon Nov 4 08:49:03 2002
Return-Path:<skip at mojam.com>
Delivered-To:timstone at mail.powweb.com
Received:from manatee.mojam.com (manatee.mojam.com [199.249.165.175]) by
mail.powweb.com (Postfix) with ESMTP id DC95A1BB1D0 for
<tim at fourstonesExpressions.com>; Mon, 4 Nov 2002 08:49:02 -0800 (PST)
Received:from montanaro.dyndns.org (12-248-11-90.client.attbi.com
[12.248.11.90]) by manatee.mojam.com (8.12.1/8.12.1) with ESMTP id
gA4Gn0oY029655 for <tim at fourstonesExpressions.com>; Mon, 4 Nov 2002 10:49:00
-0600
Received:from montanaro.dyndns.org (localhost [127.0.0.1]) by
montanaro.dyndns.org (8.12.2/8.12.2) with ESMTP id gA4Gn3cP015572 for
<tim at fourstonesExpressions.com>; Mon, 4 Nov 2002 10:49:03 -0600 (CST)
Received:(from skip at localhost) by montanaro.dyndns.org (8.12.2/8.12.2/Submit)
id gA4Gn37l015569; Mon, 4 Nov 2002 10:49:03 -0600 (CST)
From:Skip Montanaro <skip at pobox.com>
MIME-Version:1.0
Content-Type:text/plain; charset=us-ascii
Content- Transfer- Encoding:7bit
Message-ID:<15814.42238.882013.702030 at montanaro.dyndns.org>
Date:Mon, 4 Nov 2002 10:49:02 -0600
To:Four Stones Expressions <tim at fourstonesExpressions.com>
Subject:Reformat mail to 80 columns?
In-Reply-To:<QOIDLHRPNK62FBRPA9SM54US7504UR65.3dc5eed1 at riven>
References:<8285NLPL5YTTQJGXTAXU3WA8OB2.3dc5e3cc at riven>
<QOIDLHRPNK62FBRPA9SM54US7504UR65.3dc5eed1 at riven>
X-Mailer:VM 7.07 under 21.5 (beta9) "brussels sprouts" XEmacs Lucid
Reply-To:skip at pobox.com
X-Hammie- Disposition:Unsure
11/4/2002 10:49:02 AM, Skip Montanaro <skip at pobox.com> wrote:
>(off-list)
>
>Tim,
>
>Any chance you can easily generate messages to the spambayes list which wrap
>at something between 70 and 78 columns? I find I have to always edit your
>messages to read them easily.
>
>Thanks,
>
>--
>Skip Montanaro - skip at pobox.com
>http://www.mojam.com/
>http://www.musi-cal.com/
>
>
- Tim
www.fourstonesExpressions.com """
def testmsg2():
return """
X-Hd:richie at entrian.com Wed Nov 06 12:05:41 2002
Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
1998/05/13) Wed Nov 6 10:05:45 2002)
X-From_:richie at entrian.com Wed Nov 6 10:05:33 2002
Return-Path:<richie at entrian.com>
Delivered-To:timstone at mail.powweb.com
Received:from anchor-post-31.mail.demon.net (anchor-post-31.mail.demon.net
[194.217.242.89]) by mail.powweb.com (Postfix) with ESMTP id 3DC431BB06A for
<tim at fourstonesexpressions.com>; Wed, 6 Nov 2002 10:05:33 -0800 (PST)
Received:from sundog.demon.co.uk ([158.152.226.183]) by
anchor-post-31.mail.demon.net with smtp (Exim 3.35 #1) id 189UYP-000IAw-0V for
tim at fourstonesExpressions.com; Wed, 06 Nov 2002 18:05:25 +0000
From:Richie Hindle <richie at entrian.com>
To:tim at fourstonesExpressions.com
Subject:Re: What to call this training stuff
Date:Wed, 06 Nov 2002 18:05:56 +0000
Organization:entrian.com
Reply-To:richie at entrian.com
Message-ID:<d0hisugn3nau4m704kotgpd4jlt33rvrda at 4ax.com>
References:<IFWRHE041VTXW72JGDBD0RTS04YTGE.3dc933a1 at riven>
In-Reply-To:<IFWRHE041VTXW72JGDBD0RTS04YTGE.3dc933a1 at riven>
X-Mailer:Forte Agent 1.7/32.534
MIME-Version:1.0
Content-Type:text/plain; charset=us-ascii
Content-Transfer-Encoding:7bit
X-Hammie-Disposition:Unsure
Hi Tim,
> Richie, I think we should package these classes I've been writing as
> 'corpusManagement.py' What we're really doing here is creating a set of
tools
> that can be used to manage corpi (?) corpusses (?) corpae (?) whatever...
of
> messages.
Good plan. Minor point of style: mixed-case module names (like class
names) tend to have an initial capital: CorpusManagement.py
On the name... sorry to disagree about names again, but what does the word
'management' add? This is a module for manipulating corpuses, so I reckon
it should be called Corpus. Like Cookie, gzip, zipfile, locale, mailbox...
see what I mean?
--
Richie Hindle
richie at entrian.com"""
if __name__ == '__main__':
try:
opts, args = getopt.getopt(sys.argv[1:], 'estgvhcu')
except getopt.error, msg:
print >>sys.stderr, str(msg) + '\n\n' + __doc__
sys.exit()
options.verbose = False
runTestServer = False
setupTestServer = False
cleanupTestServer = False
useGzip = False
useExistingDB = False
for opt, arg in opts:
if opt == '-h':
print >>sys.stderr, __doc__
sys.exit()
elif opt == '-s':
setupTestServer = True
elif opt == '-e':
runTestServer = True
elif opt == '-t':
setupTestServer = True
runTestServer = True
elif opt == '-c':
cleanupTestServer = True
elif opt == '-v':
options.verbose = True
elif opt == '-g':
useGzip = True
elif opt == '-u':
useExistingDB = True
if setupTestServer:
setupTest(useGzip)
if runTestServer:
runTest(useGzip)
elif cleanupTestServer:
cleanupTest()
else:
print >>sys.stderr, __doc__
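The flag handling above can be isolated into a small helper. This is a Python 3 sketch of the same getopt pattern; `parse_flags` is an illustrative name, not part of spambayes:

```python
import getopt

def parse_flags(argv):
    """Sketch of the test driver's flag handling: each single-letter
    option toggles a boolean, and -t is shorthand for -s plus -e."""
    opts, args = getopt.getopt(argv, 'estgvhcu')
    flags = {'setup': False, 'run': False, 'cleanup': False,
             'verbose': False, 'gzip': False, 'existing_db': False}
    for opt, _ in opts:
        if opt == '-s':
            flags['setup'] = True
        elif opt == '-e':
            flags['run'] = True
        elif opt == '-t':
            # -t = set up the test server, then run it
            flags['setup'] = flags['run'] = True
        elif opt == '-c':
            flags['cleanup'] = True
        elif opt == '-v':
            flags['verbose'] = True
        elif opt == '-g':
            flags['gzip'] = True
        elif opt == '-u':
            flags['existing_db'] = True
    return flags
```

For example, `parse_flags(['-t', '-g'])` turns on setup, run, and gzip while leaving the other flags off.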
--- NEW FILE: Histogram.py ---
#! /usr/bin/env python
import math
from spambayes.Options import options
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
class Hist:
"""Simple histograms of float values."""
# Pass None for lo and hi and it will automatically adjust to the min
# and max values seen.
# Note: nbuckets can be passed for backward compatibility. The
# display() method can be passed a different nbuckets value.
def __init__(self, nbuckets=options.nbuckets, lo=0.0, hi=100.0):
self.lo, self.hi = lo, hi
self.nbuckets = nbuckets
self.buckets = [0] * nbuckets
self.data = [] # the raw data points
self.stats_uptodate = False
# Add a value to the collection.
def add(self, x):
self.data.append(x)
self.stats_uptodate = False
# Compute, and set as instance attrs:
# n # of data points
# The rest are set iff n>0:
# min smallest value in collection
# max largest value in collection
# median midpoint
# mean
# pct list of (percentile, score) pairs
# var variance
# sdev population standard deviation (sqrt(variance))
# self.data is also sorted.
def compute_stats(self):
if self.stats_uptodate:
return
self.stats_uptodate = True
data = self.data
n = self.n = len(data)
if n == 0:
return
data.sort()
self.min = data[0]
self.max = data[-1]
if n & 1:
self.median = data[n // 2]
else:
self.median = (data[n // 2] + data[(n-1) // 2]) / 2.0
# Compute mean.
# Add in increasing order of magnitude, to minimize roundoff error.
if data[0] < 0.0:
temp = [(abs(x), x) for x in data]
temp.sort()
data = [x[1] for x in temp]
del temp
sum = 0.0
for x in data:
sum += x
mean = self.mean = sum / n
# Compute variance.
var = 0.0
for x in data:
d = x - mean
var += d*d
self.var = var / n
self.sdev = math.sqrt(self.var)
# Compute percentiles.
self.pct = pct = []
for p in options.percentiles:
assert 0.0 <= p <= 100.0
# In going from data index 0 to index n-1, we move n-1 times.
# p% of that is (n-1)*p/100.
i = (n-1)*p/1e2
if i < 0:
# Just return the smallest.
score = data[0]
else:
whole = int(i)
frac = i - whole
score = data[whole]
if whole < n-1 and frac:
# Move frac of the way from this score to the next.
score += frac * (data[whole + 1] - score)
pct.append((p, score))
# Merge other into self.
def __iadd__(self, other):
self.data.extend(other.data)
self.stats_uptodate = False
return self
def get_lo_hi(self):
self.compute_stats()
lo, hi = self.lo, self.hi
if lo is None:
lo = self.min
if hi is None:
hi = self.max
return lo, hi
def get_bucketwidth(self):
lo, hi = self.get_lo_hi()
span = float(hi - lo)
return span / self.nbuckets
# Set instance var nbuckets to the # of buckets, and buckets to a list
# of nbuckets counts.
def fill_buckets(self, nbuckets=None):
if nbuckets is None:
nbuckets = self.nbuckets
if nbuckets <= 0:
raise ValueError("nbuckets %g > 0 required" % nbuckets)
self.nbuckets = nbuckets
self.buckets = buckets = [0] * nbuckets
# Compute bucket counts.
lo, hi = self.get_lo_hi()
bucketwidth = self.get_bucketwidth()
for x in self.data:
i = int((x - lo) / bucketwidth)
if i >= nbuckets:
i = nbuckets - 1
elif i < 0:
i = 0
buckets[i] += 1
# Print a histogram to stdout.
# Also sets instance var nbuckets to the # of buckets, and
# buckets to a list of nbuckets counts, but only if at least one
# data point is in the collection.
def display(self, nbuckets=None, WIDTH=61):
if nbuckets is None:
nbuckets = self.nbuckets
if nbuckets <= 0:
raise ValueError("nbuckets %g > 0 required" % nbuckets)
self.compute_stats()
n = self.n
if n == 0:
return
print "%d items; mean %.2f; sdev %.2f" % (n, self.mean, self.sdev)
print "-> <stat> min %g; median %g; max %g" % (self.min,
self.median,
self.max)
pcts = ['%g%% %g' % x for x in self.pct]
print "-> <stat> percentiles:", '; '.join(pcts)
lo, hi = self.get_lo_hi()
if lo > hi:
return
# hunit is how many items a * represents. A * is printed for
# each hunit items, plus any non-zero fraction thereof.
self.fill_buckets(nbuckets)
biggest = max(self.buckets)
hunit, r = divmod(biggest, WIDTH)
if r:
hunit += 1
print "* =", hunit, "items"
# We need ndigits decimal digits to display the largest bucket count.
ndigits = len(str(biggest))
# Displaying the bucket boundaries is more troublesome.
bucketwidth = self.get_bucketwidth()
whole_digits = max(len(str(int(lo))),
len(str(int(hi - bucketwidth))))
frac_digits = 0
while bucketwidth < 1.0:
# Incrementing by bucketwidth may not change the last displayed
# digit, so display one more.
frac_digits += 1
bucketwidth *= 10.0
format = ("%" + str(whole_digits + 1 + frac_digits) + '.' +
str(frac_digits) + 'f %' + str(ndigits) + "d")
bucketwidth = self.get_bucketwidth()
for i in range(nbuckets):
n = self.buckets[i]
print format % (lo + i * bucketwidth, n),
print '*' * ((n + hunit - 1) // hunit)
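The percentile computation in Hist.compute_stats uses linear interpolation between adjacent sorted data points. The same arithmetic can be isolated into a few lines; this is a Python 3 sketch (the standalone `percentile` helper is not part of spambayes):

```python
def percentile(data, p):
    """Linear-interpolation percentile, mirroring Hist.compute_stats.
    data must already be sorted; p is in [0, 100]."""
    n = len(data)
    # In going from index 0 to index n-1, we move n-1 times; p% of that:
    i = (n - 1) * p / 100.0
    whole = int(i)
    frac = i - whole
    score = data[whole]
    if whole < n - 1 and frac:
        # Move frac of the way from this score to the next one.
        score += frac * (data[whole + 1] - score)
    return score
```

So `percentile([0, 10], 50)` lands halfway between the two points, giving 5.0.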
--- NEW FILE: Options.py ---
# Options.options is a globally shared options object.
# XXX As this code is, option names must be unique across ini sections,
# XXX and must not conflict with OptionsClass method names.
import sys, os
import StringIO
import ConfigParser
from sets import Set
try:
True, False, bool
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def bool(val):
return not not val
__all__ = ['options']
defaults = """
[Tokenizer]
# If true, tokenizer.Tokenizer.tokenize_headers() will tokenize the
# contents of each header field just like the text of the message
# body, using the name of the header as a tag. Tokens look like
# "header:word". The basic approach is simple and effective, but also
# very sensitive to biases in the ham and spam collections. For
# example, if the ham and spam were collected at different times,
# several headers with date/time information will become the best
# discriminators. (Not just Date, but Received and X-From_.)
basic_header_tokenize: False
# If true and basic_header_tokenize is also true, then
# basic_header_tokenize is the only action performed.
basic_header_tokenize_only: False
# If basic_header_tokenize is true, then basic_header_skip is a set of
# headers that should be skipped.
basic_header_skip: received
date
x-.*
# If true, the first few characters of application/octet-stream sections
# are used, undecoded. What 'few' means is decided by octet_prefix_size.
check_octets: False
octet_prefix_size: 5
# Generate tokens just counting the number of instances of each kind of
# header line, in a case-sensitive way.
#
# Depending on data collection, some headers are not safe to count.
# For example, if ham is collected from a mailing list but spam from your
# regular inbox traffic, the presence of a header like List-Info will be a
# very strong ham clue, but a bogus one. In that case, set
# count_all_header_lines to False, and adjust safe_headers instead.
count_all_header_lines: False
# When True, generate a "noheader:HEADERNAME" token for each header in
# safe_headers (below) that *doesn't* appear in the headers. This helped
# in various of Tim's python.org tests, but appeared to hurt a little in
# Anthony Baxter's tests.
record_header_absence: False
# Like count_all_header_lines, but restricted to headers in this list.
# safe_headers is ignored when count_all_header_lines is true, unless
# record_header_absence is also true.
safe_headers: abuse-reports-to
date
errors-to
from
importance
in-reply-to
message-id
mime-version
organization
received
reply-to
return-path
subject
to
user-agent
x-abuse-info
x-complaints-to
x-face
# A lot of clues can be gotten from IP addresses and names in Received:
# headers. Again, this can give spectacular results for bogus reasons
# if your test corpora are from different sources; otherwise, set this to true.
mine_received_headers: False
# Mine the following address headers. If you have mixed source corpuses
# (as opposed to a mixed sauce walrus, which is delicious!) then you
# probably don't want to use 'to' or 'cc'.
# Address headers will be decoded, and will generate charset tokens as
# well as the real address.
# others to consider: to, cc, reply-to, errors-to, sender, ...
address_headers: from
# If legitimate mail contains things that look like text to the tokenizer,
# turning off this option may help (perhaps binary attachments get
# 'defanged' by something upstream of this operation and thus look like
# text); if so, that should be an alert that perhaps the tokenizer is
# broken.
generate_long_skips: True
# Try to capitalize on mail sent to multiple similar addresses.
summarize_email_prefixes: False
#
# Length of words that triggers 'long skips'. Longer than this
# triggers a skip.
#
skip_max_word_size: 12
# Generate tokens which resemble the posting time in 10-minute buckets:
# 'time:' hour ':' minute//10
generate_time_buckets: False
# Extract day of the week tokens from the Date: header.
extract_dow: False
# If true, replace high-bit characters (ord(c) >= 128) and control characters
# with question marks. This allows non-ASCII character strings to be
# identified with little training and small database burden. It's appropriate
# only if your ham is plain 7-bit ASCII, or nearly so, so that the mere
# presence of non-ASCII character strings is known in advance to be a strong
# spam indicator.
replace_nonascii_chars: False
[TestDriver]
# These control various displays in class TestDriver.Driver, and Tester.Test.
# spam_cutoff and ham_cutoff are used in Python slice sense:
# A msg is considered ham if its score is in 0:ham_cutoff
# A msg is considered unsure if its score is in ham_cutoff:spam_cutoff
# A msg is considered spam if its score is in spam_cutoff:
#
# So it's unsure iff ham_cutoff <= score < spam_cutoff.
# For a binary classifier, make ham_cutoff == spam_cutoff.
# ham_cutoff > spam_cutoff doesn't make sense.
#
# The defaults here (.2 and .9) may be appropriate for the default chi-
# combining scheme. Cutoffs for chi-combining typically aren't touchy,
# provided you're willing to settle for "really good" instead of "optimal".
# Tim found that .3 and .8 worked very well for well-trained systems on
# his personal email, and his large comp.lang.python test. If just beginning
# training, or extremely fearful of mistakes, 0.05 and 0.95 may be more
# appropriate for you.
#
# Picking good values for gary-combining is much harder, and appears to be
# corpus-dependent, and within a single corpus dependent on how much
# training has been done. Values from 0.50 thru the low 0.60's have been
# reported to work best by various testers on their data.
ham_cutoff: 0.20
spam_cutoff: 0.90
# Number of buckets in histograms.
nbuckets: 200
show_histograms: True
# After the display of a ham+spam histogram pair, you can get a listing of
# all the cutoff values (coinciding with histogram bucket boundaries) that
# minimize
#
# best_cutoff_fp_weight * (# false positives) +
# best_cutoff_fn_weight * (# false negatives) +
# best_cutoff_unsure_weight * (# unsure msgs)
#
# This displays two cutoffs: hamc and spamc, where
#
# 0.0 <= hamc <= spamc <= 1.0
#
# The idea is that if something scores < hamc, it's called ham; if
# something scores >= spamc, it's called spam; and everything else is
# called 'I am not sure' -- the middle ground.
#
# Note: You may wish to increase nbuckets, to give this scheme more cutoff
# values to analyze.
compute_best_cutoffs_from_histograms: True
best_cutoff_fp_weight: 10.00
best_cutoff_fn_weight: 1.00
best_cutoff_unsure_weight: 0.20
# Histogram analysis also displays percentiles. For each percentile p
# in the list, the score S such that p% of all scores are <= S is given.
# Note that percentile 50 is the median, and is displayed (along with the
# min score and max score) independent of this option.
percentiles: 5 25 75 95
# Display spam when
# show_spam_lo <= spamprob <= show_spam_hi
# and likewise for ham. The defaults here do not show anything.
show_spam_lo: 1.0
show_spam_hi: 0.0
show_ham_lo: 1.0
show_ham_hi: 0.0
show_false_positives: True
show_false_negatives: False
show_unsure: False
# The maximum # of characters to display for a msg displayed due to the
# show_xyz options above.
show_charlimit: 3000
# If save_trained_pickles is true, Driver.train() saves a binary pickle
# of the classifier after training. The file basename is given by
# pickle_basename, the extension is .pik, and increasing integers are
# appended to pickle_basename. By default (if save_trained_pickles is
# true), the filenames are class1.pik, class2.pik, ... If a file of that
# name already exists, it is overwritten. pickle_basename is ignored when
# save_trained_pickles is false.
# if save_histogram_pickles is true, Driver.train() saves a binary
# pickle of the spam and ham histogram for "all test runs". The file
# basename is given by pickle_basename, the suffix _spamhist.pik
# or _hamhist.pik is appended to the basename.
save_trained_pickles: False
pickle_basename: class
save_histogram_pickles: False
# default locations for timcv and timtest - these get the set number
# interpolated.
spam_directories: Data/Spam/Set%d
ham_directories: Data/Ham/Set%d
[CV Driver]
# A cross-validation driver takes N ham+spam sets, and builds N classifiers,
# training each on N-1 sets, and then predicting against the set not trained
# on. By default, it does this in a clever way, learning *and* unlearning
# sets as it goes along, so that it never needs to train on N-1 sets in one
# gulp after the first time. Setting this option true forces ''one gulp
# from-scratch'' training every time. There used to be a set of combining
# schemes that needed this, but now it is just in case you are paranoid <wink>.
build_each_classifier_from_scratch: False
[Classifier]
# The maximum number of extreme words to look at in a msg, where "extreme"
# means with spamprob farthest away from 0.5. 150 appears to work well
# across all corpora tested.
max_discriminators: 150
# These two control the prior assumption about word probabilities.
# unknown_word_prob is essentially the probability given to a word that
# has never been seen before. Nobody has reported an improvement via moving
# it away from 1/2, although Tim has measured a mean spamprob of a bit over
# 0.5 (0.51-0.55) in 3 well-trained classifiers.
#
# unknown_word_strength adjusts how much weight to give the prior assumption
# relative to the probabilities estimated by counting. At 0, the counting
# estimates are believed 100%, even to the extent of assigning certainty
# (0 or 1) to a word that has appeared in only ham or only spam. This
# is a disaster.
#
# As unknown_word_strength tends toward infinity, all probabilities tend
# toward unknown_word_prob. All reports were that a value near 0.4 worked
# best, so this does not seem to be corpus-dependent.
unknown_word_prob: 0.5
unknown_word_strength: 0.45
# When scoring a message, ignore all words with
# abs(word.spamprob - 0.5) < minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
# tests. 0.1 appeared to work well across all corpora.
minimum_prob_strength: 0.1
# The combining scheme currently detailed on the Robinson web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
# scores (near 0.0 or 1.0), but the tail ends of the ham and spam
# distributions overlap.
use_gary_combining: False
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
# follows the chi-squared distribution with 2*n degrees of freedom. This is
# the "provably most-sensitive" test the original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
# lots of the mistakes live. This is the best method so far.
# One systematic benefit is its immunity to "cancellation disease". One
# systematic drawback is sensitivity to *any* deviation from a
# uniform distribution, regardless of whether it is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H).
# In practice, it appears that setting ham_cutoff=0.05, and spam_cutoff=0.95,
# does well across test sets; while these cutoffs are rarely optimal, they
# get close to optimal. With more training data, Tim has had good luck
# with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets
# (original c.l.p data, his own email, and newer general python.org traffic).
use_chi_squared_combining: True
# If the # of ham and spam in training data are out of balance, the
# spamprob guesses can get stronger in the direction of the category with
# more training msgs. In one sense this must be so, since the more data
# we have of one flavor, the more we know about that flavor. But that
# allows the accidental appearance of a strong word of that flavor in a msg
# of the other flavor much more power than an accident in the other
# direction. Enable experimental_ham_spam_imbalance_adjustment if you have
# more ham than spam training data (or more spam than ham), and the
# Bayesian probability adjustment won't 'believe' raw counts more than
# min(# ham trained on, # spam trained on) justifies. I *expect* this
# option will go away (and become the default), but people *with* strong
# imbalance need to test it first.
experimental_ham_spam_imbalance_adjustment: False
[Hammie]
# The name of the header that hammie adds to an E-mail in filter mode
# It contains the "classification" of the mail, plus the score.
hammie_header_name: X-Spambayes-Classification
# The three disposition names are added to the header as the following
# three words:
header_spam_string: spam
header_ham_string: ham
header_unsure_string: unsure
# Accuracy of the score in the header in decimal digits
header_score_digits: 2
# Set this to "True", to augment scores of 1.00 or 0.00 by a logarithmic
# "one-ness" or "zero-ness" score (basically it shows the "number of zeros"
# or "number of nines" next to the score value).
header_score_logarithm: False
# Enable debugging information in the header.
hammie_debug_header: False
# Name of a debugging header for spambayes hackers, showing the strongest
# clues that have resulted in the classification in the standard header.
hammie_debug_header_name: X-Hammie-Debug
# The range of clues that are added to the "debug" header in the E-mail
# All clues that have their probability smaller than this number, or larger
# than one minus this number are added to the header such that you can see
# why spambayes thinks this is ham/spam or why it is unsure. The default is
# to show all clues, but you can reduce that by setting showclue to a lower
# value, such as 0.1
clue_mailheader_cutoff: 0.5
[hammiefilter]
# hammiefilter can use either a database (quick to score one message) or
# a pickle (quick to train on huge amounts of messages). Set this to
# True to use a database by default.
hammiefilter_persistent_use_database: True
hammiefilter_persistent_storage_file: ~/.hammiedb
[pop3proxy]
# pop3proxy settings - pop3proxy also respects the options in the Hammie
# section, with the exception of the extra header details at the moment.
# The only mandatory option is pop3proxy_servers, e.g. "pop3.my-isp.com:110",
# or a comma-separated list of those. The ":110" is optional. If you
# specify more than one server in pop3proxy_servers, you must specify the
# same number of ports in pop3proxy_ports.
pop3proxy_servers:
pop3proxy_ports:
pop3proxy_cache_use_gzip: False
pop3proxy_cache_expiry_days: 7
pop3proxy_spam_cache: pop3proxy-spam-cache
pop3proxy_ham_cache: pop3proxy-ham-cache
pop3proxy_unknown_cache: pop3proxy-unknown-cache
pop3proxy_persistent_use_database: False
pop3proxy_persistent_storage_file: hammie.db
# Deprecated - use pop3proxy_servers and pop3proxy_ports instead.
pop3proxy_server_name:
pop3proxy_server_port: 110
pop3proxy_port: 110
[html_ui]
html_ui_port: 8880
html_ui_launch_browser: False
[globals]
verbose: False
# What DBM storage type should we use? Must be one of best, db3hash,
# dbhash, gdbm, or dumbdbm. Windows folk should steer clear of dbhash.
# "best", which will pick the best DBM type available on your platform.
dbm_type: best
"""
int_cracker = ('getint', None)
float_cracker = ('getfloat', None)
boolean_cracker = ('getboolean', bool)
string_cracker = ('get', None)
all_options = {
'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
'address_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
'generate_long_skips': boolean_cracker,
'summarize_email_prefixes': boolean_cracker,
'skip_max_word_size': int_cracker,
'extract_dow': boolean_cracker,
'generate_time_buckets': boolean_cracker,
'mine_received_headers': boolean_cracker,
'check_octets': boolean_cracker,
'octet_prefix_size': int_cracker,
'basic_header_tokenize': boolean_cracker,
'basic_header_tokenize_only': boolean_cracker,
'basic_header_skip': ('get', lambda s: Set(s.split())),
'replace_nonascii_chars': boolean_cracker,
},
'TestDriver': {'nbuckets': int_cracker,
'show_ham_lo': float_cracker,
'show_ham_hi': float_cracker,
'show_spam_lo': float_cracker,
'show_spam_hi': float_cracker,
'show_false_positives': boolean_cracker,
'show_false_negatives': boolean_cracker,
'show_unsure': boolean_cracker,
'show_histograms': boolean_cracker,
'percentiles': ('get', lambda s: map(float, s.split())),
'save_trained_pickles': boolean_cracker,
'save_histogram_pickles': boolean_cracker,
'pickle_basename': string_cracker,
'show_charlimit': int_cracker,
'ham_cutoff': float_cracker,
'spam_cutoff': float_cracker,
'spam_directories': string_cracker,
'ham_directories': string_cracker,
'compute_best_cutoffs_from_histograms': boolean_cracker,
'best_cutoff_fp_weight': float_cracker,
'best_cutoff_fn_weight': float_cracker,
'best_cutoff_unsure_weight': float_cracker,
},
'CV Driver': {'build_each_classifier_from_scratch': boolean_cracker,
},
'Classifier': {'max_discriminators': int_cracker,
'unknown_word_prob': float_cracker,
'unknown_word_strength': float_cracker,
'minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
'experimental_ham_spam_imbalance_adjustment': boolean_cracker,
},
'Hammie': {'hammie_header_name': string_cracker,
'clue_mailheader_cutoff': float_cracker,
'persistent_use_database': boolean_cracker,
'header_spam_string': string_cracker,
'header_unsure_string': string_cracker,
'header_ham_string': string_cracker,
'header_score_digits': int_cracker,
'header_score_logarithm': boolean_cracker,
'hammie_debug_header': boolean_cracker,
'hammie_debug_header_name': string_cracker,
},
'hammiefilter' : {'hammiefilter_persistent_use_database': boolean_cracker,
'hammiefilter_persistent_storage_file': string_cracker,
},
'pop3proxy': {'pop3proxy_servers': string_cracker,
'pop3proxy_ports': string_cracker,
'pop3proxy_server_name': string_cracker,
'pop3proxy_server_port': int_cracker,
'pop3proxy_port': int_cracker,
'pop3proxy_cache_use_gzip': boolean_cracker,
'pop3proxy_cache_expiry_days': int_cracker,
'pop3proxy_spam_cache': string_cracker,
'pop3proxy_ham_cache': string_cracker,
'pop3proxy_unknown_cache': string_cracker,
'pop3proxy_persistent_use_database': boolean_cracker,
'pop3proxy_persistent_storage_file': string_cracker,
},
'html_ui': {'html_ui_port': int_cracker,
'html_ui_launch_browser': boolean_cracker,
},
'globals': {'verbose': boolean_cracker,
'dbm_type': string_cracker,
},
}
def _warn(msg):
print >> sys.stderr, msg
class OptionsClass(object):
def __init__(self):
self._config = ConfigParser.ConfigParser()
def mergefiles(self, fnamelist):
self._config.read(fnamelist)
self._update()
def mergefilelike(self, filelike):
self._config.readfp(filelike)
self._update()
def _update(self):
nerrors = 0
c = self._config
for section in c.sections():
if section not in all_options:
_warn("config file has unknown section %r" % section)
nerrors += 1
continue
goodopts = all_options[section]
for option in c.options(section):
if option not in goodopts:
_warn("config file has unknown option %r in "
"section %r" % (option, section))
nerrors += 1
continue
fetcher, converter = goodopts[option]
value = getattr(c, fetcher)(section, option)
if converter is not None:
value = converter(value)
setattr(options, option, value)
if nerrors:
raise ValueError("errors while parsing .ini file")
def display(self):
output = StringIO.StringIO()
self._config.write(output)
return output.getvalue()
options = OptionsClass()
d = StringIO.StringIO(defaults)
options.mergefilelike(d)
del d
alternate = None
if hasattr(os, 'getenv'):
alternate = os.getenv('BAYESCUSTOMIZE')
if alternate:
options.mergefiles(alternate.split())
else:
options.mergefiles(['bayescustomize.ini'])
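The "cracker" scheme above pairs each option with a ConfigParser fetch method and an optional converter, then sets the result as an attribute on the shared options object. A minimal Python 3 analogue (configparser replaces the old ConfigParser module; the names and sample options here are illustrative only):

```python
import configparser

# (fetcher, converter) pairs, as in all_options above.
crackers = {
    'nbuckets': ('getint', None),
    'spam_cutoff': ('getfloat', None),
    'percentiles': ('get', lambda s: [float(x) for x in s.split()]),
}

class Opts:
    """Bare attribute bag standing in for OptionsClass."""
    pass

def load(text, opts):
    c = configparser.ConfigParser()
    c.read_string(text)
    for section in c.sections():
        for option in c.options(section):
            fetcher, converter = crackers[option]
            # Fetch with the type-aware method, then convert if needed.
            value = getattr(c, fetcher)(section, option)
            if converter is not None:
                value = converter(value)
            setattr(opts, option, value)

opts = Opts()
load("[TestDriver]\nnbuckets: 200\nspam_cutoff: 0.90\n"
     "percentiles: 5 25 75 95\n", opts)
```

After the load, `opts.nbuckets` is the int 200 and `opts.percentiles` is a list of floats, which is why option names must be unique across sections: they all become attributes of one object.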
--- NEW FILE: TestDriver.py ---
# Loop:
# Optional:
# # Set up a new base classifier for testing.
# new_classifier(), or set_classifier()
# # Run tests against (possibly variants of) this classifier.
# Loop:
# Loop:
# Optional:
# # train on more ham and spam
# train(ham, spam)
# Optional:
# # Forget training for some subset of ham and spam.
# untrain(ham, spam)
# # Predict against other data.
# Loop:
# test(ham, spam)
# # Display stats against all runs on this classifier variant.
# # This also saves the trained classifier, if desired (option
# # save_trained_pickles).
# finishtest()
# # Display stats against all runs.
# alldone()
from sets import Set
import cPickle as pickle
from heapq import heapreplace
from spambayes.Options import options
from spambayes import Tester
from spambayes import classifier
from spambayes.Histogram import Hist
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def printhist(tag, ham, spam, nbuckets=options.nbuckets):
print
print "-> <stat> Ham scores for", tag,
ham.display(nbuckets)
print
print "-> <stat> Spam scores for", tag,
spam.display(nbuckets)
if not options.compute_best_cutoffs_from_histograms:
return
if ham.n == 0 or spam.n == 0:
return
# Figure out "the best" ham & spam cutoff points, meaning the ones that
# minimize
# num_fp * fp_weight + num_fn * fn_weight + num_unsure * unsure_weight
# the total number of misclassified msgs (other definitions are
# certainly possible!).
# At cutoff 0, everything is called spam, so there are no false negatives,
# and every ham is a false positive.
assert ham.nbuckets == spam.nbuckets
n = ham.nbuckets
FPW = options.best_cutoff_fp_weight
FNW = options.best_cutoff_fn_weight
UNW = options.best_cutoff_unsure_weight
# Get running totals: {h,s}total[i] is # of ham/spam below bucket i
htotal = [0] * (n+1)
stotal = [0] * (n+1)
for i in range(1, n+1):
htotal[i] = htotal[i-1] + ham.buckets[i-1]
stotal[i] = stotal[i-1] + spam.buckets[i-1]
assert htotal[-1] == ham.n
assert stotal[-1] == spam.n
best_cost = 1e200 # infinity
bests = [] # best h and s cutoffs
for h in range(n+1):
num_fn = stotal[h]
fn_cost = num_fn * FNW
for s in xrange(h, n+1):
# ham 0:h correct
# h:s unsure
# s: FP
# spam 0:h FN
# h:s unsure
# s: correct
num_fp = htotal[-1] - htotal[s]
num_un = htotal[s] - htotal[h] + stotal[s] - stotal[h]
cost = num_fp * FPW + fn_cost + num_un * UNW
if cost <= best_cost:
if cost < best_cost:
best_cost = cost
bests = []
bests.append((h, s))
print '-> best cost for %s $%.2f' % (tag, best_cost)
print '-> per-fp cost $%.2f; per-fn cost $%.2f; per-unsure cost $%.2f' % (
FPW, FNW, UNW)
if len(bests) > 1:
print '-> achieved at', len(bests), 'cutoff pairs'
info = [('smallest ham & spam cutoffs', bests[0]),
('largest ham & spam cutoffs', bests[-1])]
else:
info = [('achieved at ham & spam cutoffs', bests[0])]
for tag, (h, s) in info:
print '-> %s %g & %g' % (tag, float(h)/n, float(s)/n)
num_fn = stotal[h]
num_fp = htotal[-1] - htotal[s]
num_unh = htotal[s] - htotal[h]
num_uns = stotal[s] - stotal[h]
print '-> fp %d; fn %d; unsure ham %d; unsure spam %d' % (
num_fp, num_fn, num_unh, num_uns)
print '-> fp rate %.3g%%; fn rate %.3g%%; unsure rate %.3g%%' % (
num_fp*1e2 / ham.n, num_fn*1e2 / spam.n,
(num_unh + num_uns)*1e2 / (ham.n + spam.n))
return float(bests[0][0])/n, float(bests[0][1])/n
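The cutoff search in printhist reduces to prefix sums over the two histograms plus a double loop over bucket boundaries. A self-contained Python 3 sketch of that search (the function name and weight defaults are illustrative; unlike printhist, it keeps only one best pair rather than collecting ties):

```python
def best_cutoffs(ham_buckets, spam_buckets, fpw=10.0, fnw=1.0, unw=0.2):
    """Try every (ham, spam) bucket-boundary pair h <= s and return the
    cheapest cost plus the cutoffs as fractions of the score range."""
    n = len(ham_buckets)
    # Running totals: {h,s}total[i] is the # of ham/spam below bucket i.
    htotal = [0] * (n + 1)
    stotal = [0] * (n + 1)
    for i in range(1, n + 1):
        htotal[i] = htotal[i - 1] + ham_buckets[i - 1]
        stotal[i] = stotal[i - 1] + spam_buckets[i - 1]
    best_cost, best = None, None
    for h in range(n + 1):
        fn_cost = stotal[h] * fnw           # spam below h: false negatives
        for s in range(h, n + 1):
            num_fp = htotal[n] - htotal[s]  # ham at/above s: false positives
            num_un = (htotal[s] - htotal[h]) + (stotal[s] - stotal[h])
            cost = num_fp * fpw + fn_cost + num_un * unw
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, (h / n, s / n)
    return best_cost, best
```

On perfectly separated histograms (all ham in the first bucket, all spam in the last) the search finds a zero-cost cutoff pair.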
def printmsg(msg, prob, clues):
print msg.tag
print "prob =", prob
for clue in clues:
print "prob(%r) = %g" % clue
print
guts = str(msg)
if options.show_charlimit > 0:
guts = guts[:options.show_charlimit]
print guts
class Driver:
def __init__(self):
self.falsepos = Set()
self.falseneg = Set()
self.unsure = Set()
self.global_ham_hist = Hist()
self.global_spam_hist = Hist()
self.ntimes_finishtest_called = 0
self.new_classifier()
from spambayes import CostCounter
self.cc=CostCounter.default()
def new_classifier(self):
"""Create and use a new, virgin classifier."""
self.set_classifier(classifier.Bayes())
def set_classifier(self, classifier):
"""Specify a classifier to be used for further testing."""
self.classifier = classifier
self.tester = Tester.Test(classifier)
self.trained_ham_hist = Hist()
self.trained_spam_hist = Hist()
def train(self, ham, spam):
print "-> Training on", ham, "&", spam, "...",
c = self.classifier
nham, nspam = c.nham, c.nspam
self.tester.train(ham, spam)
print c.nham - nham, "hams &", c.nspam - nspam, "spams"
def untrain(self, ham, spam):
print "-> Forgetting", ham, "&", spam, "...",
c = self.classifier
nham, nspam = c.nham, c.nspam
self.tester.untrain(ham, spam)
print nham - c.nham, "hams &", nspam - c.nspam, "spams"
def finishtest(self):
if options.show_histograms:
printhist("all in this training set:",
self.trained_ham_hist, self.trained_spam_hist)
self.global_ham_hist += self.trained_ham_hist
self.global_spam_hist += self.trained_spam_hist
self.trained_ham_hist = Hist()
self.trained_spam_hist = Hist()
self.ntimes_finishtest_called += 1
if options.save_trained_pickles:
fname = "%s%d.pik" % (options.pickle_basename,
self.ntimes_finishtest_called)
print " saving pickle to", fname
fp = file(fname, 'wb')
pickle.dump(self.classifier, fp, 1)
fp.close()
def alldone(self):
if options.show_histograms:
besthamcut,bestspamcut = printhist("all runs:",
self.global_ham_hist,
self.global_spam_hist)
else:
besthamcut = options.ham_cutoff
bestspamcut = options.spam_cutoff
nham = self.global_ham_hist.n
nspam = self.global_spam_hist.n
nfp = len(self.falsepos)
nfn = len(self.falseneg)
nun = len(self.unsure)
print "-> <stat> all runs false positives:", nfp
print "-> <stat> all runs false negatives:", nfn
print "-> <stat> all runs unsure:", nun
print "-> <stat> all runs false positive %:", (nfp * 1e2 / nham)
print "-> <stat> all runs false negative %:", (nfn * 1e2 / nspam)
print "-> <stat> all runs unsure %:", (nun * 1e2 / (nham + nspam))
print "-> <stat> all runs cost: $%.2f" % (
nfp * options.best_cutoff_fp_weight +
nfn * options.best_cutoff_fn_weight +
nun * options.best_cutoff_unsure_weight)
# Set back the options for the delayed calculations in self.cc
options.ham_cutoff = besthamcut
options.spam_cutoff = bestspamcut
print self.cc
if options.save_histogram_pickles:
for f, h in (('ham', self.global_ham_hist),
('spam', self.global_spam_hist)):
fname = "%s_%shist.pik" % (options.pickle_basename, f)
print " saving %s histogram pickle to %s" %(f, fname)
fp = file(fname, 'wb')
pickle.dump(h, fp, 1)
fp.close()
def test(self, ham, spam):
c = self.classifier
t = self.tester
local_ham_hist = Hist()
local_spam_hist = Hist()
def new_ham(msg, prob, lo=options.show_ham_lo,
hi=options.show_ham_hi):
local_ham_hist.add(prob * 100.0)
self.cc.ham(prob)
if lo <= prob <= hi:
print
print "Ham with prob =", prob
prob, clues = c.spamprob(msg, True)
printmsg(msg, prob, clues)
def new_spam(msg, prob, lo=options.show_spam_lo,
hi=options.show_spam_hi):
local_spam_hist.add(prob * 100.0)
self.cc.spam(prob)
if lo <= prob <= hi:
print
print "Spam with prob =", prob
prob, clues = c.spamprob(msg, True)
printmsg(msg, prob, clues)
t.reset_test_results()
print "-> Predicting", ham, "&", spam, "..."
t.predict(spam, True, new_spam)
t.predict(ham, False, new_ham)
print "-> <stat> tested", t.nham_tested, "hams &", t.nspam_tested, \
"spams against", c.nham, "hams &", c.nspam, "spams"
print "-> <stat> false positive %:", t.false_positive_rate()
print "-> <stat> false negative %:", t.false_negative_rate()
print "-> <stat> unsure %:", t.unsure_rate()
print "-> <stat> cost: $%.2f" % (
t.nham_wrong * options.best_cutoff_fp_weight +
t.nspam_wrong * options.best_cutoff_fn_weight +
(t.nham_unsure + t.nspam_unsure) *
options.best_cutoff_unsure_weight)
newfpos = Set(t.false_positives()) - self.falsepos
self.falsepos |= newfpos
print "-> <stat> %d new false positives" % len(newfpos)
if newfpos:
print " new fp:", [e.tag for e in newfpos]
if not options.show_false_positives:
newfpos = ()
for e in newfpos:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues)
newfneg = Set(t.false_negatives()) - self.falseneg
self.falseneg |= newfneg
print "-> <stat> %d new false negatives" % len(newfneg)
if newfneg:
print " new fn:", [e.tag for e in newfneg]
if not options.show_false_negatives:
newfneg = ()
for e in newfneg:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues)
newunsure = Set(t.unsures()) - self.unsure
self.unsure |= newunsure
print "-> <stat> %d new unsure" % len(newunsure)
if newunsure:
print " new unsure:", [e.tag for e in newunsure]
if not options.show_unsure:
newunsure = ()
for e in newunsure:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues)
if options.show_histograms:
printhist("this pair:", local_ham_hist, local_spam_hist)
self.trained_ham_hist += local_ham_hist
self.trained_spam_hist += local_spam_hist
--- NEW FILE: Tester.py ---
from spambayes.Options import options
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
class Test:
# Pass a classifier instance (an instance of Bayes).
# Loop:
# # Train the classifier with new ham and spam.
# train(ham, spam) # this implies reset_test_results
# Loop:
# Optional:
# # Possibly fiddle the classifier.
# set_classifier()
# # Forget messages the classifier was trained on.
# untrain(ham, spam) # this implies reset_test_results
# Optional:
# reset_test_results()
# # Predict against (presumably new) examples.
# predict(ham, spam)
# Optional:
# suck out the results, via instance variables and
# false_negative_rate(), false_positive_rate(),
# false_negatives(), and false_positives()
def __init__(self, classifier):
self.set_classifier(classifier)
self.reset_test_results()
# Tell the tester which classifier to use.
def set_classifier(self, classifier):
self.classifier = classifier
def reset_test_results(self):
# The number of ham and spam instances tested.
self.nham_tested = self.nspam_tested = 0
# The number of test instances correctly and incorrectly classified.
self.nham_right = 0
self.nham_wrong = 0
self.nham_unsure = 0
self.nspam_right = 0
self.nspam_wrong = 0
self.nspam_unsure = 0
# Lists of bad predictions.
self.ham_wrong_examples = [] # False positives: ham called spam.
self.spam_wrong_examples = [] # False negatives: spam called ham.
self.unsure_examples = [] # ham and spam in middle ground
# Train the classifier on streams of ham and spam. Updates probabilities
# before returning, and resets test results.
def train(self, hamstream=None, spamstream=None):
self.reset_test_results()
learn = self.classifier.learn
if hamstream is not None:
for example in hamstream:
learn(example, False)
if spamstream is not None:
for example in spamstream:
learn(example, True)
# Untrain the classifier on streams of ham and spam. Updates
# probabilities before returning, and resets test results.
def untrain(self, hamstream=None, spamstream=None):
self.reset_test_results()
unlearn = self.classifier.unlearn
if hamstream is not None:
for example in hamstream:
unlearn(example, False)
if spamstream is not None:
for example in spamstream:
unlearn(example, True)
# Run prediction on each sample in stream. You're swearing that stream
# is entirely composed of spam (is_spam True), or of ham (is_spam False).
# Note that mispredictions are saved, and can be retrieved later via
# false_negatives (spam mistakenly called ham) and false_positives (ham
# mistakenly called spam). For this reason, you may wish to wrap examples
# in a little class that identifies the example in a useful way, and whose
# __iter__ produces a token stream for the classifier.
#
# If specified, callback(msg, spam_probability) is called for each
# msg in the stream, after the spam probability is computed.
def predict(self, stream, is_spam, callback=None):
guess = self.classifier.spamprob
for example in stream:
prob = guess(example)
if callback:
callback(example, prob)
is_ham_guessed = prob < options.ham_cutoff
is_spam_guessed = prob >= options.spam_cutoff
if is_spam:
self.nspam_tested += 1
if is_spam_guessed:
self.nspam_right += 1
elif is_ham_guessed:
self.nspam_wrong += 1
self.spam_wrong_examples.append(example)
else:
self.nspam_unsure += 1
self.unsure_examples.append(example)
else:
self.nham_tested += 1
if is_ham_guessed:
self.nham_right += 1
elif is_spam_guessed:
self.nham_wrong += 1
self.ham_wrong_examples.append(example)
else:
self.nham_unsure += 1
self.unsure_examples.append(example)
assert (self.nham_right + self.nham_wrong + self.nham_unsure ==
self.nham_tested)
assert (self.nspam_right + self.nspam_wrong + self.nspam_unsure ==
self.nspam_tested)
def false_positive_rate(self):
"""Percentage of ham mistakenly identified as spam, in 0.0..100.0."""
return self.nham_wrong * 1e2 / (self.nham_tested or 1)
def false_negative_rate(self):
"""Percentage of spam mistakenly identified as ham, in 0.0..100.0."""
return self.nspam_wrong * 1e2 / (self.nspam_tested or 1)
def unsure_rate(self):
return ((self.nham_unsure + self.nspam_unsure) * 1e2 /
((self.nham_tested + self.nspam_tested) or 1))
def false_positives(self):
return self.ham_wrong_examples
def false_negatives(self):
return self.spam_wrong_examples
def unsures(self):
return self.unsure_examples
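The three-way decision inside predict() reduces to a comparison against the
two cutoffs; a minimal sketch (the cutoff values here are illustrative, the
real ones come from Options):

```python
def bucket(prob, ham_cutoff=0.2, spam_cutoff=0.9):
    # Mirrors predict(): below ham_cutoff is ham, at or above
    # spam_cutoff is spam, everything in between is unsure.
    if prob < ham_cutoff:
        return 'ham'
    if prob >= spam_cutoff:
        return 'spam'
    return 'unsure'

assert bucket(0.05) == 'ham'
assert bucket(0.95) == 'spam'
assert bucket(0.5) == 'unsure'
```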
class _Example:
def __init__(self, name, words):
self.name = name
self.words = words
def __iter__(self):
return iter(self.words)
_easy_test = """
>>> from spambayes.classifier import Bayes
>>> from spambayes.Options import options
>>> options.ham_cutoff = options.spam_cutoff = 0.5
>>> good1 = _Example('', ['a', 'b', 'c'])
>>> good2 = _Example('', ['a', 'b'])
>>> bad1 = _Example('', ['c', 'd'])
>>> t = Test(Bayes())
>>> t.train([good1, good2], [bad1])
>>> t.predict([_Example('goodham', ['a', 'b']),
... _Example('badham', ['d']) # FP
... ], False)
>>> t.predict([_Example('goodspam', ['d']),
... _Example('badspam1', ['a']), # FN
... _Example('badspam2', ['a', 'b']), # FN
... _Example('badspam3', ['d', 'a', 'b']) # FN
... ], True)
>>> t.nham_tested
2
>>> t.nham_right, t.nham_wrong
(1, 1)
>>> t.false_positive_rate()
50.0
>>> [e.name for e in t.false_positives()]
['badham']
>>> t.nspam_tested
4
>>> t.nspam_right, t.nspam_wrong
(1, 3)
>>> t.false_negative_rate()
75.0
>>> [e.name for e in t.false_negatives()]
['badspam1', 'badspam2', 'badspam3']
>>> [e.name for e in t.unsures()]
[]
>>> t.unsure_rate()
0.0
"""
__test__ = {'easy': _easy_test}
def _test():
import doctest
from spambayes import Tester
doctest.testmod(Tester)
if __name__ == '__main__':
_test()
--- NEW FILE: __init__.py ---
# package marker.
--- NEW FILE: cdb.py ---
#! /usr/bin/env python
"""
Dan Bernstein's CDB implemented in Python
see http://cr.yp.to/cdb.html
"""
from __future__ import generators
import os
import struct
import mmap
import sys
def uint32_unpack(buf):
return struct.unpack('<L', buf)[0]
def uint32_pack(n):
return struct.pack('<L', n)
CDB_HASHSTART = 5381
def cdb_hash(buf):
h = CDB_HASHSTART
for c in buf:
h = (h + (h << 5)) & 0xffffffffL
h ^= ord(c)
return h
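cdb_hash is djb's h = (h * 33) XOR c hash with the usual 5381 seed, masked to
32 bits. A standalone copy makes the arithmetic easy to check:

```python
def djb_hash(s):
    # h = ((h * 33) ^ ord(c)) & 0xffffffff, starting from the seed 5381;
    # h + (h << 5) is the same as h * 33.
    h = 5381
    for c in s:
        h = ((h + (h << 5)) & 0xffffffff) ^ ord(c)
    return h

assert djb_hash("") == 5381                      # the bare seed
assert djb_hash("a") == (5381 * 33) ^ ord("a")
```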
class Cdb(object):
def __init__(self, fp):
self.fp = fp
fd = fp.fileno()
self.size = os.fstat(fd).st_size
self.map = mmap.mmap(fd, self.size, access=mmap.ACCESS_READ)
self.eod = uint32_unpack(self.map[:4])
self.findstart()
self.loop = 0 # number of hash slots searched under this key
# initialized if loop is nonzero
self.khash = 0
self.hpos = 0
self.hslots = 0
# initialized if findnext() returns 1
self.dpos = 0
self.dlen = 0
def close(self):
self.map.close()
def __iter__(self, fn=None):
pos = 2048
while pos < self.eod:
klen, vlen = struct.unpack("<LL", self.map[pos:pos+8])
pos += 8
key = self.map[pos:pos+klen]
pos += klen
val = self.map[pos:pos+vlen]
pos += vlen
if fn:
yield fn(key, val)
else:
yield (key, val)
def iteritems(self):
return self.__iter__()
def iterkeys(self):
return self.__iter__(lambda k,v: k)
def itervalues(self):
return self.__iter__(lambda k,v: v)
def items(self):
ret = []
for i in self.iteritems():
ret.append(i)
return ret
def keys(self):
ret = []
for i in self.iterkeys():
ret.append(i)
return ret
def values(self):
ret = []
for i in self.itervalues():
ret.append(i)
return ret
def findstart(self):
self.loop = 0
def read(self, n, pos):
# XXX add code for platforms without mmap
return self.map[pos:pos+n]
def match(self, key, pos):
if key == self.read(len(key), pos):
return 1
else:
return 0
def findnext(self, key):
if not self.loop:
u = cdb_hash(key)
buf = self.read(8, u << 3 & 2047)
self.hslots = uint32_unpack(buf[4:])
if not self.hslots:
raise KeyError
self.hpos = uint32_unpack(buf[:4])
self.khash = u
u >>= 8
u %= self.hslots
u <<= 3
self.kpos = self.hpos + u
while self.loop < self.hslots:
buf = self.read(8, self.kpos)
pos = uint32_unpack(buf[4:])
if not pos:
raise KeyError
self.loop += 1
self.kpos += 8
if self.kpos == self.hpos + (self.hslots << 3):
self.kpos = self.hpos
u = uint32_unpack(buf[:4])
if u == self.khash:
buf = self.read(8, pos)
u = uint32_unpack(buf[:4])
if u == len(key):
if self.match(key, pos + 8):
dlen = uint32_unpack(buf[4:])
dpos = pos + 8 + len(key)
return self.read(dlen, dpos)
raise KeyError
def __getitem__(self, key):
self.findstart()
return self.findnext(key)
def get(self, key, default=None):
self.findstart()
try:
return self.findnext(key)
except KeyError:
return default
def cdb_dump(infile):
"""dump a database in djb's cdbdump format"""
db = Cdb(infile)
for key,value in db.iteritems():
print "+%d,%d:%s->%s" % (len(key), len(value), key, value)
print
def cdb_make(outfile, items):
pos = 2048
tables = {} # { h & 255 : [(h, p)] }
# write keys and data
outfile.seek(pos)
for key, value in items:
outfile.write(uint32_pack(len(key)) + uint32_pack(len(value)))
h = cdb_hash(key)
outfile.write(key)
outfile.write(value)
tables.setdefault(h & 255, []).append((h, pos))
pos += 8 + len(key) + len(value)
final = ''
# write hash tables
for i in range(256):
entries = tables.get(i, [])
nslots = 2*len(entries)
final += uint32_pack(pos) + uint32_pack(nslots)
null = (0, 0)
table = [null] * nslots
for h, p in entries:
n = (h >> 8) % nslots
while table[n] is not null:
n = (n + 1) % nslots
table[n] = (h, p)
for h, p in table:
outfile.write(uint32_pack(h) + uint32_pack(p))
pos += 8
# write header (pointers to tables and their lengths)
outfile.flush()
outfile.seek(0)
outfile.write(final)
def test():
#db = Cdb(open("t"))
#print db['one']
#print db['two']
#print db['foo']
#print db['us']
#print db.get('ec')
#print db.get('notthere')
db = open('test.cdb', 'wb')
cdb_make(db,
[('one', 'Hello'),
('two', 'Goodbye'),
('foo', 'Bar'),
('us', 'United States'),
])
db.close()
db = Cdb(open("test.cdb", 'rb'))
print db['one']
print db['two']
print db['foo']
print db['us']
print db.get('ec')
print db.get('notthere')
if __name__ == '__main__':
test()
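The write/read pair can also be exercised in memory. This sketch ports
cdb_make and the sequential-iteration half of Cdb to bytes and BytesIO (a
test harness to illustrate the file layout, not part of the module):

```python
import io
import struct

def djb_hash(data):
    # djb hash over bytes, as cdb_hash does over characters
    h = 5381
    for b in data:
        h = ((h + (h << 5)) & 0xffffffff) ^ b
    return h

def pack(n):
    return struct.pack('<L', n)

def make_cdb(out, items):
    # Same layout as cdb_make: records start at byte 2048, hash tables
    # follow, and the 2048-byte header holds (table pos, nslots) pairs.
    pos = 2048
    tables = {}
    out.seek(pos)
    for key, value in items:
        out.write(pack(len(key)) + pack(len(value)) + key + value)
        h = djb_hash(key)
        tables.setdefault(h & 255, []).append((h, pos))
        pos += 8 + len(key) + len(value)
    header = b''
    for i in range(256):
        entries = tables.get(i, [])
        nslots = 2 * len(entries)
        header += pack(pos) + pack(nslots)
        table = [(0, 0)] * nslots
        for h, p in entries:
            n = (h >> 8) % nslots
            while table[n] != (0, 0):    # linear probing on collision
                n = (n + 1) % nslots
            table[n] = (h, p)
        for h, p in table:
            out.write(pack(h) + pack(p))
            pos += 8
    out.seek(0)
    out.write(header)

def iter_cdb(buf):
    # Walk the record area sequentially, as Cdb.__iter__ does; the first
    # table pointer in the header doubles as the end-of-data offset.
    eod = struct.unpack('<L', buf[:4])[0]
    pos = 2048
    while pos < eod:
        klen, vlen = struct.unpack('<LL', buf[pos:pos + 8])
        pos += 8
        yield buf[pos:pos + klen], buf[pos + klen:pos + klen + vlen]
        pos += klen + vlen

fp = io.BytesIO()
make_cdb(fp, [(b'one', b'Hello'), (b'two', b'Goodbye')])
data = fp.getvalue()
assert dict(iter_cdb(data)) == {b'one': b'Hello', b'two': b'Goodbye'}
```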
--- NEW FILE: chi2.py ---
import math as _math
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def chi2Q(x2, v, exp=_math.exp, min=min):
"""Return prob(chisq >= x2, with v degrees of freedom).
v must be even.
"""
assert v & 1 == 0
# XXX If x2 is very large, exp(-m) will underflow to 0.
m = x2 / 2.0
sum = term = exp(-m)
for i in range(1, v//2):
term *= m / i
sum += term
# With small x2 and large v, accumulated roundoff error, plus error in
# the platform exp(), can cause this to spill a few ULP above 1.0. For
# example, chi2Q(100, 300) on my box has sum == 1.0 + 2.0**-52 at this
# point. Returning a value even a teensy bit over 1.0 is no good.
return min(sum, 1.0)
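A quick numeric sanity check against the closed forms for small even v (a
standalone copy of the series, so it runs without the module):

```python
import math

def chi2q(x2, v):
    # Same series as chi2Q above: prob(chisq >= x2) with even v degrees
    # of freedom, capped at 1.0 to absorb roundoff.
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

# Closed forms: v=2 -> exp(-m), v=4 -> exp(-m) * (1 + m)
assert abs(chi2q(2.0, 2) - math.exp(-1.0)) < 1e-12
assert abs(chi2q(2.0, 4) - 2.0 * math.exp(-1.0)) < 1e-12
assert chi2q(0.0, 10) == 1.0
```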
def normZ(z, sqrt2pi=_math.sqrt(2.0*_math.pi), exp=_math.exp):
"Return value of the unit Gaussian at z."
return exp(-z*z/2.0) / sqrt2pi
def normP(z):
"""Return area under the unit Gaussian from -inf to z.
This is the probability that a zscore is <= z.
"""
# This is very accurate in a fixed-point sense. For negative z of
# large magnitude (<= -8.3), it returns 0.0, essentially because
# P(-z) is, to machine precision, indistinguishable from 1.0 then.
# sum <- area from 0 to abs(z).
a = abs(float(z))
if a >= 8.3:
sum = 0.5
else:
sum2 = term = a * normZ(a)
z2 = a*a
sum = 0.0
i = 1.0
while sum != sum2:
sum = sum2
i += 2.0
term *= z2 / i
sum2 += term
if z >= 0:
result = 0.5 + sum
else:
result = 0.5 - sum
return result
def normIQ(p, sqrt=_math.sqrt, ln=_math.log):
"""Return z such that the area under the unit Gaussian from z to +inf is p.
Must have 0.0 <= p <= 1.0.
"""
assert 0.0 <= p <= 1.0
# This is a low-accuracy rational approximation from Abramowitz & Stegun.
# The absolute error is bounded by 3e-3.
flipped = False
if p > 0.5:
flipped = True
p = 1.0 - p
if p == 0.0:
z = 8.3
else:
t = sqrt(-2.0 * ln(p))
z = t - (2.30753 + .27061*t) / (1. + .99229*t + .04481*t**2)
if flipped:
z = -z
return z
def normIP(p):
"""Return z such that the area under the unit Gaussian from -inf to z is p.
Must have 0.0 <= p <= 1.0.
"""
z = normIQ(1.0 - p)
# One Newton step should double the # of good digits.
return z + (p - normP(z)) / normZ(z)
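normIQ plus the single Newton step in normIP inverts normP to good accuracy;
a self-contained round-trip check (standalone copies of the functions above):

```python
import math

def norm_z(z):
    # Unit Gaussian density, as normZ above.
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def norm_p(z):
    # Series for the Gaussian CDF, as normP above.
    a = abs(float(z))
    if a >= 8.3:
        s = 0.5
    else:
        s2 = term = a * norm_z(a)
        z2, s, i = a * a, 0.0, 1.0
        while s != s2:
            s = s2
            i += 2.0
            term *= z2 / i
            s2 += term
    return 0.5 + s if z >= 0 else 0.5 - s

def norm_iq(p):
    # Low-accuracy rational approximation, as normIQ above.
    flipped = p > 0.5
    if flipped:
        p = 1.0 - p
    if p == 0.0:
        z = 8.3
    else:
        t = math.sqrt(-2.0 * math.log(p))
        z = t - (2.30753 + .27061*t) / (1. + .99229*t + .04481*t**2)
    return -z if flipped else z

def norm_ip(p):
    # One Newton step on the rough inverse, as normIP above.
    z = norm_iq(1.0 - p)
    return z + (p - norm_p(z)) / norm_z(z)

# Round trip: normP(normIP(p)) lands well inside the 3e-3 raw error bound.
for p in (0.1, 0.3, 0.5, 0.8):
    assert abs(norm_p(norm_ip(p)) - p) < 1e-4
```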
def main():
from spambayes.Histogram import Hist
import sys
class WrappedRandom:
# There's no way W-H is equidistributed in 50 dimensions, so use
# Marsaglia-wrapping to shuffle it more.
def __init__(self, baserandom=random.random, tabsize=513):
self.baserandom = baserandom
self.n = tabsize
self.tab = [baserandom() for i in range(tabsize)]
self.next = baserandom()
def random(self):
result = self.next
i = int(result * self.n)
self.next = self.tab[i]
self.tab[i] = self.baserandom()
return result
random = WrappedRandom().random
#from uni import uni as random
#print random
def judge(ps, ln=_math.log, ln2=_math.log(2), frexp=_math.frexp):
H = S = 1.0
Hexp = Sexp = 0
for p in ps:
S *= 1.0 - p
H *= p
if S < 1e-200:
S, e = frexp(S)
Sexp += e
if H < 1e-200:
H, e = frexp(H)
Hexp += e
S = ln(S) + Sexp * ln2
H = ln(H) + Hexp * ln2
n = len(ps)
S = 1.0 - chi2Q(-2.0 * S, 2*n)
H = 1.0 - chi2Q(-2.0 * H, 2*n)
return S, H, (S-H + 1.0) / 2.0
warp = 0
bias = 0.99
if len(sys.argv) > 1:
warp = int(sys.argv[1])
if len(sys.argv) > 2:
bias = float(sys.argv[2])
h = Hist(20, lo=0.0, hi=1.0)
s = Hist(20, lo=0.0, hi=1.0)
score = Hist(20, lo=0.0, hi=1.0)
for i in range(5000):
ps = [random() for j in range(50)]
s1, h1, score1 = judge(ps + [bias] * warp)
s.add(s1)
h.add(h1)
score.add(score1)
print "Result for random vectors of 50 probs, +", warp, "forced to", bias
# Should be uniformly distributed on all-random data.
print
print 'H',
h.display()
# Should be uniformly distributed on all-random data.
print
print 'S',
s.display()
# Distribution doesn't really matter.
print
print '(S-H+1)/2',
score.display()
def showscore(ps, ln=_math.log, ln2=_math.log(2), frexp=_math.frexp):
H = S = 1.0
Hexp = Sexp = 0
for p in ps:
S *= 1.0 - p
H *= p
if S < 1e-200:
S, e = frexp(S)
Sexp += e
if H < 1e-200:
H, e = frexp(H)
Hexp += e
S = ln(S) + Sexp * ln2
H = ln(H) + Hexp * ln2
n = len(ps)
probS = chi2Q(-2*S, 2*n)
probH = chi2Q(-2*H, 2*n)
print "P(chisq >= %10g | v=%3d) = %10g" % (-2*S, 2*n, probS)
print "P(chisq >= %10g | v=%3d) = %10g" % (-2*H, 2*n, probH)
S = 1.0 - probS
H = 1.0 - probH
score = (S-H + 1.0) / 2.0
print "spam prob", S
print " ham prob", H
print "(S-H+1)/2", score
if __name__ == '__main__':
import random
main()
--- NEW FILE: classifier.py ---
#! /usr/bin/env python
# An implementation of a Bayes-like spam classifier.
#
# Paul Graham's original description:
#
# http://www.paulgraham.com/spam.html
#
# A highly fiddled version of that can be retrieved from our CVS repository,
# via tag Last-Graham. This made many demonstrated improvements in error
# rates over Paul's original description.
#
# This code implements Gary Robinson's suggestions, the core of which are
# well explained on his webpage:
#
# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
#
# This is theoretically cleaner, and in testing has performed at least as
# well as our highly tuned Graham scheme did, often slightly better, and
# sometimes much better. It also has "a middle ground", which people like:
# the scores under Paul's scheme were almost always very near 0 or very near
# 1, whether or not the classification was correct. The false positives
# and false negatives under Gary's basic scheme (use_gary_combining) generally
# score in a narrow range around the corpus's best spam_cutoff value.
# However, it doesn't appear possible to guess the best spam_cutoff value in
# advance, and it's touchy.
#
# The chi-combining scheme used by default here gets closer to the theoretical
# basis of Gary's combining scheme, and does give extreme scores, but also
# has a very useful middle ground (small # of msgs spread across a large range
# of scores, and good cutoff values aren't touchy).
#
# This implementation is due to Tim Peters et alia.
import math
from sets import Set
from spambayes.Options import options
from spambayes.chi2 import chi2Q
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
LN2 = math.log(2) # used frequently by chi-combining
PICKLE_VERSION = 5
class WordInfo(object):
# Invariant: For use in a classifier database, at least one of
# spamcount and hamcount must be non-zero.
def __init__(self):
self.__setstate__((0, 0))
def __repr__(self):
return "WordInfo%r" % ((self.spamcount,
self.hamcount),)
def __getstate__(self):
return (self.spamcount,
self.hamcount)
def __setstate__(self, t):
(self.spamcount, self.hamcount) = t
class Classifier:
# Defining __slots__ here made Jeremy's life needlessly difficult when
# trying to hook this all up to ZODB as a persistent object. There's
# no space benefit worth getting from slots in this class; slots were
# used solely to help catch errors earlier, when this code was changing
# rapidly.
#__slots__ = ('wordinfo', # map word to WordInfo record
# 'nspam', # number of spam messages learn() has seen
# 'nham', # number of non-spam messages learn() has seen
# )
# allow a subclass to use a different class for WordInfo
WordInfoClass = WordInfo
def __init__(self):
self.wordinfo = {}
self.probcache = {}
self.nspam = self.nham = 0
def __getstate__(self):
return (PICKLE_VERSION, self.wordinfo, self.nspam, self.nham)
def __setstate__(self, t):
if t[0] != PICKLE_VERSION:
raise ValueError("Can't unpickle -- version %s unknown" % t[0])
(self.wordinfo, self.nspam, self.nham) = t[1:]
self.probcache = {}
# spamprob() implementations. One of the following is aliased to
# spamprob, depending on option settings.
def gary_spamprob(self, wordstream, evidence=False):
"""Return best-guess probability that wordstream is spam.
wordstream is an iterable object producing words.
The return value is a float in [0.0, 1.0].
If optional arg evidence is True, the return value is a pair
probability, evidence
where evidence is a list of (word, probability) pairs.
"""
from math import frexp
# This combination method is due to Gary Robinson; see
# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
# The real P = this P times 2**Pexp. Likewise for Q. We're
# simulating unbounded dynamic float range by hand. If this pans
# out, *maybe* we should store logarithms in the database instead
# and just add them here. But I like keeping raw counts in the
# database (they're easy to understand, manipulate and combine),
# and there's no evidence that this simulation is a significant
# expense.
P = Q = 1.0
Pexp = Qexp = 0
clues = self._getclues(wordstream)
for prob, word, record in clues:
P *= 1.0 - prob
Q *= prob
if P < 1e-200: # move back into range
P, e = frexp(P)
Pexp += e
if Q < 1e-200: # move back into range
Q, e = frexp(Q)
Qexp += e
P, e = frexp(P)
Pexp += e
Q, e = frexp(Q)
Qexp += e
num_clues = len(clues)
if num_clues:
#P = 1.0 - P**(1./num_clues)
#Q = 1.0 - Q**(1./num_clues)
#
# (x*2**e)**n = x**n * 2**(e*n)
n = 1.0 / num_clues
P = 1.0 - P**n * 2.0**(Pexp * n)
Q = 1.0 - Q**n * 2.0**(Qexp * n)
# (P-Q)/(P+Q) is in -1 .. 1; scaling into 0 .. 1 gives
# ((P-Q)/(P+Q)+1)/2 =
# ((P-Q+P+Q)/(P+Q))/2 =
# (2*P/(P+Q))/2 =
# P/(P+Q)
prob = P/(P+Q)
else:
prob = 0.5
if evidence:
clues = [(w, p) for p, w, r in clues]
clues.sort(lambda a, b: cmp(a[1], b[1]))
return prob, clues
else:
return prob
if options.use_gary_combining:
spamprob = gary_spamprob
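The frexp renormalization used above keeps a product of many tiny
probabilities from underflowing to 0.0; a standalone demonstration of the
trick:

```python
import math

def log_product(probs):
    # Track the product as mantissa * 2**exp, renormalizing with frexp
    # whenever the mantissa gets dangerously small, then convert to a log:
    # ln(x * 2**i) = ln(x) + i * ln(2).
    p, pexp = 1.0, 0
    for x in probs:
        p *= x
        if p < 1e-200:
            p, e = math.frexp(p)
            pexp += e
    return math.log(p) + pexp * math.log(2)

probs = [1e-3] * 500
# The naive product underflows to zero...
naive = 1.0
for x in probs:
    naive *= x
assert naive == 0.0
# ...but the frexp-tracked log matches the exact sum of the logs.
assert abs(log_product(probs) - 500 * math.log(1e-3)) < 1e-6
```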
# Across vectors of length n, containing random uniformly-distributed
# probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution
# with 2*n degrees of freedom. This has been proven (in some
# appropriate sense) to be the most sensitive possible test for
# rejecting the hypothesis that a vector of probabilities is uniformly
# distributed. Gary Robinson's original scheme was monotonic *with*
# this test, but skipped the details. Turns out that getting closer
# to the theoretical roots gives a much sharper classification, with
# a very small (in # of msgs), but also very broad (in range of scores),
# "middle ground", where most of the mistakes live. In particular,
# this scheme seems immune to all forms of "cancellation disease": if
# there are many strong ham *and* spam clues, this reliably scores
# close to 0.5. Most other schemes are extremely certain then -- and
# often wrong.
def chi2_spamprob(self, wordstream, evidence=False):
"""Return best-guess probability that wordstream is spam.
wordstream is an iterable object producing words.
The return value is a float in [0.0, 1.0].
If optional arg evidence is True, the return value is a pair
probability, evidence
where evidence is a list of (word, probability) pairs.
"""
from math import frexp, log as ln
# We compute two chi-squared statistics, one for ham and one for
# spam. The sum-of-the-logs business is more sensitive to probs
# near 0 than to probs near 1, so the spam measure uses 1-p (so
# that high-spamprob words have greatest effect), and the ham
# measure uses p directly (so that lo-spamprob words have greatest
# effect).
#
# For optimization, sum-of-logs == log-of-product, and f.p.
# multiplication is a lot cheaper than calling ln(). It's easy
# to underflow to 0.0, though, so we simulate unbounded dynamic
# range via frexp. The real product H = this H * 2**Hexp, and
# likewise the real product S = this S * 2**Sexp.
H = S = 1.0
Hexp = Sexp = 0
clues = self._getclues(wordstream)
for prob, word, record in clues:
S *= 1.0 - prob
H *= prob
if S < 1e-200: # prevent underflow
S, e = frexp(S)
Sexp += e
if H < 1e-200: # prevent underflow
H, e = frexp(H)
Hexp += e
# Compute the natural log of the product = sum of the logs:
# ln(x * 2**i) = ln(x) + i * ln(2).
S = ln(S) + Sexp * LN2
H = ln(H) + Hexp * LN2
n = len(clues)
if n:
S = 1.0 - chi2Q(-2.0 * S, 2*n)
H = 1.0 - chi2Q(-2.0 * H, 2*n)
# How to combine these into a single spam score? We originally
# used (S-H)/(S+H) scaled into [0., 1.], which equals S/(S+H). A
# systematic problem is that we could end up being near-certain
# a thing was (for example) spam, even if S was small, provided
# that H was much smaller.
# Rob Hooft stared at these problems and invented the measure
# we use now, the simpler S-H, scaled into [0., 1.].
prob = (S-H + 1.0) / 2.0
else:
prob = 0.5
if evidence:
clues = [(w, p) for p, w, r in clues]
clues.sort(lambda a, b: cmp(a[1], b[1]))
clues.insert(0, ('*S*', S))
clues.insert(0, ('*H*', H))
return prob, clues
else:
return prob
if options.use_chi_squared_combining:
spamprob = chi2_spamprob
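The claim above that chi-combining lands near 0.5 when strong ham and spam
clues cancel can be checked directly (standalone copies of chi2Q and the
combining arithmetic, without the frexp guard, which is fine for short clue
lists like these):

```python
import math

def chi2q(x2, v):
    # prob(chisq >= x2) for even v, as chi2Q in chi2.py
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi2_combine(probs):
    # S and H measures as in chi2_spamprob, then Rob Hooft's (S-H+1)/2.
    s = h = 0.0
    for p in probs:
        s += math.log(1.0 - p)
        h += math.log(p)
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * s, 2 * n)
    h = 1.0 - chi2q(-2.0 * h, 2 * n)
    return (s - h + 1.0) / 2.0

# Ten strong spam clues and ten equally strong ham clues cancel to 0.5.
assert abs(chi2_combine([0.99] * 10 + [0.01] * 10) - 0.5) < 1e-9
# All strong spam clues push the score toward 1.0.
assert chi2_combine([0.99] * 10) > 0.99
```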
def learn(self, wordstream, is_spam):
"""Teach the classifier by example.
wordstream is a word stream representing a message. If is_spam is
True, you're telling the classifier this message is definitely spam,
else that it's definitely not spam.
"""
self._add_msg(wordstream, is_spam)
def unlearn(self, wordstream, is_spam):
"""In case of pilot error, call unlearn ASAP after screwing up.
Pass the same arguments you passed to learn().
"""
self._remove_msg(wordstream, is_spam)
def probability(self, record):
"""Compute, store, and return prob(msg is spam | msg contains word).
This is the Graham calculation, but stripped of biases, and
stripped of clamping into 0.01 thru 0.99. The Bayesian
adjustment following keeps them in a sane range, and one
that naturally grows the more evidence there is to back up
a probability.
"""
spamcount = record.spamcount
hamcount = record.hamcount
# Try the cache first
try:
return self.probcache[spamcount][hamcount]
except KeyError:
pass
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
assert hamcount <= nham
hamratio = hamcount / nham
assert spamcount <= nspam
spamratio = spamcount / nspam
prob = spamratio / (hamratio + spamratio)
if options.experimental_ham_spam_imbalance_adjustment:
spam2ham = min(nspam / nham, 1.0)
ham2spam = min(nham / nspam, 1.0)
else:
spam2ham = ham2spam = 1.0
S = options.unknown_word_strength
StimesX = S * options.unknown_word_prob
# Now do Robinson's Bayesian adjustment.
#
# s*x + n*p(w)
# f(w) = --------------
# s + n
#
# I find this easier to reason about like so (equivalent when
# s != 0):
#
# x - p
# p + -------
# 1 + n/s
#
# IOW, it moves p a fraction of the distance from p to x, and
# less so the larger n is, or the smaller s is.
# Experimental:
# Picking a good value for n is interesting: how much empirical
# evidence do we really have? If nham == nspam,
# hamcount + spamcount makes a lot of sense, and the code here
# does that by default.
# But if, e.g., nham is much larger than nspam, p(w) can get a
# lot closer to 0.0 than it can get to 1.0. That in turn makes
# strong ham words (high hamcount) much stronger than strong
# spam words (high spamcount), and that makes the accidental
# appearance of a strong ham word in spam much more damaging than
# the accidental appearance of a strong spam word in ham.
# So we don't give hamcount full credit when nham > nspam (or
# spamcount when nspam > nham): instead we knock hamcount down
# to what it would have been had nham been equal to nspam. IOW,
# we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
# we don't "believe" any count to an extent more than
# min(nspam, nham) justifies.
n = hamcount * spam2ham + spamcount * ham2spam
prob = (StimesX + n * prob) / (S + n)
# Update the cache
try:
self.probcache[spamcount][hamcount] = prob
except KeyError:
self.probcache[spamcount] = {hamcount: prob}
return prob
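Worked numerically, the adjustment f(w) = (s*x + n*p(w)) / (s + n) moves the
raw probability toward x by an amount governed by n (the s and x values below
are illustrative, matching the usual option defaults):

```python
def robinson_adjust(prob, n, s=0.45, x=0.5):
    # f(w) = (s*x + n*p(w)) / (s + n): with no evidence (n == 0) the
    # result is exactly x; as n grows it converges to the raw prob.
    return (s * x + n * prob) / (s + n)

assert robinson_adjust(1.0, 0) == 0.5                    # unseen word -> x
assert abs(robinson_adjust(1.0, 1) - 1.225 / 1.45) < 1e-12
assert abs(robinson_adjust(1.0, 1000) - 1.0) < 1e-3      # evidence dominates
```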
# NOTE: Graham's scheme had a strange asymmetry: when a word appeared
# n>1 times in a single message, training added n to the word's hamcount
# or spamcount, but predicting scored words only once. Tests showed
# that adding only 1 in training, or scoring more than once when
# predicting, hurt under the Graham scheme.
# This isn't so under Robinson's scheme, though: results improve
# if training also counts a word only once. The mean ham score decreases
# significantly and consistently, ham score variance decreases likewise,
# mean spam score decreases (but less than mean ham score, so the spread
# increases), and spam score variance increases.
# I (Tim) speculate that adding n times under the Graham scheme helped
# because it acted against the various ham biases, giving frequently
# repeated spam words (like "Viagra") a quick ramp-up in spamprob; else,
# adding only once in training, a word like that was simply ignored until
# it appeared in 5 distinct training spams. Without the ham-favoring
# biases, though, and never ignoring words, counting n times introduces
# a subtle and unhelpful bias.
# There does appear to be some useful info in how many times a word
# appears in a msg, but distorting spamprob doesn't appear a correct way
# to exploit it.
def _add_msg(self, wordstream, is_spam):
self.probcache = {} # nuke the prob cache
if is_spam:
self.nspam += 1
else:
self.nham += 1
for word in Set(wordstream):
record = self._wordinfoget(word)
if record is None:
record = self.WordInfoClass()
if is_spam:
record.spamcount += 1
else:
record.hamcount += 1
self._wordinfoset(word, record)
def _remove_msg(self, wordstream, is_spam):
self.probcache = {} # nuke the prob cache
if is_spam:
if self.nspam <= 0:
raise ValueError("spam count would go negative!")
self.nspam -= 1
else:
if self.nham <= 0:
raise ValueError("non-spam count would go negative!")
self.nham -= 1
for word in Set(wordstream):
record = self._wordinfoget(word)
if record is not None:
if is_spam:
if record.spamcount > 0:
record.spamcount -= 1
else:
if record.hamcount > 0:
record.hamcount -= 1
if record.hamcount == 0 == record.spamcount:
self._wordinfodel(word)
else:
self._wordinfoset(word, record)
def _getclues(self, wordstream):
mindist = options.minimum_prob_strength
unknown = options.unknown_word_prob
clues = [] # (distance, prob, word, record) tuples
pushclue = clues.append
for word in Set(wordstream):
record = self._wordinfoget(word)
if record is None:
prob = unknown
else:
prob = self.probability(record)
distance = abs(prob - 0.5)
if distance >= mindist:
pushclue((distance, prob, word, record))
clues.sort()
if len(clues) > options.max_discriminators:
del clues[0 : -options.max_discriminators]
# Return (prob, word, record).
return [t[1:] for t in clues]
def _wordinfoget(self, word):
return self.wordinfo.get(word)
def _wordinfoset(self, word, record):
self.wordinfo[word] = record
def _wordinfodel(self, word):
del self.wordinfo[word]
Bayes = Classifier
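The clue-selection rule in _getclues above can be sketched standalone: keep
only words whose probability is at least minimum_prob_strength away from the
neutral 0.5, then retain the max_discriminators strongest clues. The word
probabilities and option values below are made-up stand-ins for illustration,
not real spambayes data or defaults.

```python
def strongest_clues(word_probs, max_discriminators=150,
                    minimum_prob_strength=0.1):
    """Mirror _getclues: filter weak words, keep the strongest clues."""
    clues = []  # (distance, prob, word) tuples
    for word, prob in word_probs.items():
        distance = abs(prob - 0.5)
        if distance >= minimum_prob_strength:
            clues.append((distance, prob, word))
    clues.sort()
    # Too many clues: drop the weakest (smallest-distance) ones, as in
    # the `del clues[0 : -options.max_discriminators]` line above.
    if len(clues) > max_discriminators:
        del clues[0:-max_discriminators]
    # Return (prob, word), weakest clue first.
    return [(prob, word) for distance, prob, word in clues]

# Hypothetical per-word spam probabilities.
probs = {"viagra": 0.99, "meeting": 0.02, "the": 0.51, "offer": 0.8}
clues = strongest_clues(probs, max_discriminators=3)
```

Note that "the" (probability 0.51, distance 0.01 from neutral) is discarded
before the max_discriminators cap is even considered.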
--- NEW FILE: dbmstorage.py ---
"""Wrapper to open an appropriate dbm storage type."""
from spambayes.Options import options
import sys
class error(Exception):
pass
def open_db3hash(*args):
"""Open a bsddb3 hash."""
import bsddb3
return bsddb3.hashopen(*args)
def open_dbhash(*args):
"""Open a bsddb hash. Don't use this on Windows."""
import bsddb
return bsddb.hashopen(*args)
def open_gdbm(*args):
"""Open a gdbm database."""
import gdbm
return gdbm.open(*args)
def open_dumbdbm(*args):
"""Open a dumbdbm database."""
import dumbdbm
return dumbdbm.open(*args)
def open_best(*args):
if sys.platform == "win32":
funcs = [open_db3hash, open_gdbm, open_dumbdbm]
else:
funcs = [open_db3hash, open_dbhash, open_gdbm, open_dumbdbm]
for f in funcs:
try:
return f(*args)
except ImportError:
pass
raise error("No dbm modules available!")
open_funcs = {
"best": open_best,
"db3hash": open_db3hash,
"dbhash": open_dbhash,
"gdbm": open_gdbm,
"dumbdbm": open_dumbdbm,
}
def open(*args):
dbm_type = options.dbm_type.lower()
f = open_funcs.get(dbm_type)
if not f:
raise error("Unknown dbm type in options file")
return f(*args)
--- NEW FILE: hammie.py ---
#! /usr/bin/env python
from spambayes import mboxutils
from spambayes import storage
from spambayes.Options import options
from spambayes.tokenizer import tokenize
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
class Hammie:
"""A spambayes mail filter.
This implements the basic functionality needed to score, filter, or
train.
"""
def __init__(self, bayes):
self.bayes = bayes
def _scoremsg(self, msg, evidence=False):
"""Score a Message.
msg can be a string, a file object, or a Message object.
Returns the probability the message is spam. If evidence is
true, returns a tuple: (probability, clues), where clues is a
list of the words which contributed to the score.
"""
return self.bayes.spamprob(tokenize(msg), evidence)
def formatclues(self, clues, sep="; "):
"""Format the clues into something readable."""
return sep.join(["%r: %.2f" % (word, prob)
for word, prob in clues
if (word[0] == '*' or
prob <= options.clue_mailheader_cutoff or
prob >= 1.0 - options.clue_mailheader_cutoff)])
def score(self, msg, evidence=False):
"""Score (judge) a message.
msg can be a string, a file object, or a Message object.
Returns the probability the message is spam. If evidence is
true, returns a tuple: (probability, clues), where clues is a
list of the words which contributed to the score.
"""
return self._scoremsg(msg, evidence)
def filter(self, msg, header=None, spam_cutoff=None,
ham_cutoff=None, debugheader=None,
debug=None):
"""Score (judge) a message and add a disposition header.
msg can be a string, a file object, or a Message object.
Optionally, set header to the name of the header to add, and/or
spam_cutoff/ham_cutoff to the probability values which must be met
or exceeded for a message to get a 'Spam' or 'Ham' classification.
An extra debugging header can be added if 'debug' is set to True.
The name of the debugging header is given as 'debugheader'.
All defaults for optional parameters come from the Options file.
Returns the same message with a new disposition header.
"""
if header is None:
header = options.hammie_header_name
if spam_cutoff is None:
spam_cutoff = options.spam_cutoff
if ham_cutoff is None:
ham_cutoff = options.ham_cutoff
if debugheader is None:
debugheader = options.hammie_debug_header_name
if debug is None:
debug = options.hammie_debug_header
msg = mboxutils.get_message(msg)
try:
del msg[header]
except KeyError:
pass
prob, clues = self._scoremsg(msg, True)
if prob < ham_cutoff:
disp = options.header_ham_string
elif prob > spam_cutoff:
disp = options.header_spam_string
else:
disp = options.header_unsure_string
disp += ("; %." + str(options.header_score_digits) + "f") % prob
if options.header_score_logarithm:
import math
if 0.0 < prob <= 0.005:
x = -math.log10(prob)
disp += " (%d)" % x
elif 0.995 <= prob < 1.0:
x = -math.log10(1.0 - prob)
disp += " (%d)" % x
msg.add_header(header, disp)
if debug:
disp = self.formatclues(clues)
msg.add_header(debugheader, disp)
return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
def train(self, msg, is_spam):
"""Train bayes with a message.
msg can be a string, a file object, or a Message object.
is_spam should be 1 if the message is spam, 0 if not.
"""
self.bayes.learn(tokenize(msg), is_spam)
def untrain(self, msg, is_spam):
"""Untrain bayes with a message.
msg can be a string, a file object, or a Message object.
is_spam should be 1 if the message is spam, 0 if not.
"""
self.bayes.unlearn(tokenize(msg), is_spam)
def train_ham(self, msg):
"""Train bayes with ham.
msg can be a string, a file object, or a Message object.
"""
self.train(msg, False)
def train_spam(self, msg):
"""Train bayes with spam.
msg can be a string, a file object, or a Message object.
"""
self.train(msg, True)
def untrain_ham(self, msg):
"""Untrain bayes with ham.
msg can be a string, a file object, or a Message object.
"""
self.untrain(msg, False)
def untrain_spam(self, msg):
"""Untrain bayes with spam.
msg can be a string, a file object, or a Message object.
"""
self.untrain(msg, True)
def store(self):
"""Write out the persistent store.
This makes sure the persistent store reflects what is currently
in memory. You would want to do this after a write and before
exiting.
"""
self.bayes.store()
def open(filename, usedb=True, mode='r'):
"""Open a file, returning a Hammie instance.
If usedb is False, open as a pickle instead of a DBDict. mode is
used as the flag to open DBDict objects. 'c' for read-write (create
if needed), 'r' for read-only, 'w' for read-write.
"""
if usedb:
b = storage.DBDictClassifier(filename, mode)
else:
b = storage.PickledClassifier(filename)
return Hammie(b)
if __name__ == "__main__":
# Everybody's used to running hammie.py. Why mess with success? ;)
import hammiebulk
hammiebulk.main()
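The disposition logic in Hammie.filter can be sketched in isolation. The
cutoff values, digit count, and "Ham"/"Spam"/"Unsure" labels below are
assumed stand-ins for the Options defaults, not the configured values.

```python
import math

def disposition(prob, ham_cutoff=0.2, spam_cutoff=0.9, digits=2,
                logarithm=True):
    """Build the disposition header value for a given spam probability."""
    # Classify by the cutoffs, as Hammie.filter does.
    if prob < ham_cutoff:
        disp = "Ham"
    elif prob > spam_cutoff:
        disp = "Spam"
    else:
        disp = "Unsure"
    disp += ("; %." + str(digits) + "f") % prob
    # For extreme scores the rounded value is uninformative, so append
    # -log10 of the distance from 0 or 1: "0.00 (6)" means roughly 1e-6.
    if logarithm:
        if 0.0 < prob <= 0.005:
            disp += " (%d)" % -math.log10(prob)
        elif 0.995 <= prob < 1.0:
            disp += " (%d)" % -math.log10(1.0 - prob)
    return disp
```

A score of exactly 0.0 or 1.0 skips the logarithmic annotation, since the
log of zero distance is undefined.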
--- NEW FILE: hammiebulk.py ---
#! /usr/bin/env python
"""Usage: %(program)s [-D|-d] [options]
Where:
-h
show usage and exit
-d
use the DBM store. A DBM file is larger than the pickle and
creating it is slower, but loading it is much faster,
especially for large word databases. Recommended for use with
hammiefilter or any procmail-based filter.
-D
use the pickle store. A pickle is smaller and faster to create,
but much slower to load. Recommended for use with pop3proxy and
hammiesrv.
-p FILE
use FILE as the persistent store. Loads data from this file if it
exists, and saves data to this file at the end.
Default: %(DEFAULTDB)s
-f
run as a filter: read a single message from stdin, add a new
header, and write it to stdout. If you want to run from
procmail, this is your option.
-g PATH
mbox or directory of known good messages (non-spam) to train on.
Can be specified more than once, or use - for stdin.
-s PATH
mbox or directory of known spam messages to train on.
Can be specified more than once, or use - for stdin.
-u PATH
mbox of unknown messages. A ham/spam decision is reported for each.
Can be specified more than once.
-r
reverse the meaning of the check (report ham instead of spam).
Only meaningful with the -u option.
"""
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def bool(val):
return not not val
import sys
import os
import types
import getopt
import mailbox
import glob
import email
import errno
import cPickle as pickle
from spambayes.Options import options
from spambayes import classifier, mboxutils, hammie, Corpus
Corpus.Verbose = True
program = sys.argv[0] # For usage(); referenced by docstring above
# Default database name
DEFAULTDB = os.path.expanduser(options.hammiefilter_persistent_storage_file)
# Probability at which a message is considered spam
SPAM_THRESHOLD = options.spam_cutoff
HAM_THRESHOLD = options.ham_cutoff
def train(h, msgs, is_spam):
"""Train bayes with all messages from a mailbox."""
mbox = mboxutils.getmbox(msgs)
i = 0
for msg in mbox:
i += 1
sys.stdout.write("\r%6d" % i)
sys.stdout.flush()
h.train(msg, is_spam)
print
def score(h, msgs, reverse=0):
"""Score (judge) all messages from a mailbox."""
# XXX The reporting needs work!
mbox = mboxutils.getmbox(msgs)
i = 0
spams = hams = 0
for msg in mbox:
i += 1
prob, clues = h.score(msg, True)
if hasattr(msg, '_mh_msgno'):
msgno = msg._mh_msgno
else:
msgno = i
isspam = (prob >= SPAM_THRESHOLD)
if isspam:
spams += 1
if not reverse:
print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
print h.formatclues(clues)
else:
hams += 1
if reverse:
print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
print h.formatclues(clues)
return (spams, hams)
def usage(code, msg=''):
"""Print usage message and sys.exit(code)."""
if msg:
print >> sys.stderr, msg
print >> sys.stderr
print >> sys.stderr, __doc__ % globals()
sys.exit(code)
def main():
"""Main program; parse options and go."""
try:
opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r')
except getopt.error, msg:
usage(2, msg)
if not opts:
usage(2, "No options given")
pck = DEFAULTDB
good = []
spam = []
unknown = []
reverse = 0
do_filter = False
usedb = None
mode = 'r'
for opt, arg in opts:
if opt == '-h':
usage(0)
elif opt == '-g':
good.append(arg)
mode = 'c'
elif opt == '-s':
spam.append(arg)
mode = 'c'
elif opt == '-p':
pck = arg
elif opt == "-d":
usedb = True
elif opt == "-D":
usedb = False
elif opt == "-f":
do_filter = True
elif opt == '-u':
unknown.append(arg)
elif opt == '-r':
reverse = 1
if args:
usage(2, "Positional arguments not allowed")
if usedb is None:
usage(2, "Must specify one of -d or -D")
save = False
h = hammie.open(pck, usedb, mode)
for g in good:
print "Training ham (%s):" % g
train(h, g, False)
save = True
for s in spam:
print "Training spam (%s):" % s
train(h, s, True)
save = True
if save:
h.store()
if do_filter:
msg = sys.stdin.read()
filtered = h.filter(msg)
sys.stdout.write(filtered)
if unknown:
(spams, hams) = (0, 0)
for u in unknown:
if len(unknown) > 1:
print "Scoring", u
s, g = score(h, u, reverse)
spams += s
hams += g
print "Total %d spam, %d ham" % (spams, hams)
if __name__ == "__main__":
main()
--- NEW FILE: mboxutils.py ---
#! /usr/bin/env python
"""Utilities for dealing with various types of mailboxes.
This is mostly a wrapper around the various useful classes in the
standard mailbox module, to do some intelligent guessing of the
mailbox type given a mailbox argument.
+foo -- MH mailbox +foo
+foo,bar -- MH mailboxes +foo and +bar concatenated
+ALL -- a shortcut for *all* MH mailboxes
/foo/bar -- (existing file) a Unix-style mailbox
/foo/bar/ -- (existing directory) a directory full of .txt and .lorien
files
/foo/bar/ -- (existing directory with a cur/ subdirectory)
Maildir mailbox
/foo/Mail/bar/ -- (existing directory with /Mail/ in its path)
alternative way of spelling an MH mailbox
"""
from __future__ import generators
import os
import sys
import glob
import email
import mailbox
import email.Message
import re
class DirOfTxtFileMailbox:
"""Mailbox directory consisting of .txt and .lorien files."""
def __init__(self, dirname, factory):
self.names = (glob.glob(os.path.join(dirname, "*.txt")) +
glob.glob(os.path.join(dirname, "*.lorien")))
self.names.sort()
self.factory = factory
def __iter__(self):
for name in self.names:
try:
f = open(name)
except IOError:
continue
yield self.factory(f)
f.close()
def _cat(seqs):
for seq in seqs:
for item in seq:
yield item
def getmbox(name):
"""Return an mbox iterator given a file/directory/folder name."""
if name == "-":
return [get_message(sys.stdin)]
if name.startswith("+"):
# MH folder name: +folder, +f1,f2,f3, or +ALL
name = name[1:]
import mhlib
mh = mhlib.MH()
if name == "ALL":
names = mh.listfolders()
elif ',' in name:
names = name.split(',')
else:
names = [name]
mboxes = []
mhpath = mh.getpath()
for name in names:
filename = os.path.join(mhpath, name)
mbox = mailbox.MHMailbox(filename, get_message)
mboxes.append(mbox)
if len(mboxes) == 1:
return iter(mboxes[0])
else:
return _cat(mboxes)
if os.path.isdir(name):
# XXX Bogus: use a Maildir if /cur is a subdirectory, else a MHMailbox
# if the pathname contains /Mail/, else a DirOfTxtFileMailbox.
if os.path.exists(os.path.join(name, 'cur')):
mbox = mailbox.Maildir(name, get_message)
elif name.find("/Mail/") >= 0:
mbox = mailbox.MHMailbox(name, get_message)
else:
mbox = DirOfTxtFileMailbox(name, get_message)
else:
fp = open(name, "rb")
mbox = mailbox.PortableUnixMailbox(fp, get_message)
return iter(mbox)
def get_message(obj):
"""Return an email Message object.
The argument may be a Message object already, in which case it's
returned as-is.
If the argument is a string or file-like object (supports read()),
the email package is used to create a Message object from it. This
can fail if the message is malformed. In that case, the headers
(everything through the first blank line) are thrown out, and the
rest of the text is wrapped in a bare email.Message.Message.
"""
if isinstance(obj, email.Message.Message):
return obj
# Create an email Message object.
if hasattr(obj, "read"):
obj = obj.read()
try:
msg = email.message_from_string(obj)
except email.Errors.MessageParseError:
# Wrap the raw text in a bare Message object. Since the
# headers are most likely damaged, we can't use the email
# package to parse them, so just get rid of them first.
headers = extract_headers(obj)
obj = obj[len(headers):]
msg = email.Message.Message()
msg.set_payload(obj)
return msg
header_break_re = re.compile(r"\r?\n(\r?\n)")
def extract_headers(text):
"""Very simple-minded header extraction: prefix of text up to blank line.
A blank line is recognized via two adjacent line-ending sequences, where
a line-ending sequence is a newline optionally preceded by a carriage
return.
If no blank line is found, all of text is considered to be a potential
header section. If a blank line is found, the text up to (but not
including) the blank line is considered to be a potential header section.
The potential header section is returned, unless it doesn't contain a
colon, in which case an empty string is returned.
>>> extract_headers("abc")
''
>>> extract_headers("abc\\n\\n\\n") # no colon
''
>>> extract_headers("abc: xyz\\n\\n\\n")
'abc: xyz\\n'
>>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
'abc: xyz\\r\\n'
>>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
'a: b\\ngibberish\\n'
"""
m = header_break_re.search(text)
if m:
eol = m.start(1)
text = text[:eol]
if ':' not in text:
text = ""
return text
def _test():
import doctest, mboxutils
return doctest.testmod(mboxutils)
if __name__ == "__main__":
_test()
--- NEW FILE: msgs.py ---
from __future__ import generators
import os
import random
from spambayes.tokenizer import tokenize
HAMTEST = None
SPAMTEST = None
HAMTRAIN = None
SPAMTRAIN = None
SEED = random.randrange(2000000000)
class Msg(object):
__slots__ = 'tag', 'guts'
def __init__(self, dir, name):
path = dir + "/" + name
self.tag = path
f = open(path, 'rb')
self.guts = f.read()
f.close()
def __iter__(self):
return tokenize(self.guts)
# Compare msgs by their paths; this is appropriate for sets of msgs.
def __hash__(self):
return hash(self.tag)
def __eq__(self, other):
return self.tag == other.tag
def __str__(self):
return self.guts
# The iterator yields a stream of Msg objects, taken from a list of
# directories.
class MsgStream(object):
__slots__ = 'tag', 'directories', 'keep'
def __init__(self, tag, directories, keep=None):
self.tag = tag
self.directories = directories
self.keep = keep
def __str__(self):
return self.tag
def produce(self):
if self.keep is None:
for directory in self.directories:
for fname in os.listdir(directory):
yield Msg(directory, fname)
return
# We only want part of the msgs. Shuffle each directory list, but
# in such a way that we'll get the same result each time this is
# called on the same directory list.
for directory in self.directories:
all = os.listdir(directory)
random.seed(hash(max(all)) ^ SEED) # reproducible across calls
random.shuffle(all)
del all[self.keep:]
all.sort() # seems to speed access on Win98!
for fname in all:
yield Msg(directory, fname)
def __iter__(self):
return self.produce()
class HamStream(MsgStream):
def __init__(self, tag, directories, train=0):
if train:
MsgStream.__init__(self, tag, directories, HAMTRAIN)
else:
MsgStream.__init__(self, tag, directories, HAMTEST)
class SpamStream(MsgStream):
def __init__(self, tag, directories, train=0):
if train:
MsgStream.__init__(self, tag, directories, SPAMTRAIN)
else:
MsgStream.__init__(self, tag, directories, SPAMTEST)
def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
"""Set HAMTEST/TRAIN and SPAMTEST/TRAIN.
If seed is not None, also set SEED.
If (ham|spam)test are not set, set to the same as the (ham|spam)train
numbers (backwards compat option).
"""
global HAMTEST, SPAMTEST, HAMTRAIN, SPAMTRAIN, SEED
HAMTRAIN, SPAMTRAIN = hamtrain, spamtrain
if hamtest is None:
HAMTEST = HAMTRAIN
else:
HAMTEST = hamtest
if spamtest is None:
SPAMTEST = SPAMTRAIN
else:
SPAMTEST = spamtest
if seed is not None:
SEED = seed
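The reproducible-subset trick in MsgStream.produce can be isolated: seeding
the shuffle from the directory contents means the same "random" subset comes
back on every call over the same listing. The filenames and SEED value below
are made up for illustration.

```python
import random

SEED = 123456789  # stands in for the module-level random SEED above

def reproducible_subset(names, keep):
    """Shuffle deterministically, keep `keep` names, restore sort order."""
    names = list(names)
    # Same listing -> same seed -> same shuffle, reproducible across calls.
    random.seed(hash(max(names)) ^ SEED)
    random.shuffle(names)
    del names[keep:]   # discard everything past the first `keep` names
    names.sort()       # sorted access "seems to speed access on Win98!"
    return names

files = ["msg%03d" % i for i in range(10)]
first = reproducible_subset(files, 4)
second = reproducible_subset(files, 4)
```

Both calls return the identical four names, so train/test splits stay stable
between runs without storing the selection anywhere.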
--- NEW FILE: optimize.py ---
#
__version__ = '$Id: optimize.py,v 1.1.2.1 2003/01/10 10:41:08 anthonybaxter Exp $'
#
# Optimize any parametric function.
#
import copy
import Numeric
def SimplexMaximize(var, err, func, convcrit = 0.001, minerr = 0.001):
var = Numeric.array(var)
simplex = [var]
for i in range(len(var)):
var2 = copy.copy(var)
var2[i] = var[i] + err[i]
simplex.append(var2)
value = []
for i in range(len(simplex)):
value.append(func(simplex[i]))
while 1:
# Determine worst and best
wi = 0
bi = 0
for i in range(len(simplex)):
if value[wi] > value[i]:
wi = i
if value[bi] < value[i]:
bi = i
# Test for convergence
#print "worst, best are",wi,bi,"with",value[wi],value[bi]
if abs(value[bi] - value[wi]) <= convcrit:
return simplex[bi]
# Calculate average of non-worst
ave=Numeric.zeros(len(var), 'd')
for i in range(len(simplex)):
if i != wi:
ave = ave + simplex[i]
ave = ave / (len(simplex) - 1)
worst = Numeric.array(simplex[wi])
# Check for too-small simplex
simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
if simsize <= minerr:
#print "Size of simplex too small:",simsize
return simplex[bi]
# Invert worst
new = 2 * ave - simplex[wi]
newv = func(new)
if newv <= value[wi]:
# Even worse. Shrink instead
#print "Shrunk simplex"
#print "ave=",repr(ave)
#print "wi=",repr(worst)
new = 0.5 * ave + 0.5 * worst
newv = func(new)
elif newv > value[bi]:
# Better than the best. Expand
new2 = 3 * ave - 2 * worst
newv2 = func(new2)
if newv2 > newv:
# Accept
#print "Expanded simplex"
new = new2
newv = newv2
simplex[wi] = new
value[wi] = newv
def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):
err = Numeric.array(err)
var = SimplexMaximize(var, err, func, convcrit*5, minerr*5)
return SimplexMaximize(var, 0.4 * err, func, convcrit, minerr)
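The simplex loop above depends on the old Numeric package; the same
reflect/expand/shrink scheme can be sketched in pure Python with lists. This
is a simplified port under the stated assumptions (a maxiter guard is added
that the original lacks), not a drop-in replacement.

```python
def simplex_maximize(var, err, func, convcrit=0.001, minerr=0.001,
                     maxiter=10000):
    """Maximize func over len(var) parameters via a simplex search."""
    add = lambda a, b: [x + y for x, y in zip(a, b)]
    scale = lambda a, s: [x * s for x in a]
    # Build the initial simplex: var plus one point per perturbed axis.
    simplex = [list(var)]
    for i in range(len(var)):
        point = list(var)
        point[i] += err[i]
        simplex.append(point)
    value = [func(p) for p in simplex]
    for _ in range(maxiter):
        wi = min(range(len(simplex)), key=lambda i: value[i])  # worst
        bi = max(range(len(simplex)), key=lambda i: value[i])  # best
        if abs(value[bi] - value[wi]) <= convcrit:
            return simplex[bi]
        # Average of the non-worst points.
        ave = [0.0] * len(var)
        for i, p in enumerate(simplex):
            if i != wi:
                ave = add(ave, p)
        ave = scale(ave, 1.0 / (len(simplex) - 1))
        if sum(abs(a - w) for a, w in zip(ave, simplex[wi])) <= minerr:
            return simplex[bi]  # simplex has collapsed
        # Reflect the worst point through the average.
        new = add(scale(ave, 2.0), scale(simplex[wi], -1.0))
        newv = func(new)
        if newv <= value[wi]:
            # Even worse: shrink the worst point toward the average.
            new = add(scale(ave, 0.5), scale(simplex[wi], 0.5))
            newv = func(new)
        elif newv > value[bi]:
            # Better than the best: try expanding further.
            new2 = add(scale(ave, 3.0), scale(simplex[wi], -2.0))
            newv2 = func(new2)
            if newv2 > newv:
                new, newv = new2, newv2
        simplex[wi] = new
        value[wi] = newv
    return simplex[bi]

# A concave quadratic with its maximum at (3, -1).
best = simplex_maximize([0.0, 0.0], [1.0, 1.0],
                        lambda p: -(p[0] - 3.0) ** 2 - (p[1] + 1.0) ** 2)
```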
--- NEW FILE: storage.py ---
#! /usr/bin/env python
'''storage.py - Spambayes database management framework.
Classes:
PickledClassifier - Classifier that uses a pickle db
DBDictClassifier - Classifier that uses a shelve db
Trainer - Classifier training observer
SpamTrainer - Trainer for spam
HamTrainer - Trainer for ham
Abstract:
The *Classifier classes are subclasses of Classifier (classifier.Classifier)
that add automatic state store/restore functionality to the Classifier class.
PickledClassifier is a Classifier class that uses a cPickle
datastore. This database is relatively small, but slower than other
databases.
DBDictClassifier is a Classifier class that uses a database
store.
Trainer is a concrete class that observes a Corpus and trains a
Classifier object based upon movement of messages between corpora. When
an add message notification is received, the trainer trains the
database with the message, as spam or ham as appropriate given the
type of trainer (spam or ham). When a remove message notification
is received, the trainer untrains the database as appropriate.
SpamTrainer and HamTrainer are convenience subclasses of Trainer that
initialize as the appropriate type of Trainer.
To Do:
o ZODBClassifier
o Would Trainer.trainall really want to train with the whole corpus,
or just a random subset?
o Suggestions?
'''
# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.
__author__ = "Neale Pickett <neale at woozle.org>, \
Tim Stone <tim at fourstonesExpressions.com>"
__credits__ = "All the spambayes contributors."
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def bool(val):
return not not val
from spambayes import classifier
from spambayes.Options import options
import cPickle as pickle
import errno
import shelve
from spambayes import dbmstorage
# Make shelve use binary pickles by default.
oldShelvePickler = shelve.Pickler
def binaryDefaultPickler(f, binary=1):
return oldShelvePickler(f, binary)
shelve.Pickler = binaryDefaultPickler
PICKLE_TYPE = 1
NO_UPDATEPROBS = False # Probabilities will not be autoupdated with training
UPDATEPROBS = True # Probabilities will be autoupdated with training
class PickledClassifier(classifier.Classifier):
'''Classifier object persisted in a pickle'''
def __init__(self, db_name):
classifier.Classifier.__init__(self)
self.db_name = db_name
self.load()
def load(self):
'''Load this instance from the pickle.'''
# This is a bit strange, because the loading process
# creates a temporary instance of PickledClassifier, from which
# this object's state is copied. This is a nuance of the way
# that pickle does its job
if options.verbose:
print 'Loading state from',self.db_name,'pickle'
tempbayes = None
try:
fp = open(self.db_name, 'rb')
except IOError, e:
if e.errno != errno.ENOENT: raise
else:
tempbayes = pickle.load(fp)
fp.close()
# XXX: why not self.__setstate__(tempbayes.__getstate__())?
if tempbayes:
self.wordinfo = tempbayes.wordinfo
self.nham = tempbayes.nham
self.nspam = tempbayes.nspam
if options.verbose:
print '%s is an existing pickle, with %d ham and %d spam' \
% (self.db_name, self.nham, self.nspam)
else:
# new pickle
if options.verbose:
print self.db_name,'is a new pickle'
self.wordinfo = {}
self.nham = 0
self.nspam = 0
def store(self):
'''Store self as a pickle'''
if options.verbose:
print 'Persisting',self.db_name,'as a pickle'
fp = open(self.db_name, 'wb')
pickle.dump(self, fp, PICKLE_TYPE)
fp.close()
class DBDictClassifier(classifier.Classifier):
'''Classifier object persisted in a caching database'''
def __init__(self, db_name, mode='c'):
'''Constructor(database name)'''
classifier.Classifier.__init__(self)
self.wordcache = {}
self.statekey = "saved state"
self.mode = mode
self.db_name = db_name
self.load()
def load(self):
'''Load state from database'''
if options.verbose:
print 'Loading state from',self.db_name,'database'
self.dbm = dbmstorage.open(self.db_name, self.mode)
self.db = shelve.Shelf(self.dbm)
if self.db.has_key(self.statekey):
t = self.db[self.statekey]
if t[0] != classifier.PICKLE_VERSION:
raise ValueError("Can't unpickle -- version %s unknown" % t[0])
(self.nspam, self.nham) = t[1:]
if options.verbose:
print '%s is an existing database, with %d spam and %d ham' \
% (self.db_name, self.nspam, self.nham)
else:
# new database
if options.verbose:
print self.db_name,'is a new database'
self.nspam = 0
self.nham = 0
self.wordinfo = {}
def store(self):
'''Place state into persistent store'''
if options.verbose:
print 'Persisting',self.db_name,'state in database'
# Must use .keys() since we modify the dict in the loop
for key in self.wordinfo.keys():
val = self.wordinfo[key]
if val is None:
del self.wordinfo[key]
try:
del self.db[key]
except KeyError:
pass
else:
self.db[key] = val.__getstate__()
self.db[self.statekey] = (classifier.PICKLE_VERSION,
self.nspam, self.nham)
self.db.sync()
def _wordinfoget(self, word):
ret = self.wordinfo.get(word)
if not ret:
r = self.db.get(word)
if r:
ret = self.WordInfoClass()
ret.__setstate__(r)
self.wordinfo[word] = ret
return ret
# _wordinfoset is the same
def _wordinfodel(self, word):
self.wordinfo[word] = None
class Trainer:
'''Associates a Classifier object and one or more Corpora, \
is an observer of the corpora'''
def __init__(self, bayes, is_spam, updateprobs=NO_UPDATEPROBS):
'''Constructor(Classifier, is_spam(True|False), updateprobs(True|False))'''
self.bayes = bayes
self.is_spam = is_spam
self.updateprobs = updateprobs
def onAddMessage(self, message):
'''A message is being added to an observed corpus.'''
self.train(message)
def train(self, message):
'''Train the database with the message'''
if options.verbose:
print 'training with',message.key()
self.bayes.learn(message.tokenize(), self.is_spam)
# self.updateprobs)
def onRemoveMessage(self, message):
'''A message is being removed from an observed corpus.'''
self.untrain(message)
def untrain(self, message):
'''Untrain the database with the message'''
if options.verbose:
print 'untraining with',message.key()
self.bayes.unlearn(message.tokenize(), self.is_spam)
# self.updateprobs)
# can raise ValueError if database is fouled. If this is the case,
# then retraining is the only recovery option.
def trainAll(self, corpus):
'''Train all the messages in the corpus'''
for msg in corpus:
self.train(msg)
def untrainAll(self, corpus):
'''Untrain all the messages in the corpus'''
for msg in corpus:
self.untrain(msg)
class SpamTrainer(Trainer):
'''Trainer for spam'''
def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
'''Constructor'''
Trainer.__init__(self, bayes, True, updateprobs)
class HamTrainer(Trainer):
'''Trainer for ham'''
def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
'''Constructor'''
Trainer.__init__(self, bayes, False, updateprobs)
if __name__ == '__main__':
import sys
print >>sys.stderr, __doc__
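The write-back word cache in DBDictClassifier (_wordinfoget populates an
in-memory dict, _wordinfodel marks entries None, store() flushes) can be
sketched generically. A plain dict stands in for the shelve-backed store,
and the class and method names here are illustrative, not spambayes API.

```python
class WriteBackCache:
    """In-memory overlay flushed to a backing mapping on store()."""
    def __init__(self, backing):
        self.backing = backing   # persistent store (a Shelf in the original)
        self.cache = {}          # in-memory overlay

    def get(self, key):
        # Like _wordinfoget: fall through to the backing store on a miss.
        val = self.cache.get(key)
        if val is None and key in self.backing:
            val = self.cache[key] = self.backing[key]
        return val

    def set(self, key, val):
        self.cache[key] = val

    def delete(self, key):
        # Like _wordinfodel: just mark deleted; reconciled on store().
        self.cache[key] = None

    def store(self):
        # Like store(): iterate a snapshot since we mutate while looping.
        for key in list(self.cache):
            val = self.cache[key]
            if val is None:
                self.cache.pop(key)
                self.backing.pop(key, None)
            else:
                self.backing[key] = val

backing = {"spam": 3}
cache = WriteBackCache(backing)
cache.set("ham", 1)
cache.delete("spam")
cache.store()
```

Deferring deletions as None markers keeps training fast: the slow backing
store is only touched once, when store() reconciles the overlay.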
--- NEW FILE: tokenizer.py ---
#! /usr/bin/env python
"""Module to tokenize email messages for spam filtering."""
from __future__ import generators
import email
import email.Message
import email.Header
import email.Utils
import email.Errors
import re
import math
import time
import os
from sets import Set
from spambayes.Options import options
from spambayes.mboxutils import get_message
[...1305 lines suppressed...]
text, tokens = cracker(text)
for t in tokens:
yield t
# Remove HTML/XML tags. Also &nbsp;.
text = text.replace('&nbsp;', ' ')
text = html_re.sub(' ', text)
# Tokenize everything in the body.
for w in text.split():
n = len(w)
# Make sure this range matches in tokenize_word().
if 3 <= n <= maxword:
yield w
elif n >= 3:
for t in tokenize_word(w):
yield t
tokenize = Tokenizer().tokenize
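The body-token length rule at the end of tokenize can be sketched on its
own: words of "reasonable" length pass through unchanged, while longer ones
go to tokenize_word for decomposition. The skip-token fallback below is a
simplified stand-in for what tokenize_word emits, and maxword's default is
an assumption, not the configured option value.

```python
def body_tokens(text, maxword=20):
    """Yield body tokens using the same length range as tokenize_word()."""
    for w in text.split():
        n = len(w)
        if 3 <= n <= maxword:
            yield w          # plausible word: emit as-is
        elif n >= 3:
            # The real code hands long words to tokenize_word() for
            # special-casing (addresses, URLs, gibberish); a bucketed
            # skip token is one of its fallbacks.
            yield "skip:%c %d" % (w[0], n // 10 * 10)

tokens = list(body_tokens("to read this absurdly long word: " + "x" * 25))
```

Words shorter than three characters (like "to") are dropped entirely, and
the 25-character run collapses to a single bucketed skip token.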