[Spambayes-checkins] spambayes/spambayes Corpus.py,NONE,1.1.2.1
CostCounter.py,NONE,1.1.2.1 FileCorpus.py,NONE,1.1.2.1
Histogram.py,NONE,1.1.2.1 Options.py,NONE,1.1.2.1
TestDriver.py,NONE,1.1.2.1 Tester.py,NONE,1.1.2.1
__init__.py,NONE,1.1.2.1 cdb.py,NONE,1.1.2.1 chi2.py,NONE,1.1.2.1
classifier.py,NONE,1.1.2.1 dbmstorage.py,NONE,1.1.2.1
hammie.py,NONE,1.1.2.1 hammiebulk.py,NONE,1.1.2.1
mboxutils.py,NONE,1.1.2.1 msgs.py,NONE,1.1.2.1
optimize.py,NONE,1.1.2.1 storage.py,NONE,1.1.2.1 tokenizer.py,NONE,1.1.2.1
Anthony Baxter
anthonybaxter at users.sourceforge.net
Fri Jan 10 02:41:10 EST 2003
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv9389/spambayes
Added Files:
Tag: reorg-branch
Corpus.py CostCounter.py FileCorpus.py Histogram.py Options.py
TestDriver.py Tester.py __init__.py cdb.py chi2.py
classifier.py dbmstorage.py hammie.py hammiebulk.py
mboxutils.py msgs.py optimize.py storage.py tokenizer.py
Log Message:
Checkpointing before I head home.
Still to do:
- distutils magic to make sure that the 22compat modules are
installed when needed.
- Walking through testtools and utilities and fixing imports.
- Documentation.
hammie works; everything else that people use in day-to-day operation
should work - please give it a go.
--- NEW FILE: Corpus.py ---
#! /usr/bin/env python
'''Corpus.py - Spambayes corpus management framework.
Classes:
Corpus - a collection of Messages
ExpiryCorpus - a "young" Corpus
Message - a subject of Spambayes training
MessageFactory - creates a Message
Abstract:
A corpus is defined as a set of messages that share some common
characteristic relative to spamness. Examples might be spam, ham,
unsure, untrained, or "bayes rating between .4 and .6". Corpus
is a dictionary that
is keyed by the keys of the messages within it. It is iterable,
and observable. Observers are notified when a message is added
to or removed from the corpus.
Corpus is designed to cache message objects. By default, it will
only engage in lazy creation of message objects, keeping those
objects in memory until the corpus instance itself is destroyed.
In large corpora, this could consume a large amount of memory. A
cacheSize argument on the constructor limits the *number* of
messages concurrently loaded into memory.
The instance variable that implements this cache is
Corpus.Corpus.msgs, a dictionary. Access to this variable should
be through keys(), [key], or using an iterator. Direct access
should not be used, as subclasses that manage their cache may use
this variable very differently.
Iterating Corpus objects is potentially very expensive, as each
message in the corpus will be brought into memory. For large
corpora, this could consume a lot of system resources.
ExpiryCorpus is designed to keep a corpus of file messages that
are guaranteed to be younger than a given age. The age is
specified on the constructor, as a number of seconds in the past.
If a message file was created before that point in time, the
message is deemed to be "old" and thus ignored. Access to a
message that is deemed to be old will raise KeyError, which should
be handled by the corpus user as appropriate. While iterating,
KeyError is handled by the iterator, and messages that raise
KeyError are ignored.
As messages pass their "expiration date," they are eligible for
removal from the corpus. To remove them properly,
removeExpiredMessages() should be called. As messages are removed,
observers are notified.
ExpiryCorpus functionality is mixed into a concrete Corpus through
multiple inheritance. It must appear in the base-class list before
any class that derives from Corpus. For example:
class RealCorpus(Corpus):
...
class ExpiryRealCorpus(Corpus.ExpiryCorpus, RealCorpus):
...
Messages have substance, which is the textual content of the
message. They also have a key, which uniquely defines them within
the corpus. This framework makes no assumptions about how or if
messages persist.
MessageFactory is a required factory class, because Corpus does
lazy initialization of messages and, as an abstract class, must be
told how to create concrete instances of the correct Message
subclass.
To Do:
o Suggestions?
'''
# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.
from __future__ import generators
__author__ = "Tim Stone <tim at fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, all the spambayes contributors."
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def bool(val):
return not not val
import sys # for output of docstring
import time
import re
from spambayes import tokenizer
from spambayes.Options import options
SPAM = True
HAM = False
class Corpus:
'''An observable dictionary of Messages'''
def __init__(self, factory, cacheSize=-1):
'''Constructor(MessageFactory)'''
self.msgs = {} # dict of all messages in corpus
# value is None if msg not currently loaded
self.keysInMemory = [] # keys of messages currently loaded
# this *could* be derived by iterating msgs
self.cacheSize = cacheSize # max number of messages in memory
self.observers = [] # observers of this corpus
self.factory = factory # factory for the correct Message subclass
def addObserver(self, observer):
'''Register an observer, which must implement
onAddMessage, onRemoveMessage'''
self.observers.append(observer)
def addMessage(self, message):
'''Add a Message to this corpus'''
if options.verbose:
print 'adding message %s to corpus' % (message.key())
self.cacheMessage(message)
for obs in self.observers:
# there is no reason that a Corpus observer MUST be a Trainer
# and so it may very well not be interested in AddMessage events
# even though right now the only observable events are
# training related
try:
obs.onAddMessage(message)
except AttributeError: # ignore if not implemented
pass
def removeMessage(self, message):
'''Remove a Message from this corpus'''
key = message.key()
if options.verbose:
print 'removing message %s from corpus' % (key)
self.unCacheMessage(key)
del self.msgs[key]
for obs in self.observers:
# see comments in event loop in addMessage
try:
obs.onRemoveMessage(message)
except AttributeError:
pass
def cacheMessage(self, message):
'''Add a message to the in-memory cache'''
# This method should probably not be overridden
key = message.key()
if options.verbose:
print 'placing %s in corpus cache' % (key)
self.msgs[key] = message
# Here is where we manage the in-memory cache size...
self.keysInMemory.append(key)
if self.cacheSize > 0: # performance optimization
if len(self.keysInMemory) > self.cacheSize:
keyToFlush = self.keysInMemory[0]
self.unCacheMessage(keyToFlush)
def unCacheMessage(self, key):
'''Remove a message from the in-memory cache'''
# This method should probably not be overridden
if options.verbose:
print 'Flushing %s from corpus cache' % (key)
try:
ki = self.keysInMemory.index(key)
except ValueError:
pass
else:
del self.keysInMemory[ki]
self.msgs[key] = None
def takeMessage(self, key, fromcorpus):
'''Move a Message from another corpus to this corpus'''
# XXX Hack: Calling msg.getSubstance() here ensures that the
# message substance is in memory. If it isn't, when addMessage()
# calls message.store(), which calls message.getSubstance(), that
# will try to load the substance from the as-yet-unwritten new file.
msg = fromcorpus[key]
msg.getSubstance()
fromcorpus.removeMessage(msg)
self.addMessage(msg)
def __getitem__(self, key):
'''Corpus is a dictionary'''
amsg = self.msgs[key]
if not amsg:
amsg = self.makeMessage(key) # lazy init, saves memory
self.cacheMessage(amsg)
return amsg
def keys(self):
'''Message keys in the Corpus'''
return self.msgs.keys()
def __iter__(self):
'''Corpus is iterable'''
for key in self.keys():
try:
yield self[key]
except KeyError:
pass
def __str__(self):
'''Instance as a printable string'''
return self.__repr__()
def __repr__(self):
'''Instance as a representative string'''
raise NotImplementedError
def makeMessage(self, key):
'''Call the factory to make a message'''
# This method will likely be overridden
msg = self.factory.create(key)
return msg
class ExpiryCorpus:
'''Corpus of "young" file system artifacts'''
def __init__(self, expireBefore):
'''Constructor'''
self.expireBefore = expireBefore
def removeExpiredMessages(self):
'''Kill expired messages'''
for msg in self:
if msg.createTimestamp() < time.time() - self.expireBefore:
if options.verbose:
print 'message %s has expired' % (msg.key())
self.removeMessage(msg)
class Message:
'''Abstract Message class'''
def __init__(self):
'''Constructor()'''
# The text of the message headers and body are held in attributes
# called 'hdrtxt' and 'payload', created on demand in __getattr__
# by calling load(), which should in turn call setSubstance().
# This means you don't need to remember to call load() before
# using these attributes.
def __getattr__(self, attributeName):
'''On-demand loading of the message text.'''
if attributeName in ('hdrtxt', 'payload'):
self.load()
return getattr(self, attributeName)
def load(self):
'''Method to load headers and body'''
raise NotImplementedError
def store(self):
'''Method to persist a message'''
raise NotImplementedError
def remove(self):
'''Method to obliterate a message'''
raise NotImplementedError
def __repr__(self):
'''Instance as a representative string'''
raise NotImplementedError
def __str__(self):
'''Instance as a printable string'''
return self.getSubstance()
def name(self):
'''Message may have a unique human readable name'''
return self.__repr__()
def key(self):
'''The key for this instance'''
raise NotImplementedError
def setSubstance(self, sub):
'''set this message substance'''
bodyRE = re.compile(r"\r?\n(\r?\n)(.*)", re.DOTALL | re.MULTILINE)
bmatch = bodyRE.search(sub)
if bmatch:
self.payload = bmatch.group(2)
self.hdrtxt = sub[:bmatch.start(2)]
def getSubstance(self):
'''Return this message substance'''
return self.hdrtxt + self.payload
def setSpamprob(self, prob):
'''Score of the last spamprob calc, may not be persistent'''
self.spamprob = prob
def tokenize(self):
'''Returns substance as tokens'''
return tokenizer.tokenize(self.getSubstance())
def createTimestamp(self):
'''Returns the create time of this message'''
# Should return a timestamp like time.time(); the spelling must
# match the call in ExpiryCorpus.removeExpiredMessages().
raise NotImplementedError
def getFrom(self):
'''Return the message From header content'''
if self.hdrtxt:
match = re.search(r'^From:(.*)$', self.hdrtxt, re.MULTILINE)
if match:
return match.group(1)
return None
def getSubject(self):
'''Return the message Subject header contents'''
if self.hdrtxt:
match = re.search(r'^Subject:(.*)$', self.hdrtxt, re.MULTILINE)
if match:
return match.group(1)
return None
def getDate(self):
'''Return the message Date header contents'''
if self.hdrtxt:
match = re.search(r'^Date:(.*)$', self.hdrtxt, re.MULTILINE)
if match:
return match.group(1)
return None
def getHeadersList(self):
'''Return a list of message header tuples'''
hdrregex = re.compile(r'^([A-Za-z0-9-_]*): ?(.*)$', re.MULTILINE)
# Unfold continuation lines.  Note that re.sub() takes a count, not
# a flags argument, so the fold pattern is compiled with MULTILINE.
foldregex = re.compile(r'\r?\n\r?\s', re.MULTILINE)
data = foldregex.sub(' ', self.hdrtxt)
match = hdrregex.findall(data)
return match
def getHeaders(self):
'''Return message headers as text'''
return self.hdrtxt
def getPayload(self):
'''Return the message body'''
return self.payload
def stripSBDHeader(self):
'''Removes the X-Spambayes-Disposition: header from the message'''
# This is useful for training, where a spammer may be spoofing
# our header, to make sure that our header doesn't become an
# overweight clue to hamminess
raise NotImplementedError
class MessageFactory:
'''Abstract Message Factory'''
def __init__(self):
'''Constructor()'''
pass
def create(self, key):
'''Create a message instance'''
raise NotImplementedError
if __name__ == '__main__':
print >>sys.stderr, __doc__
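The observer registration and bounded cache that the docstring describes can be sketched as a tiny standalone program. The `SimpleCorpus`, `SimpleMessage`, and `CountingObserver` names below are hypothetical, not part of this checkin; the sketch mirrors the contract of `Corpus.addObserver`/`cacheMessage` under modern Python:

```python
# Minimal sketch of the Corpus observer/cache contract (hypothetical
# names; the real classes live in spambayes.Corpus).

class SimpleMessage:
    def __init__(self, key, text):
        self._key = key
        self.text = text
    def key(self):
        return self._key

class SimpleCorpus:
    """Observable dictionary of messages with a bounded FIFO cache."""
    def __init__(self, cacheSize=-1):
        self.msgs = {}           # key -> message, or None if flushed
        self.keysInMemory = []   # keys of messages currently cached
        self.cacheSize = cacheSize
        self.observers = []
    def addObserver(self, observer):
        self.observers.append(observer)
    def addMessage(self, message):
        key = message.key()
        self.msgs[key] = message
        self.keysInMemory.append(key)
        # FIFO eviction: flush the oldest cached message but keep its key,
        # so the corpus can lazily re-create it later.
        if 0 < self.cacheSize < len(self.keysInMemory):
            oldest = self.keysInMemory.pop(0)
            self.msgs[oldest] = None
        for obs in self.observers:
            try:
                obs.onAddMessage(message)
            except AttributeError:
                pass   # observer not interested in this event

class CountingObserver:
    def __init__(self):
        self.seen = []
    def onAddMessage(self, message):
        self.seen.append(message.key())

corpus = SimpleCorpus(cacheSize=2)
watcher = CountingObserver()
corpus.addObserver(watcher)
for i in range(3):
    corpus.addMessage(SimpleMessage('MSG%05d' % i, 'body %d' % i))
```

After the loop, all three adds were observed, but only the two newest messages remain cached; the oldest key survives with a None value, ready for lazy re-creation.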
--- NEW FILE: CostCounter.py ---
from spambayes.Options import options
class CostCounter:
name = "Superclass Cost"
def __init__(self):
self.total = 0
def spam(self, scr):
pass
def ham(self, scr):
pass
def __str__(self):
return "%s: $%.4f" % (self.name, self.total)
class CompositeCostCounter:
def __init__(self,cclist):
self.clients = cclist
def spam(self, scr):
for c in self.clients:
c.spam(scr)
def ham(self, scr):
for c in self.clients:
c.ham(scr)
def __str__(self):
s = []
for c in self.clients:
s.append(str(c))
return '\n'.join(s)
class DelayedCostCounter(CompositeCostCounter):
def __init__(self,cclist):
CompositeCostCounter.__init__(self,cclist)
self.spamscr=[]
self.hamscr=[]
def spam(self, scr):
self.spamscr.append(scr)
def ham(self, scr):
self.hamscr.append(scr)
def __str__(self):
for scr in self.spamscr:
CompositeCostCounter.spam(self,scr)
for scr in self.hamscr:
CompositeCostCounter.ham(self,scr)
s=[]
for line in CompositeCostCounter.__str__(self).split('\n'):
s.append('Delayed-'+line)
return '\n'.join(s)
class CountCostCounter(CostCounter):
def __init__(self):
CostCounter.__init__(self)
self._fp = 0
self._fn = 0
self._unsure = 0
self._unsureham = 0
self._unsurespam = 0
self._spam = 0
self._ham = 0
self._correctham = 0
self._correctspam = 0
self._total = 0
def spam(self, scr):
self._total += 1
self._spam += 1
if scr < options.ham_cutoff:
self._fn += 1
elif scr < options.spam_cutoff:
self._unsure += 1
self._unsurespam += 1
else:
self._correctspam += 1
def ham(self, scr):
self._total += 1
self._ham += 1
if scr > options.spam_cutoff:
self._fp += 1
elif scr > options.ham_cutoff:
self._unsure += 1
self._unsureham += 1
else:
self._correctham += 1
def __str__(self):
return ("Total messages: %d; %d (%.1f%%) ham + %d (%.1f%%) spam\n"%(
self._total,
self._ham, zd(100.*self._ham,self._total),
self._spam, zd(100.*self._spam,self._total))+
"Ham: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fp\n"%(
self._correctham, zd(100.*self._correctham,self._ham),
self._unsureham, zd(100.*self._unsureham,self._ham),
self._fp, zd(100.*self._fp,self._ham))+
"Spam: %d (%.2f%%) ok, %d (%.2f%%) unsure, %d (%.2f%%) fn\n"%(
self._correctspam, zd(100.*self._correctspam,self._spam),
self._unsurespam, zd(100.*self._unsurespam,self._spam),
self._fn, zd(100.*self._fn,self._spam))+
"Score False: %.2f%% Unsure %.2f%%"%(
zd(100.*(self._fp+self._fn),self._total),
zd(100.*self._unsure,self._total)))
def zd(x,y):
if y > 0:
return x / y
else:
return 0
class StdCostCounter(CostCounter):
name = "Standard Cost"
def spam(self, scr):
if scr < options.ham_cutoff:
self.total += options.best_cutoff_fn_weight
elif scr < options.spam_cutoff:
self.total += options.best_cutoff_unsure_weight
def ham(self, scr):
if scr > options.spam_cutoff:
self.total += options.best_cutoff_fp_weight
elif scr > options.ham_cutoff:
self.total += options.best_cutoff_unsure_weight
class FlexCostCounter(CostCounter):
name = "Flex Cost"
def _lambda(self, scr):
if scr < options.ham_cutoff:
return 0
elif scr > options.spam_cutoff:
return 1
else:
return (scr - options.ham_cutoff) / (
options.spam_cutoff - options.ham_cutoff)
def spam(self, scr):
self.total += (1 - self._lambda(scr)) * options.best_cutoff_fn_weight
def ham(self, scr):
self.total += self._lambda(scr) * options.best_cutoff_fp_weight
class Flex2CostCounter(FlexCostCounter):
name = "Flex**2 Cost"
def spam(self, scr):
self.total += (1 - self._lambda(scr))**2 * options.best_cutoff_fn_weight
def ham(self, scr):
self.total += self._lambda(scr)**2 * options.best_cutoff_fp_weight
def default():
return CompositeCostCounter([
CountCostCounter(),
StdCostCounter(),
FlexCostCounter(),
Flex2CostCounter(),
DelayedCostCounter([
CountCostCounter(),
StdCostCounter(),
FlexCostCounter(),
Flex2CostCounter(),
])
])
def nodelay():
return CompositeCostCounter([
CountCostCounter(),
StdCostCounter(),
FlexCostCounter(),
Flex2CostCounter(),
])
if __name__=="__main__":
cc=default()
cc.ham(0)
cc.spam(1)
cc.ham(0.5)
cc.spam(0.5)
options.spam_cutoff=0.7
options.ham_cutoff=0.4
print cc
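As a sanity check on the Flex cost formula above, here is a standalone reimplementation with hard-coded cutoffs and illustrative weights in place of the Options machinery (the constant values are assumptions for the example, not the project's configured defaults):

```python
# Standalone sketch of the Flex cost calculation (hypothetical constants;
# the real counters read cutoffs and weights from spambayes.Options).
HAM_CUTOFF, SPAM_CUTOFF = 0.2, 0.9
FN_WEIGHT, FP_WEIGHT = 1.0, 10.0   # illustrative false-neg/false-pos weights

def flex_lambda(scr):
    """0 below the ham cutoff, 1 above the spam cutoff, linear between."""
    if scr < HAM_CUTOFF:
        return 0.0
    elif scr > SPAM_CUTOFF:
        return 1.0
    return (scr - HAM_CUTOFF) / (SPAM_CUTOFF - HAM_CUTOFF)

def flex_cost(score, is_spam):
    # A spam scored low costs up to the false-negative weight; a ham
    # scored high costs up to the false-positive weight.  Scores inside
    # the unsure band incur a proportional partial cost.
    if is_spam:
        return (1 - flex_lambda(score)) * FN_WEIGHT
    return flex_lambda(score) * FP_WEIGHT
```

A ham scored 0.55 sits halfway through the 0.2-0.9 unsure band, so it costs half the false-positive weight; a confidently scored message costs nothing.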
--- NEW FILE: FileCorpus.py ---
#! /usr/bin/env python
"""FileCorpus.py - Corpus composed of file system artifacts
Classes:
FileCorpus - an observable dictionary of FileMessages
ExpiryFileCorpus - a FileCorpus of young files
FileMessage - a subject of Spambayes training
FileMessageFactory - a factory to create FileMessage objects
GzipFileMessage - A FileMessage zipped for less storage
GzipFileMessageFactory - factory to create GzipFileMessage objects
Abstract:
These classes are concrete implementations of the Corpus framework.
FileCorpus is designed to manage corpora that are directories of
message files.
ExpiryFileCorpus is an ExpiryCorpus of file messages.
FileMessage manages messages that are files in the file system.
FileMessageFactory is responsible for the creation of FileMessages,
in response to requests to a corpus for messages.
GzipFileMessage and GzipFileMessageFactory are used to persist messages
as zipped files. This can save a bit of persistent storage, though the
ability of the compressor to do very much deflation is limited by the
relatively small size of the average textual message. Still, for a large
corpus, this could amount to a significant space savings.
See Corpus.__doc__ for more information.
Test harness:
FileCorpus [options]
options:
-h : show this message
-v : execute in verbose mode, useful for general understanding
and debugging purposes
-g : use GzipFileMessage and GzipFileMessageFactory
-s : setup self test, useful for seeing what is going into the
test
-t : setup and execute a self test.
-c : clean up file system after self test
Please note that running with -s or -t will create file system artifacts
in the current directory. Be sure this doesn't stomp something of
yours... The artifacts created are:
fctestmisc.bayes
fctestclass.bayes
fctestspamcorpus/MSG00001
fctestspamcorpus/MSG00002
fctestunsurecorpus/MSG00003
fctestunsurecorpus/MSG00004
fctestunsurecorpus/MSG00005
fctestunsurecorpus/MSG00006
fctesthamcorpus/
After the test has executed, the following file system artifacts
should exist:
fctestmisc.bayes
fctestclass.bayes
fctestspamcorpus/MSG00001
fctestspamcorpus/MSG00004
fctesthamcorpus/MSG00002
fctesthamcorpus/MSG00005
fctesthamcorpus/MSG00006
fctestunsurecorpus/
To Do:
o Suggestions?
"""
# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.
from __future__ import generators
__author__ = "Tim Stone <tim at fourstonesExpressions.com>"
__credits__ = "Richie Hindle, Tim Peters, all the spambayes contributors."
from spambayes import Corpus
from spambayes import storage
import sys, os, gzip, fnmatch, getopt, errno, time, stat
from spambayes.Options import options
class FileCorpus(Corpus.Corpus):
def __init__(self, factory, directory, filter='*', cacheSize=250):
'''Constructor(FileMessageFactory, corpus directory name, fnmatch
filter)'''
Corpus.Corpus.__init__(self, factory, cacheSize)
self.directory = directory
self.filter = filter
# This assumes that the directory exists. A horrible death occurs
# otherwise. We *could* simply create it, but that will likely only
# mask errors
# This will not pick up any changes to the corpus that are made
# through the file system. The key list is established in __init__,
# and if anybody stores files in the directory, even if they match
# the filter, they won't make it into the key list. The same
# problem exists if anybody removes files. This *could* be a problem.
# If so, we can maybe override the keys() method to account for this,
# but there would be training side-effects... The short of it is that
# corpora that are managed by FileCorpus should *only* be managed by
# FileCorpus (at least for now). External changes that must be made
# to the corpus should for the moment be handled by a complete
# retraining.
for filename in os.listdir(directory):
if fnmatch.fnmatch(filename, filter):
self.msgs[filename] = None
def makeMessage(self, key):
'''Ask our factory to make a Message'''
msg = self.factory.create(key, self.directory)
return msg
def addMessage(self, message):
'''Add a Message to this corpus'''
if not fnmatch.fnmatch(message.key(), self.filter):
raise ValueError
if options.verbose:
print 'adding',message.key(),'to corpus'
message.directory = self.directory
message.store()
# superclass processing *MUST* be done, and done *LAST*
Corpus.Corpus.addMessage(self, message)
def removeMessage(self, message):
'''Remove a Message from this corpus'''
if options.verbose:
print 'removing',message.key(),'from corpus'
message.remove()
# superclass processing *MUST* be done, and done *LAST*
Corpus.Corpus.removeMessage(self, message)
def __repr__(self):
'''Instance as a representative string'''
nummsgs = len(self.msgs)
if nummsgs != 1:
s = 's'
else:
s = ''
if options.verbose and nummsgs > 0:
lst = ', ' + '%s' % (self.keys())
else:
lst = ''
return "<%s object at %8.8x, directory: %s, %s message%s%s>" % \
(self.__class__.__name__, \
id(self), \
self.directory, \
nummsgs, s, lst)
class ExpiryFileCorpus(Corpus.ExpiryCorpus, FileCorpus):
'''FileCorpus of "young" file system artifacts'''
def __init__(self, expireBefore, factory, directory, filter='*', cacheSize=250):
'''Constructor(FileMessageFactory, corpus directory name, fnmatch
filter)'''
Corpus.ExpiryCorpus.__init__(self, expireBefore)
FileCorpus.__init__(self, factory, directory, filter, cacheSize)
class FileMessage(Corpus.Message):
'''Message that persists as a file system artifact.'''
def __init__(self,file_name, directory):
'''Constructor(message file name, corpus directory name)'''
Corpus.Message.__init__(self)
self.file_name = file_name
self.directory = directory
# No calling of self.load() here - that's done on demand by
# Message.__getattr__.
def pathname(self):
'''Derive the pathname of the message file'''
return os.path.join(self.directory, self.file_name)
def load(self):
'''Read the Message substance from the file'''
if options.verbose:
print 'loading', self.file_name
pn = self.pathname()
try:
fp = open(pn, 'rb')
except IOError, e:
if e.errno != errno.ENOENT:
raise
else:
self.setSubstance(fp.read())
fp.close()
def store(self):
'''Write the Message substance to the file'''
if options.verbose:
print 'storing', self.file_name
pn = self.pathname()
fp = open(pn, 'wb')
fp.write(self.getSubstance())
fp.close()
def remove(self):
'''Message hara-kiri'''
if options.verbose:
print 'physically deleting file',self.pathname()
os.unlink(self.pathname())
def name(self):
'''A unique name for the message'''
return self.file_name
def key(self):
'''The key of this message in the msgs dictionary'''
return self.file_name
def __repr__(self):
'''Instance as a representative string'''
sub = self.getSubstance()
if not options.verbose and len(sub) > 40:
# abbreviate: first 20 characters, ellipsis, last 20 characters
sub = sub[:20] + '...' + sub[-20:]
pn = os.path.join(self.directory, self.file_name)
return "<%s object at %8.8x, file: %s, %s>" % \
(self.__class__.__name__, \
id(self), pn, sub)
def __str__(self):
'''Instance as a printable string'''
return self.__repr__()
def createTimestamp(self):
'''Return the create timestamp for the file'''
stats = os.stat(self.pathname())
ctime = stats[stat.ST_CTIME]
return ctime
class FileMessageFactory(Corpus.MessageFactory):
'''MessageFactory for FileMessage objects'''
def create(self, key, directory):
'''Create a message object from a filename in a directory'''
return FileMessage(key, directory)
class GzipFileMessage(FileMessage):
'''Message that persists as a zipped file system artifact.'''
def load(self):
'''Read the Message substance from the file'''
if options.verbose:
print 'loading', self.file_name
pn = self.pathname()
try:
fp = gzip.open(pn, 'rb')
except IOError, e:
if e.errno != errno.ENOENT:
raise
else:
self.setSubstance(fp.read())
fp.close()
def store(self):
'''Write the Message substance to the file'''
if options.verbose:
print 'storing', self.file_name
pn = self.pathname()
gz = gzip.open(pn, 'wb')
gz.write(self.getSubstance())
gz.flush()
gz.close()
class GzipFileMessageFactory(FileMessageFactory):
'''MessageFactory for GzipFileMessage objects'''
def create(self, key, directory):
'''Create a message object from a filename in a directory'''
return GzipFileMessage(key, directory)
def runTest(useGzip):
print 'Executing Test'
if useGzip:
fmFact = GzipFileMessageFactory()
print 'Executing with Gzipped files'
else:
fmFact = FileMessageFactory()
print 'Executing with uncompressed files'
print '\n\nCreating two Classifier databases'
miscbayes = storage.PickledClassifier('fctestmisc.bayes')
classbayes = storage.DBDictClassifier('fctestclass.bayes')
print '\n\nSetting up spam corpus'
spamcorpus = FileCorpus(fmFact, 'fctestspamcorpus')
spamtrainer = storage.SpamTrainer(miscbayes)
spamcorpus.addObserver(spamtrainer)
anotherspamtrainer = storage.SpamTrainer(classbayes, storage.UPDATEPROBS)
spamcorpus.addObserver(anotherspamtrainer)
keys = spamcorpus.keys()
keys.sort()
for key in keys: # iterate the list of keys
msg = spamcorpus[key] # corpus is a dictionary
spamtrainer.train(msg)
anotherspamtrainer.train(msg)
print '\n\nSetting up ham corpus'
hamcorpus = FileCorpus(fmFact, \
'fctesthamcorpus', \
'MSG*')
hamtrainer = storage.HamTrainer(miscbayes)
hamcorpus.addObserver(hamtrainer)
hamtrainer.trainAll(hamcorpus)
print '\n\nA couple of message related tests'
if useGzip:
fmClass = GzipFileMessage
else:
fmClass = FileMessage
m1 = fmClass('XMG00001', 'fctestspamcorpus')
m1.setSubstance(testmsg2())
print '\n\nAdd a message to hamcorpus that does not match the filter'
try:
hamcorpus.addMessage(m1)
except ValueError:
print 'Add failed, test passed'
else:
print 'Add passed, test failed'
print '\n\nThis is the hamcorpus'
print hamcorpus
print '\n\nThis is the spamcorpus'
print spamcorpus
print '\n\nSetting up unsure corpus'
# the unsure corpus is an expiry corpus with five second expiry
# and a cache size of 2 (for testing purposes only...), and
# no trainers, since there's no such thing as 'unsure training'
unsurecorpus = ExpiryFileCorpus(5, fmFact, \
'fctestunsurecorpus', 'MSG*', 2)
unsurecorpus.removeExpiredMessages()
print '\n\nIterate the unsure corpus twice, to make sure cache size \
is managed correctly, and to make sure iteration is repeatable. \
We should not see MSG00003 in this iteration.'
for msg in unsurecorpus:
print msg.key() # don't print msg, too much information
print '...and again'
for msg in unsurecorpus:
print msg.key() # don't print msg, too much information
print '\n\nRemoving expired messages from unsure corpus.'
unsurecorpus.removeExpiredMessages()
print '\n\nTrain with an individual message'
anotherhamtrainer = storage.HamTrainer(classbayes)
anotherhamtrainer.train(unsurecorpus['MSG00005'])
print '\n\nMoving msg00002 from spamcorpus to hamcorpus'
hamcorpus.takeMessage('MSG00002', spamcorpus) # Oops. made a mistake...
print "\n\nLet's test printing a message"
msg = spamcorpus['MSG00001']
print msg
print '\n\nThis is some vital information in the message'
print 'Date header is',msg.getDate()
print 'Subject header is',msg.getSubject()
print 'From header is',msg.getFrom()
print 'Header text is:',msg.getHeaders()
print 'Headers are:',msg.getHeadersList()
print 'Body is:',msg.getPayload()
print '\n\nClassifying messages in unsure corpus'
for msg in unsurecorpus:
prob = classbayes.spamprob(msg.tokenize())
print 'Message %s spam probability is %f' % (msg.key(), prob)
if prob < options.ham_cutoff:
print 'Moving %s from unsurecorpus to hamcorpus, \
based on prob of %f' % (msg.key(), prob)
hamcorpus.takeMessage(msg.key(), unsurecorpus)
elif prob > options.spam_cutoff:
print 'Moving %s from unsurecorpus to spamcorpus, \
based on prob of %f' % (msg.key(), prob)
spamcorpus.takeMessage(msg.key(), unsurecorpus)
print '\n\nThis is the new hamcorpus'
print hamcorpus
print '\n\nThis is the new spamcorpus'
print spamcorpus
print '\n\nThis is the new unsurecorpus'
print unsurecorpus
print 'unsurecorpus cache contains', unsurecorpus.keysInMemory
print 'unsurecorpus msgs dict contains', unsurecorpus.msgs
print '\n\nStoring bayes databases'
miscbayes.store()
classbayes.store()
def cleanupTest():
print 'Cleaning up'
cleanupDirectory('fctestspamcorpus')
cleanupDirectory('fctesthamcorpus')
cleanupDirectory('fctestunsurecorpus')
if not useExistingDB:
try:
os.unlink('fctestmisc.bayes')
except OSError, e:
if e.errno != errno.ENOENT:
raise
try:
os.unlink('fctestclass.bayes')
except OSError, e:
if e.errno != errno.ENOENT:
raise
def cleanupDirectory(dirname):
try:
flist = os.listdir(dirname)
except OSError, e:
if e.errno != errno.ENOENT: # directory does not exist
raise
else:
for filename in os.listdir(dirname):
fn = os.path.join(dirname, filename)
os.unlink(fn)
try:
os.rmdir(dirname)
except OSError, e:
if e.errno != errno.ENOENT:
raise
def setupTest(useGzip):
cleanupTest()
print 'Setting up test'
# no try blocks here, because if any of this dies, the test
# cannot proceed
os.mkdir('fctestspamcorpus')
os.mkdir('fctesthamcorpus')
os.mkdir('fctestunsurecorpus')
tm1 = testmsg1()
tm2 = testmsg2()
if useGzip:
fmClass = GzipFileMessage
else:
fmClass = FileMessage
m1 = fmClass('MSG00001', 'fctestspamcorpus')
m1.setSubstance(tm1)
m1.store()
m2 = fmClass('MSG00002', 'fctestspamcorpus')
m2.setSubstance(tm2)
m2.store()
m3 = fmClass('MSG00003', 'fctestunsurecorpus')
m3.setSubstance(tm1)
m3.store()
for x in range(11):
time.sleep(1) # make sure MSG00003 has expired
if 10-x == 1:
s = ''
else:
s = 's'
print 'wait',10-x,'more second%s' % (s)
m4 = fmClass('MSG00004', 'fctestunsurecorpus')
m4.setSubstance(tm1)
m4.store()
m5 = fmClass('MSG00005', 'fctestunsurecorpus')
m5.setSubstance(tm2)
m5.store()
m6 = fmClass('MSG00006', 'fctestunsurecorpus')
m6.setSubstance(tm2)
m6.store()
def testmsg1():
return """
X-Hd:skip at pobox.com Mon Nov 04 10:50:49 2002
Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
1998/05/13) Mon Nov 4 08:50:58 2002)
X-From_:skip at mojam.com Mon Nov 4 08:49:03 2002
Return-Path:<skip at mojam.com>
Delivered-To:timstone at mail.powweb.com
Received:from manatee.mojam.com (manatee.mojam.com [199.249.165.175]) by
mail.powweb.com (Postfix) with ESMTP id DC95A1BB1D0 for
<tim at fourstonesExpressions.com>; Mon, 4 Nov 2002 08:49:02 -0800 (PST)
Received:from montanaro.dyndns.org (12-248-11-90.client.attbi.com
[12.248.11.90]) by manatee.mojam.com (8.12.1/8.12.1) with ESMTP id
gA4Gn0oY029655 for <tim at fourstonesExpressions.com>; Mon, 4 Nov 2002 10:49:00
-0600
Received:from montanaro.dyndns.org (localhost [127.0.0.1]) by
montanaro.dyndns.org (8.12.2/8.12.2) with ESMTP id gA4Gn3cP015572 for
<tim at fourstonesExpressions.com>; Mon, 4 Nov 2002 10:49:03 -0600 (CST)
Received:(from skip at localhost) by montanaro.dyndns.org (8.12.2/8.12.2/Submit)
id gA4Gn37l015569; Mon, 4 Nov 2002 10:49:03 -0600 (CST)
From:Skip Montanaro <skip at pobox.com>
MIME-Version:1.0
Content-Type:text/plain; charset=us-ascii
Content- Transfer- Encoding:7bit
Message-ID:<15814.42238.882013.702030 at montanaro.dyndns.org>
Date:Mon, 4 Nov 2002 10:49:02 -0600
To:Four Stones Expressions <tim at fourstonesExpressions.com>
Subject:Reformat mail to 80 columns?
In-Reply-To:<QOIDLHRPNK62FBRPA9SM54US7504UR65.3dc5eed1 at riven>
References:<8285NLPL5YTTQJGXTAXU3WA8OB2.3dc5e3cc at riven>
<QOIDLHRPNK62FBRPA9SM54US7504UR65.3dc5eed1 at riven>
X-Mailer:VM 7.07 under 21.5 (beta9) "brussels sprouts" XEmacs Lucid
Reply-To:skip at pobox.com
X-Hammie- Disposition:Unsure
11/4/2002 10:49:02 AM, Skip Montanaro <skip at pobox.com> wrote:
>(off-list)
>
>Tim,
>
>Any chance you can easily generate messages to the spambayes list which wrap
>at something between 70 and 78 columns? I find I have to always edit your
>messages to read them easily.
>
>Thanks,
>
>--
>Skip Montanaro - skip at pobox.com
>http://www.mojam.com/
>http://www.musi-cal.com/
>
>
- Tim
www.fourstonesExpressions.com """
def testmsg2():
return """
X-Hd:richie at entrian.com Wed Nov 06 12:05:41 2002
Received:by mail.powweb.com (mbox timstone) (with Cubic Circle's cucipop (v1.31
1998/05/13) Wed Nov 6 10:05:45 2002)
X-From_:richie at entrian.com Wed Nov 6 10:05:33 2002
Return-Path:<richie at entrian.com>
Delivered-To:timstone at mail.powweb.com
Received:from anchor-post-31.mail.demon.net (anchor-post-31.mail.demon.net
[194.217.242.89]) by mail.powweb.com (Postfix) with ESMTP id 3DC431BB06A for
<tim at fourstonesexpressions.com>; Wed, 6 Nov 2002 10:05:33 -0800 (PST)
Received:from sundog.demon.co.uk ([158.152.226.183]) by
anchor-post-31.mail.demon.net with smtp (Exim 3.35 #1) id 189UYP-000IAw-0V for
tim at fourstonesExpressions.com; Wed, 06 Nov 2002 18:05:25 +0000
From:Richie Hindle <richie at entrian.com>
To:tim at fourstonesExpressions.com
Subject:Re: What to call this training stuff
Date:Wed, 06 Nov 2002 18:05:56 +0000
Organization:entrian.com
Reply-To:richie at entrian.com
Message-ID:<d0hisugn3nau4m704kotgpd4jlt33rvrda at 4ax.com>
References:<IFWRHE041VTXW72JGDBD0RTS04YTGE.3dc933a1 at riven>
In-Reply-To:<IFWRHE041VTXW72JGDBD0RTS04YTGE.3dc933a1 at riven>
X-Mailer:Forte Agent 1.7/32.534
MIME-Version:1.0
Content-Type:text/plain; charset=us-ascii
Content-Transfer-Encoding:7bit
X-Hammie-Disposition:Unsure
Hi Tim,
> Richie, I think we should package these classes I've been writing as
> 'corpusManagement.py' What we're really doing here is creating a set of
tools
> that can be used to manage corpi (?) corpusses (?) corpae (?) whatever...
of
> messages.
Good plan. Minor point of style: mixed-case module names (like class
names) tend to have an initial capital: CorpusManagement.py
On the name... sorry to disagree about names again, but what does the word
'management' add? This is a module for manipulating corpuses, so I reckon
it should be called Corpus. Like Cookie, gzip, zipfile, locale, mailbox...
see what I mean?
--
Richie Hindle
richie at entrian.com"""
if __name__ == '__main__':
try:
opts, args = getopt.getopt(sys.argv[1:], 'estgvhcu')
except getopt.error, msg:
print >>sys.stderr, str(msg) + '\n\n' + __doc__
sys.exit()
options.verbose = False
runTestServer = False
setupTestServer = False
cleanupTestServer = False
useGzip = False
useExistingDB = False
for opt, arg in opts:
if opt == '-h':
print >>sys.stderr, __doc__
sys.exit()
elif opt == '-s':
setupTestServer = True
elif opt == '-e':
runTestServer = True
elif opt == '-t':
setupTestServer = True
runTestServer = True
elif opt == '-c':
cleanupTestServer = True
elif opt == '-v':
options.verbose = True
elif opt == '-g':
useGzip = True
elif opt == '-u':
useExistingDB = True
if setupTestServer:
setupTest(useGzip)
if runTestServer:
runTest(useGzip)
elif cleanupTestServer:
cleanupTest()
else:
print >>sys.stderr, __doc__
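The flag handling above can be isolated into a small helper. This is a Python 3 sketch of the same getopt pattern; `parse_flags` is an illustrative name, not part of spambayes:

```python
import getopt

def parse_flags(argv):
    """Sketch of the test driver's flag handling: each single-letter
    option toggles a boolean, and -t is shorthand for -s plus -e."""
    opts, args = getopt.getopt(argv, 'estgvhcu')
    flags = {'setup': False, 'run': False, 'cleanup': False,
             'verbose': False, 'gzip': False, 'existing_db': False}
    for opt, _ in opts:
        if opt == '-s':
            flags['setup'] = True
        elif opt == '-e':
            flags['run'] = True
        elif opt == '-t':
            # -t = set up the test server, then run it
            flags['setup'] = flags['run'] = True
        elif opt == '-c':
            flags['cleanup'] = True
        elif opt == '-v':
            flags['verbose'] = True
        elif opt == '-g':
            flags['gzip'] = True
        elif opt == '-u':
            flags['existing_db'] = True
    return flags
```

For example, `parse_flags(['-t', '-g'])` turns on setup, run, and gzip while leaving the other flags off.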
--- NEW FILE: Histogram.py ---
#! /usr/bin/env python
import math
from spambayes.Options import options
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
class Hist:
"""Simple histograms of float values."""
# Pass None for lo and hi and it will automatically adjust to the min
# and max values seen.
# Note: nbuckets can be passed for backward compatibility. The
# display() method can be passed a different nbuckets value.
def __init__(self, nbuckets=options.nbuckets, lo=0.0, hi=100.0):
self.lo, self.hi = lo, hi
self.nbuckets = nbuckets
self.buckets = [0] * nbuckets
self.data = [] # the raw data points
self.stats_uptodate = False
# Add a value to the collection.
def add(self, x):
self.data.append(x)
self.stats_uptodate = False
# Compute, and set as instance attrs:
# n # of data points
# The rest are set iff n>0:
# min smallest value in collection
# max largest value in collection
# median midpoint
# mean
# pct list of (percentile, score) pairs
# var variance
# sdev population standard deviation (sqrt(variance))
# self.data is also sorted.
def compute_stats(self):
if self.stats_uptodate:
return
self.stats_uptodate = True
data = self.data
n = self.n = len(data)
if n == 0:
return
data.sort()
self.min = data[0]
self.max = data[-1]
if n & 1:
self.median = data[n // 2]
else:
self.median = (data[n // 2] + data[(n-1) // 2]) / 2.0
# Compute mean.
# Add in increasing order of magnitude, to minimize roundoff error.
if data[0] < 0.0:
temp = [(abs(x), x) for x in data]
temp.sort()
data = [x[1] for x in temp]
del temp
sum = 0.0
for x in data:
sum += x
mean = self.mean = sum / n
# Compute variance.
var = 0.0
for x in data:
d = x - mean
var += d*d
self.var = var / n
self.sdev = math.sqrt(self.var)
# Compute percentiles.
self.pct = pct = []
for p in options.percentiles:
assert 0.0 <= p <= 100.0
# In going from data index 0 to index n-1, we move n-1 times.
# p% of that is (n-1)*p/100.
i = (n-1)*p/1e2
if i < 0:
# Just return the smallest.
score = data[0]
else:
whole = int(i)
frac = i - whole
score = data[whole]
if whole < n-1 and frac:
# Move frac of the way from this score to the next.
score += frac * (data[whole + 1] - score)
pct.append((p, score))
# Merge other into self.
def __iadd__(self, other):
self.data.extend(other.data)
self.stats_uptodate = False
return self
def get_lo_hi(self):
self.compute_stats()
lo, hi = self.lo, self.hi
if lo is None:
lo = self.min
if hi is None:
hi = self.max
return lo, hi
def get_bucketwidth(self):
lo, hi = self.get_lo_hi()
span = float(hi - lo)
return span / self.nbuckets
# Set instance var nbuckets to the # of buckets, and buckets to a list
# of nbuckets counts.
def fill_buckets(self, nbuckets=None):
if nbuckets is None:
nbuckets = self.nbuckets
if nbuckets <= 0:
raise ValueError("nbuckets %g > 0 required" % nbuckets)
self.nbuckets = nbuckets
self.buckets = buckets = [0] * nbuckets
# Compute bucket counts.
lo, hi = self.get_lo_hi()
bucketwidth = self.get_bucketwidth()
for x in self.data:
i = int((x - lo) / bucketwidth)
if i >= nbuckets:
i = nbuckets - 1
elif i < 0:
i = 0
buckets[i] += 1
# Print a histogram to stdout.
# Also sets instance var nbuckets to the # of buckets, and
# buckets to a list of nbuckets counts, but only if at least one
# data point is in the collection.
def display(self, nbuckets=None, WIDTH=61):
if nbuckets is None:
nbuckets = self.nbuckets
if nbuckets <= 0:
raise ValueError("nbuckets %g > 0 required" % nbuckets)
self.compute_stats()
n = self.n
if n == 0:
return
print "%d items; mean %.2f; sdev %.2f" % (n, self.mean, self.sdev)
print "-> <stat> min %g; median %g; max %g" % (self.min,
self.median,
self.max)
pcts = ['%g%% %g' % x for x in self.pct]
print "-> <stat> percentiles:", '; '.join(pcts)
lo, hi = self.get_lo_hi()
if lo > hi:
return
# hunit is how many items a * represents. A * is printed for
# each hunit items, plus any non-zero fraction thereof.
self.fill_buckets(nbuckets)
biggest = max(self.buckets)
hunit, r = divmod(biggest, WIDTH)
if r:
hunit += 1
print "* =", hunit, "items"
# We need ndigits decimal digits to display the largest bucket count.
ndigits = len(str(biggest))
# Displaying the bucket boundaries is more troublesome.
bucketwidth = self.get_bucketwidth()
whole_digits = max(len(str(int(lo))),
len(str(int(hi - bucketwidth))))
frac_digits = 0
while bucketwidth < 1.0:
# Incrementing by bucketwidth may not change the last displayed
# digit, so display one more.
frac_digits += 1
bucketwidth *= 10.0
format = ("%" + str(whole_digits + 1 + frac_digits) + '.' +
str(frac_digits) + 'f %' + str(ndigits) + "d")
bucketwidth = self.get_bucketwidth()
for i in range(nbuckets):
n = self.buckets[i]
print format % (lo + i * bucketwidth, n),
print '*' * ((n + hunit - 1) // hunit)
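The percentile computation in Hist.compute_stats uses linear interpolation between adjacent sorted data points. The same arithmetic can be isolated into a few lines; this is a Python 3 sketch (the standalone `percentile` helper is not part of spambayes):

```python
def percentile(data, p):
    """Linear-interpolation percentile, mirroring Hist.compute_stats.
    data must already be sorted; p is in [0, 100]."""
    n = len(data)
    # In going from index 0 to index n-1, we move n-1 times; p% of that:
    i = (n - 1) * p / 100.0
    whole = int(i)
    frac = i - whole
    score = data[whole]
    if whole < n - 1 and frac:
        # Move frac of the way from this score to the next one.
        score += frac * (data[whole + 1] - score)
    return score
```

So `percentile([0, 10], 50)` lands halfway between the two points, giving 5.0.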
--- NEW FILE: Options.py ---
# Options.options is a globally shared options object.
# XXX As this code is, option names must be unique across ini sections,
# XXX and must not conflict with OptionsClass method names.
import sys, os
import StringIO
import ConfigParser
from sets import Set
try:
True, False, bool
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def bool(val):
return not not val
__all__ = ['options']
defaults = """
[Tokenizer]
# If true, tokenizer.Tokenizer.tokenize_headers() will tokenize the
# contents of each header field just like the text of the message
# body, using the name of the header as a tag. Tokens look like
# "header:word". The basic approach is simple and effective, but also
# very sensitive to biases in the ham and spam collections. For
# example, if the ham and spam were collected at different times,
# several headers with date/time information will become the best
# discriminators. (Not just Date, but Received and X-From_.)
basic_header_tokenize: False
# If true and basic_header_tokenize is also true, then
# basic_header_tokenize is the only action performed.
basic_header_tokenize_only: False
# If basic_header_tokenize is true, then basic_header_skip is a set of
# headers that should be skipped.
basic_header_skip: received
date
x-.*
# If true, the first few characters of application/octet-stream sections
# are used, undecoded. What 'few' means is decided by octet_prefix_size.
check_octets: False
octet_prefix_size: 5
# Generate tokens just counting the number of instances of each kind of
# header line, in a case-sensitive way.
#
# Depending on data collection, some headers are not safe to count.
# For example, if ham is collected from a mailing list but spam from your
# regular inbox traffic, the presence of a header like List-Info will be a
# very strong ham clue, but a bogus one. In that case, set
# count_all_header_lines to False, and adjust safe_headers instead.
count_all_header_lines: False
# When True, generate a "noheader:HEADERNAME" token for each header in
# safe_headers (below) that *doesn't* appear in the headers. This helped
# in various of Tim's python.org tests, but appeared to hurt a little in
# Anthony Baxter's tests.
record_header_absence: False
# Like count_all_header_lines, but restricted to headers in this list.
# safe_headers is ignored when count_all_header_lines is true, unless
# record_header_absence is also true.
safe_headers: abuse-reports-to
date
errors-to
from
importance
in-reply-to
message-id
mime-version
organization
received
reply-to
return-path
subject
to
user-agent
x-abuse-info
x-complaints-to
x-face
# A lot of clues can be gotten from IP addresses and names in Received:
# headers. Again, this can give spectacular results for bogus reasons
# if your test corpora are from different sources; otherwise, set this to true.
mine_received_headers: False
# Mine the following address headers. If you have mixed source corpuses
# (as opposed to a mixed sauce walrus, which is delicious!) then you
# probably don't want to use 'to' or 'cc'.
# Address headers will be decoded, and will generate charset tokens as
# well as the real address.
# others to consider: to, cc, reply-to, errors-to, sender, ...
address_headers: from
# If legitimate mail contains things that look like text to the tokenizer,
# turning off this option may help (perhaps binary attachments get
# 'defanged' by something upstream of this operation and thus look like
# text); if so, that should be an alert that perhaps the tokenizer is
# broken.
generate_long_skips: True
# Try to capitalize on mail sent to multiple similar addresses.
summarize_email_prefixes: False
#
# Length of words that triggers 'long skips'. Longer than this
# triggers a skip.
#
skip_max_word_size: 12
# Generate tokens which resemble the posting time in 10-minute buckets:
# 'time:' hour ':' minute//10
generate_time_buckets: False
# Extract day of the week tokens from the Date: header.
extract_dow: False
# If true, replace high-bit characters (ord(c) >= 128) and control characters
# with question marks. This allows non-ASCII character strings to be
# identified with little training and small database burden. It's appropriate
# only if your ham is plain 7-bit ASCII, or nearly so, so that the mere
# presence of non-ASCII character strings is known in advance to be a strong
# spam indicator.
replace_nonascii_chars: False
[TestDriver]
# These control various displays in class TestDriver.Driver, and Tester.Test.
# spam_cutoff and ham_cutoff are used in Python slice sense:
# A msg is considered ham if its score is in 0:ham_cutoff
# A msg is considered unsure if its score is in ham_cutoff:spam_cutoff
# A msg is considered spam if its score is in spam_cutoff:
#
# So it's unsure iff ham_cutoff <= score < spam_cutoff.
# For a binary classifier, make ham_cutoff == spam_cutoff.
# ham_cutoff > spam_cutoff doesn't make sense.
#
# The defaults here (.2 and .9) may be appropriate for the default chi-
# combining scheme. Cutoffs for chi-combining typically aren't touchy,
# provided you're willing to settle for "really good" instead of "optimal".
# Tim found that .3 and .8 worked very well for well-trained systems on
# his personal email, and his large comp.lang.python test. If just beginning
# training, or extremely fearful of mistakes, 0.05 and 0.95 may be more
# appropriate for you.
#
# Picking good values for gary-combining is much harder, and appears to be
# corpus-dependent, and within a single corpus dependent on how much
# training has been done. Values from 0.50 thru the low 0.60's have been
# reported to work best by various testers on their data.
ham_cutoff: 0.20
spam_cutoff: 0.90
# Number of buckets in histograms.
nbuckets: 200
show_histograms: True
# After the display of a ham+spam histogram pair, you can get a listing of
# all the cutoff values (coinciding with histogram bucket boundaries) that
# minimize
#
# best_cutoff_fp_weight * (# false positives) +
# best_cutoff_fn_weight * (# false negatives) +
# best_cutoff_unsure_weight * (# unsure msgs)
#
# This displays two cutoffs: hamc and spamc, where
#
# 0.0 <= hamc <= spamc <= 1.0
#
# The idea is that if something scores < hamc, it's called ham; if
# something scores >= spamc, it's called spam; and everything else is
# called 'I am not sure' -- the middle ground.
#
# Note: You may wish to increase nbuckets, to give this scheme more cutoff
# values to analyze.
compute_best_cutoffs_from_histograms: True
best_cutoff_fp_weight: 10.00
best_cutoff_fn_weight: 1.00
best_cutoff_unsure_weight: 0.20
# Histogram analysis also displays percentiles. For each percentile p
# in the list, the score S such that p% of all scores are <= S is given.
# Note that percentile 50 is the median, and is displayed (along with the
# min score and max score) independent of this option.
percentiles: 5 25 75 95
# Display spam when
# show_spam_lo <= spamprob <= show_spam_hi
# and likewise for ham. The defaults here do not show anything.
show_spam_lo: 1.0
show_spam_hi: 0.0
show_ham_lo: 1.0
show_ham_hi: 0.0
show_false_positives: True
show_false_negatives: False
show_unsure: False
# The maximum # of characters to display for a msg displayed due to the
# show_xyz options above.
show_charlimit: 3000
# If save_trained_pickles is true, Driver.train() saves a binary pickle
# of the classifier after training. The file basename is given by
# pickle_basename, the extension is .pik, and increasing integers are
# appended to pickle_basename. By default (if save_trained_pickles is
# true), the filenames are class1.pik, class2.pik, ... If a file of that
# name already exists, it is overwritten. pickle_basename is ignored when
# save_trained_pickles is false.
# if save_histogram_pickles is true, Driver.train() saves a binary
# pickle of the spam and ham histogram for "all test runs". The file
# basename is given by pickle_basename, the suffix _spamhist.pik
# or _hamhist.pik is appended to the basename.
save_trained_pickles: False
pickle_basename: class
save_histogram_pickles: False
# default locations for timcv and timtest - these get the set number
# interpolated.
spam_directories: Data/Spam/Set%d
ham_directories: Data/Ham/Set%d
[CV Driver]
# A cross-validation driver takes N ham+spam sets, and builds N classifiers,
# training each on N-1 sets, and then predicting against the set not trained
# on. By default, it does this in a clever way, learning *and* unlearning
# sets as it goes along, so that it never needs to train on N-1 sets in one
# gulp after the first time. Setting this option true forces ''one gulp
# from-scratch'' training every time. There used to be a set of combining
# schemes that needed this, but now it is just in case you are paranoid <wink>.
build_each_classifier_from_scratch: False
[Classifier]
# The maximum number of extreme words to look at in a msg, where "extreme"
# means with spamprob farthest away from 0.5. 150 appears to work well
# across all corpora tested.
max_discriminators: 150
# These two control the prior assumption about word probabilities.
# unknown_word_prob is essentially the probability given to a word that
# has never been seen before. Nobody has reported an improvement via moving
# it away from 1/2, although Tim has measured a mean spamprob of a bit over
# 0.5 (0.51-0.55) in 3 well-trained classifiers.
#
# unknown_word_strength adjusts how much weight to give the prior assumption
# relative to the probabilities estimated by counting. At 0, the counting
# estimates are believed 100%, even to the extent of assigning certainty
# (0 or 1) to a word that has appeared in only ham or only spam. This
# is a disaster.
#
# As unknown_word_strength tends toward infinity, all probabilities tend
# toward unknown_word_prob. All reports were that a value near 0.4 worked
# best, so this does not seem to be corpus-dependent.
unknown_word_prob: 0.5
unknown_word_strength: 0.45
# When scoring a message, ignore all words with
# abs(word.spamprob - 0.5) < minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
# tests. 0.1 appeared to work well across all corpora.
minimum_prob_strength: 0.1
# The combining scheme currently detailed on the Robinson web page.
# The middle ground here is touchy, varying across corpus, and within
# a corpus across amounts of training data. It almost never gives extreme
# scores (near 0.0 or 1.0), but the tail ends of the ham and spam
# distributions overlap.
use_gary_combining: False
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
# follows the chi-squared distribution with 2*n degrees of freedom. This is
# the "provably most-sensitive" test the original scheme was monotonic
# with. Getting closer to the theoretical basis appears to give an excellent
# combining method, usually very extreme in its judgment, yet finding a tiny
# (in # of msgs, spread across a huge range of scores) middle ground where
# lots of the mistakes live. This is the best method so far.
# One systematic benefit is its immunity to "cancellation disease". One
# systematic drawback is sensitivity to *any* deviation from a
# uniform distribution, regardless of whether it is actually evidence of
# ham or spam. Rob Hooft alleviated that by combining the final S and H
# measures via (S-H+1)/2 instead of via S/(S+H).
# In practice, it appears that setting ham_cutoff=0.05, and spam_cutoff=0.95,
# does well across test sets; while these cutoffs are rarely optimal, they
# get close to optimal. With more training data, Tim has had good luck
# with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets
# (original c.l.p data, his own email, and newer general python.org traffic).
use_chi_squared_combining: True
# If the # of ham and spam in training data are out of balance, the
# spamprob guesses can get stronger in the direction of the category with
# more training msgs. In one sense this must be so, since the more data
# we have of one flavor, the more we know about that flavor. But that
# allows the accidental appearance of a strong word of that flavor in a msg
# of the other flavor much more power than an accident in the other
# direction. Enable experimental_ham_spam_imbalance_adjustment if you have
# more ham than spam training data (or more spam than ham), and the
# Bayesian probability adjustment won't 'believe' raw counts more than
# min(# ham trained on, # spam trained on) justifies. I *expect* this
# option will go away (and become the default), but people *with* strong
# imbalance need to test it first.
experimental_ham_spam_imbalance_adjustment: False
[Hammie]
# The name of the header that hammie adds to an E-mail in filter mode
# It contains the "classification" of the mail, plus the score.
hammie_header_name: X-Spambayes-Classification
# The three disposition names are added to the header as the following
# three words:
header_spam_string: spam
header_ham_string: ham
header_unsure_string: unsure
# Accuracy of the score in the header in decimal digits
header_score_digits: 2
# Set this to "True", to augment scores of 1.00 or 0.00 by a logarithmic
# "one-ness" or "zero-ness" score (basically it shows the "number of zeros"
# or "number of nines" next to the score value).
header_score_logarithm: False
# Enable debugging information in the header.
hammie_debug_header: False
# Name of a debugging header for spambayes hackers, showing the strongest
# clues that have resulted in the classification in the standard header.
hammie_debug_header_name: X-Hammie-Debug
# The range of clues that are added to the "debug" header in the E-mail
# All clues that have their probability smaller than this number, or larger
# than one minus this number are added to the header such that you can see
# why spambayes thinks this is ham/spam or why it is unsure. The default is
# to show all clues, but you can reduce that by setting showclue to a lower
# value, such as 0.1
clue_mailheader_cutoff: 0.5
[hammiefilter]
# hammiefilter can use either a database (quick to score one message) or
# a pickle (quick to train on huge amounts of messages). Set this to
# True to use a database by default.
hammiefilter_persistent_use_database: True
hammiefilter_persistent_storage_file: ~/.hammiedb
[pop3proxy]
# pop3proxy settings - pop3proxy also respects the options in the Hammie
# section, with the exception of the extra header details at the moment.
# The only mandatory option is pop3proxy_servers, e.g. "pop3.my-isp.com:110",
# or a comma-separated list of those. The ":110" is optional. If you
# specify more than one server in pop3proxy_servers, you must specify the
# same number of ports in pop3proxy_ports.
pop3proxy_servers:
pop3proxy_ports:
pop3proxy_cache_use_gzip: False
pop3proxy_cache_expiry_days: 7
pop3proxy_spam_cache: pop3proxy-spam-cache
pop3proxy_ham_cache: pop3proxy-ham-cache
pop3proxy_unknown_cache: pop3proxy-unknown-cache
pop3proxy_persistent_use_database: False
pop3proxy_persistent_storage_file: hammie.db
# Deprecated - use pop3proxy_servers and pop3proxy_ports instead.
pop3proxy_server_name:
pop3proxy_server_port: 110
pop3proxy_port: 110
[html_ui]
html_ui_port: 8880
html_ui_launch_browser: False
[globals]
verbose: False
# What DBM storage type should we use? Must be one of best, db3hash,
# dbhash, gdbm, or dumbdbm. Windows folk should steer clear of dbhash.
# "best", which will pick the best DBM type available on your platform.
dbm_type: best
"""
int_cracker = ('getint', None)
float_cracker = ('getfloat', None)
boolean_cracker = ('getboolean', bool)
string_cracker = ('get', None)
all_options = {
'Tokenizer': {'safe_headers': ('get', lambda s: Set(s.split())),
'address_headers': ('get', lambda s: Set(s.split())),
'count_all_header_lines': boolean_cracker,
'record_header_absence': boolean_cracker,
'generate_long_skips': boolean_cracker,
'summarize_email_prefixes': boolean_cracker,
'skip_max_word_size': int_cracker,
'extract_dow': boolean_cracker,
'generate_time_buckets': boolean_cracker,
'mine_received_headers': boolean_cracker,
'check_octets': boolean_cracker,
'octet_prefix_size': int_cracker,
'basic_header_tokenize': boolean_cracker,
'basic_header_tokenize_only': boolean_cracker,
'basic_header_skip': ('get', lambda s: Set(s.split())),
'replace_nonascii_chars': boolean_cracker,
},
'TestDriver': {'nbuckets': int_cracker,
'show_ham_lo': float_cracker,
'show_ham_hi': float_cracker,
'show_spam_lo': float_cracker,
'show_spam_hi': float_cracker,
'show_false_positives': boolean_cracker,
'show_false_negatives': boolean_cracker,
'show_unsure': boolean_cracker,
'show_histograms': boolean_cracker,
'percentiles': ('get', lambda s: map(float, s.split())),
'save_trained_pickles': boolean_cracker,
'save_histogram_pickles': boolean_cracker,
'pickle_basename': string_cracker,
'show_charlimit': int_cracker,
'ham_cutoff': float_cracker,
'spam_cutoff': float_cracker,
'spam_directories': string_cracker,
'ham_directories': string_cracker,
'compute_best_cutoffs_from_histograms': boolean_cracker,
'best_cutoff_fp_weight': float_cracker,
'best_cutoff_fn_weight': float_cracker,
'best_cutoff_unsure_weight': float_cracker,
},
'CV Driver': {'build_each_classifier_from_scratch': boolean_cracker,
},
'Classifier': {'max_discriminators': int_cracker,
'unknown_word_prob': float_cracker,
'unknown_word_strength': float_cracker,
'minimum_prob_strength': float_cracker,
'use_gary_combining': boolean_cracker,
'use_chi_squared_combining': boolean_cracker,
'experimental_ham_spam_imbalance_adjustment': boolean_cracker,
},
'Hammie': {'hammie_header_name': string_cracker,
'clue_mailheader_cutoff': float_cracker,
'persistent_use_database': boolean_cracker,
'header_spam_string': string_cracker,
'header_unsure_string': string_cracker,
'header_ham_string': string_cracker,
'header_score_digits': int_cracker,
'header_score_logarithm': boolean_cracker,
'hammie_debug_header': boolean_cracker,
'hammie_debug_header_name': string_cracker,
},
'hammiefilter' : {'hammiefilter_persistent_use_database': boolean_cracker,
'hammiefilter_persistent_storage_file': string_cracker,
},
'pop3proxy': {'pop3proxy_servers': string_cracker,
'pop3proxy_ports': string_cracker,
'pop3proxy_server_name': string_cracker,
'pop3proxy_server_port': int_cracker,
'pop3proxy_port': int_cracker,
'pop3proxy_cache_use_gzip': boolean_cracker,
'pop3proxy_cache_expiry_days': int_cracker,
'pop3proxy_spam_cache': string_cracker,
'pop3proxy_ham_cache': string_cracker,
'pop3proxy_unknown_cache': string_cracker,
'pop3proxy_persistent_use_database': boolean_cracker,
'pop3proxy_persistent_storage_file': string_cracker,
},
'html_ui': {'html_ui_port': int_cracker,
'html_ui_launch_browser': boolean_cracker,
},
'globals': {'verbose': boolean_cracker,
'dbm_type': string_cracker,
},
}
def _warn(msg):
print >> sys.stderr, msg
class OptionsClass(object):
def __init__(self):
self._config = ConfigParser.ConfigParser()
def mergefiles(self, fnamelist):
self._config.read(fnamelist)
self._update()
def mergefilelike(self, filelike):
self._config.readfp(filelike)
self._update()
def _update(self):
nerrors = 0
c = self._config
for section in c.sections():
if section not in all_options:
_warn("config file has unknown section %r" % section)
nerrors += 1
continue
goodopts = all_options[section]
for option in c.options(section):
if option not in goodopts:
_warn("config file has unknown option %r in "
"section %r" % (option, section))
nerrors += 1
continue
fetcher, converter = goodopts[option]
value = getattr(c, fetcher)(section, option)
if converter is not None:
value = converter(value)
setattr(options, option, value)
if nerrors:
raise ValueError("errors while parsing .ini file")
def display(self):
output = StringIO.StringIO()
self._config.write(output)
return output.getvalue()
options = OptionsClass()
d = StringIO.StringIO(defaults)
options.mergefilelike(d)
del d
alternate = None
if hasattr(os, 'getenv'):
alternate = os.getenv('BAYESCUSTOMIZE')
if alternate:
options.mergefiles(alternate.split())
else:
options.mergefiles(['bayescustomize.ini'])
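The "cracker" scheme above pairs each option with a ConfigParser fetch method and an optional converter, then sets the result as an attribute on the shared options object. A minimal Python 3 analogue (configparser replaces the old ConfigParser module; the names and sample options here are illustrative only):

```python
import configparser

# (fetcher, converter) pairs, as in all_options above.
crackers = {
    'nbuckets': ('getint', None),
    'spam_cutoff': ('getfloat', None),
    'percentiles': ('get', lambda s: [float(x) for x in s.split()]),
}

class Opts:
    """Bare attribute bag standing in for OptionsClass."""
    pass

def load(text, opts):
    c = configparser.ConfigParser()
    c.read_string(text)
    for section in c.sections():
        for option in c.options(section):
            fetcher, converter = crackers[option]
            # Fetch with the type-aware method, then convert if needed.
            value = getattr(c, fetcher)(section, option)
            if converter is not None:
                value = converter(value)
            setattr(opts, option, value)

opts = Opts()
load("[TestDriver]\nnbuckets: 200\nspam_cutoff: 0.90\n"
     "percentiles: 5 25 75 95\n", opts)
```

After the load, `opts.nbuckets` is the int 200 and `opts.percentiles` is a list of floats, which is why option names must be unique across sections: they all become attributes of one object.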
--- NEW FILE: TestDriver.py ---
# Loop:
# Optional:
# # Set up a new base classifier for testing.
# new_classifier(), or set_classifier()
# # Run tests against (possibly variants of) this classifier.
# Loop:
# Loop:
# Optional:
# # train on more ham and spam
# train(ham, spam)
# Optional:
# # Forget training for some subset of ham and spam.
# untrain(ham, spam)
# # Predict against other data.
# Loop:
# test(ham, spam)
# # Display stats against all runs on this classifier variant.
# # This also saves the trained classifier, if desired (option
# # save_trained_pickles).
# finishtest()
# # Display stats against all runs.
# alldone()
from sets import Set
import cPickle as pickle
from heapq import heapreplace
from spambayes.Options import options
from spambayes import Tester
from spambayes import classifier
from spambayes.Histogram import Hist
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def printhist(tag, ham, spam, nbuckets=options.nbuckets):
print
print "-> <stat> Ham scores for", tag,
ham.display(nbuckets)
print
print "-> <stat> Spam scores for", tag,
spam.display(nbuckets)
if not options.compute_best_cutoffs_from_histograms:
return
if ham.n == 0 or spam.n == 0:
return
# Figure out "the best" ham & spam cutoff points, meaning the ones that
# minimize
# num_fp * fp_weight + num_fn * fn_weight + num_unsure * unsure_weight
# the total number of misclassified msgs (other definitions are
# certainly possible!).
# At cutoff 0, everything is called spam, so there are no false negatives,
# and every ham is a false positive.
assert ham.nbuckets == spam.nbuckets
n = ham.nbuckets
FPW = options.best_cutoff_fp_weight
FNW = options.best_cutoff_fn_weight
UNW = options.best_cutoff_unsure_weight
# Get running totals: {h,s}total[i] is # of ham/spam below bucket i
htotal = [0] * (n+1)
stotal = [0] * (n+1)
for i in range(1, n+1):
htotal[i] = htotal[i-1] + ham.buckets[i-1]
stotal[i] = stotal[i-1] + spam.buckets[i-1]
assert htotal[-1] == ham.n
assert stotal[-1] == spam.n
best_cost = 1e200 # infinity
bests = [] # best h and s cutoffs
for h in range(n+1):
num_fn = stotal[h]
fn_cost = num_fn * FNW
for s in xrange(h, n+1):
# ham 0:h correct
# h:s unsure
# s: FP
# spam 0:h FN
# h:s unsure
# s: correct
num_fp = htotal[-1] - htotal[s]
num_un = htotal[s] - htotal[h] + stotal[s] - stotal[h]
cost = num_fp * FPW + fn_cost + num_un * UNW
if cost <= best_cost:
if cost < best_cost:
best_cost = cost
bests = []
bests.append((h, s))
print '-> best cost for %s $%.2f' % (tag, best_cost)
print '-> per-fp cost $%.2f; per-fn cost $%.2f; per-unsure cost $%.2f' % (
FPW, FNW, UNW)
if len(bests) > 1:
print '-> achieved at', len(bests), 'cutoff pairs'
info = [('smallest ham & spam cutoffs', bests[0]),
('largest ham & spam cutoffs', bests[-1])]
else:
info = [('achieved at ham & spam cutoffs', bests[0])]
for tag, (h, s) in info:
print '-> %s %g & %g' % (tag, float(h)/n, float(s)/n)
num_fn = stotal[h]
num_fp = htotal[-1] - htotal[s]
num_unh = htotal[s] - htotal[h]
num_uns = stotal[s] - stotal[h]
print '-> fp %d; fn %d; unsure ham %d; unsure spam %d' % (
num_fp, num_fn, num_unh, num_uns)
print '-> fp rate %.3g%%; fn rate %.3g%%; unsure rate %.3g%%' % (
num_fp*1e2 / ham.n, num_fn*1e2 / spam.n,
(num_unh + num_uns)*1e2 / (ham.n + spam.n))
return float(bests[0][0])/n, float(bests[0][1])/n
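The cutoff search in printhist reduces to prefix sums over the two histograms plus a double loop over bucket boundaries. A self-contained Python 3 sketch of that search (the function name and weight defaults are illustrative; unlike printhist, it keeps only one best pair rather than collecting ties):

```python
def best_cutoffs(ham_buckets, spam_buckets, fpw=10.0, fnw=1.0, unw=0.2):
    """Try every (ham, spam) bucket-boundary pair h <= s and return the
    cheapest cost plus the cutoffs as fractions of the score range."""
    n = len(ham_buckets)
    # Running totals: {h,s}total[i] is the # of ham/spam below bucket i.
    htotal = [0] * (n + 1)
    stotal = [0] * (n + 1)
    for i in range(1, n + 1):
        htotal[i] = htotal[i - 1] + ham_buckets[i - 1]
        stotal[i] = stotal[i - 1] + spam_buckets[i - 1]
    best_cost, best = None, None
    for h in range(n + 1):
        fn_cost = stotal[h] * fnw           # spam below h: false negatives
        for s in range(h, n + 1):
            num_fp = htotal[n] - htotal[s]  # ham at/above s: false positives
            num_un = (htotal[s] - htotal[h]) + (stotal[s] - stotal[h])
            cost = num_fp * fpw + fn_cost + num_un * unw
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, (h / n, s / n)
    return best_cost, best
```

On perfectly separated histograms (all ham in the first bucket, all spam in the last) the search finds a zero-cost cutoff pair.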
def printmsg(msg, prob, clues):
print msg.tag
print "prob =", prob
for clue in clues:
print "prob(%r) = %g" % clue
print
guts = str(msg)
if options.show_charlimit > 0:
guts = guts[:options.show_charlimit]
print guts
class Driver:
def __init__(self):
self.falsepos = Set()
self.falseneg = Set()
self.unsure = Set()
self.global_ham_hist = Hist()
self.global_spam_hist = Hist()
self.ntimes_finishtest_called = 0
self.new_classifier()
from spambayes import CostCounter
self.cc=CostCounter.default()
def new_classifier(self):
"""Create and use a new, virgin classifier."""
self.set_classifier(classifier.Bayes())
def set_classifier(self, classifier):
"""Specify a classifier to be used for further testing."""
self.classifier = classifier
self.tester = Tester.Test(classifier)
self.trained_ham_hist = Hist()
self.trained_spam_hist = Hist()
def train(self, ham, spam):
print "-> Training on", ham, "&", spam, "...",
c = self.classifier
nham, nspam = c.nham, c.nspam
self.tester.train(ham, spam)
print c.nham - nham, "hams &", c.nspam - nspam, "spams"
def untrain(self, ham, spam):
print "-> Forgetting", ham, "&", spam, "...",
c = self.classifier
nham, nspam = c.nham, c.nspam
self.tester.untrain(ham, spam)
print nham - c.nham, "hams &", nspam - c.nspam, "spams"
def finishtest(self):
if options.show_histograms:
printhist("all in this training set:",
self.trained_ham_hist, self.trained_spam_hist)
self.global_ham_hist += self.trained_ham_hist
self.global_spam_hist += self.trained_spam_hist
self.trained_ham_hist = Hist()
self.trained_spam_hist = Hist()
self.ntimes_finishtest_called += 1
if options.save_trained_pickles:
fname = "%s%d.pik" % (options.pickle_basename,
self.ntimes_finishtest_called)
print " saving pickle to", fname
fp = file(fname, 'wb')
pickle.dump(self.classifier, fp, 1)
fp.close()
def alldone(self):
if options.show_histograms:
besthamcut,bestspamcut = printhist("all runs:",
self.global_ham_hist,
self.global_spam_hist)
else:
besthamcut = options.ham_cutoff
bestspamcut = options.spam_cutoff
nham = self.global_ham_hist.n
nspam = self.global_spam_hist.n
nfp = len(self.falsepos)
nfn = len(self.falseneg)
nun = len(self.unsure)
print "-> <stat> all runs false positives:", nfp
print "-> <stat> all runs false negatives:", nfn
print "-> <stat> all runs unsure:", nun
print "-> <stat> all runs false positive %:", (nfp * 1e2 / nham)
print "-> <stat> all runs false negative %:", (nfn * 1e2 / nspam)
print "-> <stat> all runs unsure %:", (nun * 1e2 / (nham + nspam))
print "-> <stat> all runs cost: $%.2f" % (
nfp * options.best_cutoff_fp_weight +
nfn * options.best_cutoff_fn_weight +
nun * options.best_cutoff_unsure_weight)
# Set back the options for the delayed calculations in self.cc
options.ham_cutoff = besthamcut
options.spam_cutoff = bestspamcut
print self.cc
if options.save_histogram_pickles:
for f, h in (('ham', self.global_ham_hist),
('spam', self.global_spam_hist)):
fname = "%s_%shist.pik" % (options.pickle_basename, f)
print " saving %s histogram pickle to %s" %(f, fname)
fp = file(fname, 'wb')
pickle.dump(h, fp, 1)
fp.close()
def test(self, ham, spam):
c = self.classifier
t = self.tester
local_ham_hist = Hist()
local_spam_hist = Hist()
def new_ham(msg, prob, lo=options.show_ham_lo,
hi=options.show_ham_hi):
local_ham_hist.add(prob * 100.0)
self.cc.ham(prob)
if lo <= prob <= hi:
print
print "Ham with prob =", prob
prob, clues = c.spamprob(msg, True)
printmsg(msg, prob, clues)
def new_spam(msg, prob, lo=options.show_spam_lo,
hi=options.show_spam_hi):
local_spam_hist.add(prob * 100.0)
self.cc.spam(prob)
if lo <= prob <= hi:
print
print "Spam with prob =", prob
prob, clues = c.spamprob(msg, True)
printmsg(msg, prob, clues)
t.reset_test_results()
print "-> Predicting", ham, "&", spam, "..."
t.predict(spam, True, new_spam)
t.predict(ham, False, new_ham)
print "-> <stat> tested", t.nham_tested, "hams &", t.nspam_tested, \
"spams against", c.nham, "hams &", c.nspam, "spams"
print "-> <stat> false positive %:", t.false_positive_rate()
print "-> <stat> false negative %:", t.false_negative_rate()
print "-> <stat> unsure %:", t.unsure_rate()
print "-> <stat> cost: $%.2f" % (
t.nham_wrong * options.best_cutoff_fp_weight +
t.nspam_wrong * options.best_cutoff_fn_weight +
(t.nham_unsure + t.nspam_unsure) *
options.best_cutoff_unsure_weight)
newfpos = Set(t.false_positives()) - self.falsepos
self.falsepos |= newfpos
print "-> <stat> %d new false positives" % len(newfpos)
if newfpos:
print " new fp:", [e.tag for e in newfpos]
if not options.show_false_positives:
newfpos = ()
for e in newfpos:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues)
newfneg = Set(t.false_negatives()) - self.falseneg
self.falseneg |= newfneg
print "-> <stat> %d new false negatives" % len(newfneg)
if newfneg:
print " new fn:", [e.tag for e in newfneg]
if not options.show_false_negatives:
newfneg = ()
for e in newfneg:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues)
newunsure = Set(t.unsures()) - self.unsure
self.unsure |= newunsure
print "-> <stat> %d new unsure" % len(newunsure)
if newunsure:
print " new unsure:", [e.tag for e in newunsure]
if not options.show_unsure:
newunsure = ()
for e in newunsure:
print '*' * 78
prob, clues = c.spamprob(e, True)
printmsg(e, prob, clues)
if options.show_histograms:
printhist("this pair:", local_ham_hist, local_spam_hist)
self.trained_ham_hist += local_ham_hist
self.trained_spam_hist += local_spam_hist
--- NEW FILE: Tester.py ---
from spambayes.Options import options
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
class Test:
# Pass a classifier instance (an instance of Bayes).
# Loop:
# # Train the classifier with new ham and spam.
# train(ham, spam) # this implies reset_test_results
# Loop:
# Optional:
# # Possibly fiddle the classifier.
# set_classifier()
# # Forget messages the classifier was trained on.
# untrain(ham, spam) # this implies reset_test_results
# Optional:
# reset_test_results()
# # Predict against (presumably new) examples.
# predict(ham, spam)
# Optional:
# suck out the results, via instance variables and
# false_negative_rate(), false_positive_rate(),
# false_negatives(), and false_positives()
def __init__(self, classifier):
self.set_classifier(classifier)
self.reset_test_results()
# Tell the tester which classifier to use.
def set_classifier(self, classifier):
self.classifier = classifier
def reset_test_results(self):
# The number of ham and spam instances tested.
self.nham_tested = self.nspam_tested = 0
# The number of test instances correctly and incorrectly classified.
self.nham_right = 0
self.nham_wrong = 0
self.nham_unsure = 0
self.nspam_right = 0
self.nspam_wrong = 0
self.nspam_unsure = 0
# Lists of bad predictions.
self.ham_wrong_examples = [] # False positives: ham called spam.
self.spam_wrong_examples = [] # False negatives: spam called ham.
self.unsure_examples = [] # ham and spam in middle ground
# Train the classifier on streams of ham and spam. Updates probabilities
# before returning, and resets test results.
def train(self, hamstream=None, spamstream=None):
self.reset_test_results()
learn = self.classifier.learn
if hamstream is not None:
for example in hamstream:
learn(example, False)
if spamstream is not None:
for example in spamstream:
learn(example, True)
# Untrain the classifier on streams of ham and spam. Updates
# probabilities before returning, and resets test results.
def untrain(self, hamstream=None, spamstream=None):
self.reset_test_results()
unlearn = self.classifier.unlearn
if hamstream is not None:
for example in hamstream:
unlearn(example, False)
if spamstream is not None:
for example in spamstream:
unlearn(example, True)
# Run prediction on each sample in stream. You're swearing that stream
# is entirely composed of spam (is_spam True), or of ham (is_spam False).
# Note that mispredictions are saved, and can be retrieved later via
# false_negatives (spam mistakenly called ham) and false_positives (ham
# mistakenly called spam). For this reason, you may wish to wrap examples
# in a little class that identifies the example in a useful way, and whose
# __iter__ produces a token stream for the classifier.
#
# If specified, callback(msg, spam_probability) is called for each
# msg in the stream, after the spam probability is computed.
def predict(self, stream, is_spam, callback=None):
guess = self.classifier.spamprob
for example in stream:
prob = guess(example)
if callback:
callback(example, prob)
is_ham_guessed = prob < options.ham_cutoff
is_spam_guessed = prob >= options.spam_cutoff
if is_spam:
self.nspam_tested += 1
if is_spam_guessed:
self.nspam_right += 1
elif is_ham_guessed:
self.nspam_wrong += 1
self.spam_wrong_examples.append(example)
else:
self.nspam_unsure += 1
self.unsure_examples.append(example)
else:
self.nham_tested += 1
if is_ham_guessed:
self.nham_right += 1
elif is_spam_guessed:
self.nham_wrong += 1
self.ham_wrong_examples.append(example)
else:
self.nham_unsure += 1
self.unsure_examples.append(example)
assert (self.nham_right + self.nham_wrong + self.nham_unsure ==
self.nham_tested)
assert (self.nspam_right + self.nspam_wrong + self.nspam_unsure ==
self.nspam_tested)
def false_positive_rate(self):
"""Percentage of ham mistakenly identified as spam, in 0.0..100.0."""
return self.nham_wrong * 1e2 / (self.nham_tested or 1)
def false_negative_rate(self):
"""Percentage of spam mistakenly identified as ham, in 0.0..100.0."""
return self.nspam_wrong * 1e2 / (self.nspam_tested or 1)
def unsure_rate(self):
return ((self.nham_unsure + self.nspam_unsure) * 1e2 /
((self.nham_tested + self.nspam_tested) or 1))
def false_positives(self):
return self.ham_wrong_examples
def false_negatives(self):
return self.spam_wrong_examples
def unsures(self):
return self.unsure_examples
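The three-way decision inside predict() reduces to a comparison against the
two cutoffs; a minimal sketch (the cutoff values here are illustrative, the
real ones come from Options):

```python
def bucket(prob, ham_cutoff=0.2, spam_cutoff=0.9):
    # Mirrors predict(): below ham_cutoff is ham, at or above
    # spam_cutoff is spam, everything in between is unsure.
    if prob < ham_cutoff:
        return 'ham'
    if prob >= spam_cutoff:
        return 'spam'
    return 'unsure'

assert bucket(0.05) == 'ham'
assert bucket(0.95) == 'spam'
assert bucket(0.5) == 'unsure'
```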
class _Example:
def __init__(self, name, words):
self.name = name
self.words = words
def __iter__(self):
return iter(self.words)
_easy_test = """
>>> from spambayes.classifier import Bayes
>>> from spambayes.Options import options
>>> options.ham_cutoff = options.spam_cutoff = 0.5
>>> good1 = _Example('', ['a', 'b', 'c'])
>>> good2 = _Example('', ['a', 'b'])
>>> bad1 = _Example('', ['c', 'd'])
>>> t = Test(Bayes())
>>> t.train([good1, good2], [bad1])
>>> t.predict([_Example('goodham', ['a', 'b']),
... _Example('badham', ['d']) # FP
... ], False)
>>> t.predict([_Example('goodspam', ['d']),
... _Example('badspam1', ['a']), # FN
... _Example('badspam2', ['a', 'b']), # FN
... _Example('badspam3', ['d', 'a', 'b']) # FN
... ], True)
>>> t.nham_tested
2
>>> t.nham_right, t.nham_wrong
(1, 1)
>>> t.false_positive_rate()
50.0
>>> [e.name for e in t.false_positives()]
['badham']
>>> t.nspam_tested
4
>>> t.nspam_right, t.nspam_wrong
(1, 3)
>>> t.false_negative_rate()
75.0
>>> [e.name for e in t.false_negatives()]
['badspam1', 'badspam2', 'badspam3']
>>> [e.name for e in t.unsures()]
[]
>>> t.unsure_rate()
0.0
"""
__test__ = {'easy': _easy_test}
def _test():
import doctest
from spambayes import Tester
doctest.testmod(Tester)
if __name__ == '__main__':
_test()
--- NEW FILE: __init__.py ---
# package marker.
--- NEW FILE: cdb.py ---
#! /usr/bin/env python
"""
Dan Bernstein's CDB implemented in Python
see http://cr.yp.to/cdb.html
"""
from __future__ import generators
import os
import struct
import mmap
import sys
def uint32_unpack(buf):
return struct.unpack('<L', buf)[0]
def uint32_pack(n):
return struct.pack('<L', n)
CDB_HASHSTART = 5381
def cdb_hash(buf):
h = CDB_HASHSTART
for c in buf:
h = (h + (h << 5)) & 0xffffffffL
h ^= ord(c)
return h
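cdb_hash is djb's h = (h * 33) XOR c hash with the usual 5381 seed, masked to
32 bits. A standalone copy makes the arithmetic easy to check:

```python
def djb_hash(s):
    # h = ((h * 33) ^ ord(c)) & 0xffffffff, starting from the seed 5381;
    # h + (h << 5) is the same as h * 33.
    h = 5381
    for c in s:
        h = ((h + (h << 5)) & 0xffffffff) ^ ord(c)
    return h

assert djb_hash("") == 5381                      # the bare seed
assert djb_hash("a") == (5381 * 33) ^ ord("a")
```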
class Cdb(object):
def __init__(self, fp):
self.fp = fp
fd = fp.fileno()
self.size = os.fstat(fd).st_size
self.map = mmap.mmap(fd, self.size, access=mmap.ACCESS_READ)
self.eod = uint32_unpack(self.map[:4])
self.findstart()
self.loop = 0 # number of hash slots searched under this key
# initialized if loop is nonzero
self.khash = 0
self.hpos = 0
self.hslots = 0
# initialized if findnext() returns 1
self.dpos = 0
self.dlen = 0
def close(self):
self.map.close()
def __iter__(self, fn=None):
pos = 2048
while pos < self.eod:
klen, vlen = struct.unpack("<LL", self.map[pos:pos+8])
pos += 8
key = self.map[pos:pos+klen]
pos += klen
val = self.map[pos:pos+vlen]
pos += vlen
if fn:
yield fn(key, val)
else:
yield (key, val)
def iteritems(self):
return self.__iter__()
def iterkeys(self):
return self.__iter__(lambda k,v: k)
def itervalues(self):
return self.__iter__(lambda k,v: v)
def items(self):
ret = []
for i in self.iteritems():
ret.append(i)
return ret
def keys(self):
ret = []
for i in self.iterkeys():
ret.append(i)
return ret
def values(self):
ret = []
for i in self.itervalues():
ret.append(i)
return ret
def findstart(self):
self.loop = 0
def read(self, n, pos):
# XXX add code for platforms without mmap
return self.map[pos:pos+n]
def match(self, key, pos):
if key == self.read(len(key), pos):
return 1
else:
return 0
def findnext(self, key):
if not self.loop:
u = cdb_hash(key)
buf = self.read(8, u << 3 & 2047)
self.hslots = uint32_unpack(buf[4:])
if not self.hslots:
raise KeyError
self.hpos = uint32_unpack(buf[:4])
self.khash = u
u >>= 8
u %= self.hslots
u <<= 3
self.kpos = self.hpos + u
while self.loop < self.hslots:
buf = self.read(8, self.kpos)
pos = uint32_unpack(buf[4:])
if not pos:
raise KeyError
self.loop += 1
self.kpos += 8
if self.kpos == self.hpos + (self.hslots << 3):
self.kpos = self.hpos
u = uint32_unpack(buf[:4])
if u == self.khash:
buf = self.read(8, pos)
u = uint32_unpack(buf[:4])
if u == len(key):
if self.match(key, pos + 8):
dlen = uint32_unpack(buf[4:])
dpos = pos + 8 + len(key)
return self.read(dlen, dpos)
raise KeyError
def __getitem__(self, key):
self.findstart()
return self.findnext(key)
def get(self, key, default=None):
self.findstart()
try:
return self.findnext(key)
except KeyError:
return default
def cdb_dump(infile):
"""dump a database in djb's cdbdump format"""
db = Cdb(infile)
for key,value in db.iteritems():
print "+%d,%d:%s->%s" % (len(key), len(value), key, value)
print
def cdb_make(outfile, items):
pos = 2048
tables = {} # { h & 255 : [(h, p)] }
# write keys and data
outfile.seek(pos)
for key, value in items:
outfile.write(uint32_pack(len(key)) + uint32_pack(len(value)))
h = cdb_hash(key)
outfile.write(key)
outfile.write(value)
tables.setdefault(h & 255, []).append((h, pos))
pos += 8 + len(key) + len(value)
final = ''
# write hash tables
for i in range(256):
entries = tables.get(i, [])
nslots = 2*len(entries)
final += uint32_pack(pos) + uint32_pack(nslots)
null = (0, 0)
table = [null] * nslots
for h, p in entries:
n = (h >> 8) % nslots
while table[n] is not null:
n = (n + 1) % nslots
table[n] = (h, p)
for h, p in table:
outfile.write(uint32_pack(h) + uint32_pack(p))
pos += 8
# write header (pointers to tables and their lengths)
outfile.flush()
outfile.seek(0)
outfile.write(final)
def test():
#db = Cdb(open("t"))
#print db['one']
#print db['two']
#print db['foo']
#print db['us']
#print db.get('ec')
#print db.get('notthere')
db = open('test.cdb', 'wb')
cdb_make(db,
[('one', 'Hello'),
('two', 'Goodbye'),
('foo', 'Bar'),
('us', 'United States'),
])
db.close()
db = Cdb(open("test.cdb", 'rb'))
print db['one']
print db['two']
print db['foo']
print db['us']
print db.get('ec')
print db.get('notthere')
if __name__ == '__main__':
test()
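The write/read pair can also be exercised in memory. This sketch ports
cdb_make and the sequential-iteration half of Cdb to bytes and BytesIO (a
test harness to illustrate the file layout, not part of the module):

```python
import io
import struct

def djb_hash(data):
    # djb hash over bytes, as cdb_hash does over characters
    h = 5381
    for b in data:
        h = ((h + (h << 5)) & 0xffffffff) ^ b
    return h

def pack(n):
    return struct.pack('<L', n)

def make_cdb(out, items):
    # Same layout as cdb_make: records start at byte 2048, hash tables
    # follow, and the 2048-byte header holds (table pos, nslots) pairs.
    pos = 2048
    tables = {}
    out.seek(pos)
    for key, value in items:
        out.write(pack(len(key)) + pack(len(value)) + key + value)
        h = djb_hash(key)
        tables.setdefault(h & 255, []).append((h, pos))
        pos += 8 + len(key) + len(value)
    header = b''
    for i in range(256):
        entries = tables.get(i, [])
        nslots = 2 * len(entries)
        header += pack(pos) + pack(nslots)
        table = [(0, 0)] * nslots
        for h, p in entries:
            n = (h >> 8) % nslots
            while table[n] != (0, 0):    # linear probing on collision
                n = (n + 1) % nslots
            table[n] = (h, p)
        for h, p in table:
            out.write(pack(h) + pack(p))
            pos += 8
    out.seek(0)
    out.write(header)

def iter_cdb(buf):
    # Walk the record area sequentially, as Cdb.__iter__ does; the first
    # table pointer in the header doubles as the end-of-data offset.
    eod = struct.unpack('<L', buf[:4])[0]
    pos = 2048
    while pos < eod:
        klen, vlen = struct.unpack('<LL', buf[pos:pos + 8])
        pos += 8
        yield buf[pos:pos + klen], buf[pos + klen:pos + klen + vlen]
        pos += klen + vlen

fp = io.BytesIO()
make_cdb(fp, [(b'one', b'Hello'), (b'two', b'Goodbye')])
data = fp.getvalue()
assert dict(iter_cdb(data)) == {b'one': b'Hello', b'two': b'Goodbye'}
```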
--- NEW FILE: chi2.py ---
import math as _math
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def chi2Q(x2, v, exp=_math.exp, min=min):
"""Return prob(chisq >= x2, with v degrees of freedom).
v must be even.
"""
assert v & 1 == 0
# XXX If x2 is very large, exp(-m) will underflow to 0.
m = x2 / 2.0
sum = term = exp(-m)
for i in range(1, v//2):
term *= m / i
sum += term
# With small x2 and large v, accumulated roundoff error, plus error in
# the platform exp(), can cause this to spill a few ULP above 1.0. For
# example, chi2Q(100, 300) on my box has sum == 1.0 + 2.0**-52 at this
# point. Returning a value even a teensy bit over 1.0 is no good.
return min(sum, 1.0)
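A quick numeric sanity check against the closed forms for small even v (a
standalone copy of the series, so it runs without the module):

```python
import math

def chi2q(x2, v):
    # Same series as chi2Q above: prob(chisq >= x2) with even v degrees
    # of freedom, capped at 1.0 to absorb roundoff.
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

# Closed forms: v=2 -> exp(-m), v=4 -> exp(-m) * (1 + m)
assert abs(chi2q(2.0, 2) - math.exp(-1.0)) < 1e-12
assert abs(chi2q(2.0, 4) - 2.0 * math.exp(-1.0)) < 1e-12
assert chi2q(0.0, 10) == 1.0
```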
def normZ(z, sqrt2pi=_math.sqrt(2.0*_math.pi), exp=_math.exp):
"Return value of the unit Gaussian at z."
return exp(-z*z/2.0) / sqrt2pi
def normP(z):
"""Return area under the unit Gaussian from -inf to z.
This is the probability that a zscore is <= z.
"""
# This is very accurate in a fixed-point sense. For negative z of
# large magnitude (<= -8.3), it returns 0.0, essentially because
# P(-z) is, to machine precision, indistinguishable from 1.0 then.
# sum <- area from 0 to abs(z).
a = abs(float(z))
if a >= 8.3:
sum = 0.5
else:
sum2 = term = a * normZ(a)
z2 = a*a
sum = 0.0
i = 1.0
while sum != sum2:
sum = sum2
i += 2.0
term *= z2 / i
sum2 += term
if z >= 0:
result = 0.5 + sum
else:
result = 0.5 - sum
return result
def normIQ(p, sqrt=_math.sqrt, ln=_math.log):
"""Return z such that the area under the unit Gaussian from z to +inf is p.
Must have 0.0 <= p <= 1.0.
"""
assert 0.0 <= p <= 1.0
# This is a low-accuracy rational approximation from Abramowitz & Stegun.
# The absolute error is bounded by 3e-3.
flipped = False
if p > 0.5:
flipped = True
p = 1.0 - p
if p == 0.0:
z = 8.3
else:
t = sqrt(-2.0 * ln(p))
z = t - (2.30753 + .27061*t) / (1. + .99229*t + .04481*t**2)
if flipped:
z = -z
return z
def normIP(p):
"""Return z such that the area under the unit Gaussian from -inf to z is p.
Must have 0.0 <= p <= 1.0.
"""
z = normIQ(1.0 - p)
# One Newton step should double the # of good digits.
return z + (p - normP(z)) / normZ(z)
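normIQ plus the single Newton step in normIP inverts normP to good accuracy;
a self-contained round-trip check (standalone copies of the functions above):

```python
import math

def norm_z(z):
    # Unit Gaussian density, as normZ above.
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def norm_p(z):
    # Series for the Gaussian CDF, as normP above.
    a = abs(float(z))
    if a >= 8.3:
        s = 0.5
    else:
        s2 = term = a * norm_z(a)
        z2, s, i = a * a, 0.0, 1.0
        while s != s2:
            s = s2
            i += 2.0
            term *= z2 / i
            s2 += term
    return 0.5 + s if z >= 0 else 0.5 - s

def norm_iq(p):
    # Low-accuracy rational approximation, as normIQ above.
    flipped = p > 0.5
    if flipped:
        p = 1.0 - p
    if p == 0.0:
        z = 8.3
    else:
        t = math.sqrt(-2.0 * math.log(p))
        z = t - (2.30753 + .27061*t) / (1. + .99229*t + .04481*t**2)
    return -z if flipped else z

def norm_ip(p):
    # One Newton step on the rough inverse, as normIP above.
    z = norm_iq(1.0 - p)
    return z + (p - norm_p(z)) / norm_z(z)

# Round trip: normP(normIP(p)) lands well inside the 3e-3 raw error bound.
for p in (0.1, 0.3, 0.5, 0.8):
    assert abs(norm_p(norm_ip(p)) - p) < 1e-4
```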
def main():
from spambayes.Histogram import Hist
import sys
class WrappedRandom:
# There's no way W-H is equidistributed in 50 dimensions, so use
# Marsaglia-wrapping to shuffle it more.
def __init__(self, baserandom=random.random, tabsize=513):
self.baserandom = baserandom
self.n = tabsize
self.tab = [baserandom() for i in range(tabsize)]
self.next = baserandom()
def random(self):
result = self.next
i = int(result * self.n)
self.next = self.tab[i]
self.tab[i] = self.baserandom()
return result
random = WrappedRandom().random
#from uni import uni as random
#print random
def judge(ps, ln=_math.log, ln2=_math.log(2), frexp=_math.frexp):
H = S = 1.0
Hexp = Sexp = 0
for p in ps:
S *= 1.0 - p
H *= p
if S < 1e-200:
S, e = frexp(S)
Sexp += e
if H < 1e-200:
H, e = frexp(H)
Hexp += e
S = ln(S) + Sexp * ln2
H = ln(H) + Hexp * ln2
n = len(ps)
S = 1.0 - chi2Q(-2.0 * S, 2*n)
H = 1.0 - chi2Q(-2.0 * H, 2*n)
return S, H, (S-H + 1.0) / 2.0
warp = 0
bias = 0.99
if len(sys.argv) > 1:
warp = int(sys.argv[1])
if len(sys.argv) > 2:
bias = float(sys.argv[2])
h = Hist(20, lo=0.0, hi=1.0)
s = Hist(20, lo=0.0, hi=1.0)
score = Hist(20, lo=0.0, hi=1.0)
for i in range(5000):
ps = [random() for j in range(50)]
s1, h1, score1 = judge(ps + [bias] * warp)
s.add(s1)
h.add(h1)
score.add(score1)
print "Result for random vectors of 50 probs, +", warp, "forced to", bias
# Should be uniformly distributed on all-random data.
print
print 'H',
h.display()
# Should be uniformly distributed on all-random data.
print
print 'S',
s.display()
# Distribution doesn't really matter.
print
print '(S-H+1)/2',
score.display()
def showscore(ps, ln=_math.log, ln2=_math.log(2), frexp=_math.frexp):
H = S = 1.0
Hexp = Sexp = 0
for p in ps:
S *= 1.0 - p
H *= p
if S < 1e-200:
S, e = frexp(S)
Sexp += e
if H < 1e-200:
H, e = frexp(H)
Hexp += e
S = ln(S) + Sexp * ln2
H = ln(H) + Hexp * ln2
n = len(ps)
probS = chi2Q(-2*S, 2*n)
probH = chi2Q(-2*H, 2*n)
print "P(chisq >= %10g | v=%3d) = %10g" % (-2*S, 2*n, probS)
print "P(chisq >= %10g | v=%3d) = %10g" % (-2*H, 2*n, probH)
S = 1.0 - probS
H = 1.0 - probH
score = (S-H + 1.0) / 2.0
print "spam prob", S
print " ham prob", H
print "(S-H+1)/2", score
if __name__ == '__main__':
import random
main()
--- NEW FILE: classifier.py ---
#! /usr/bin/env python
# An implementation of a Bayes-like spam classifier.
#
# Paul Graham's original description:
#
# http://www.paulgraham.com/spam.html
#
# A highly fiddled version of that can be retrieved from our CVS repository,
# via tag Last-Graham. This made many demonstrated improvements in error
# rates over Paul's original description.
#
# This code implements Gary Robinson's suggestions, the core of which are
# well explained on his webpage:
#
# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
#
# This is theoretically cleaner, and in testing has performed at least as
# well as our highly tuned Graham scheme did, often slightly better, and
# sometimes much better. It also has "a middle ground", which people like:
# the scores under Paul's scheme were almost always very near 0 or very near
# 1, whether or not the classification was correct. The false positives
# and false negatives under Gary's basic scheme (use_gary_combining) generally
# score in a narrow range around the corpus's best spam_cutoff value.
# However, it doesn't appear possible to guess the best spam_cutoff value in
# advance, and it's touchy.
#
# The chi-combining scheme used by default here gets closer to the theoretical
# basis of Gary's combining scheme, and does give extreme scores, but also
# has a very useful middle ground (small # of msgs spread across a large range
# of scores, and good cutoff values aren't touchy).
#
# This implementation is due to Tim Peters et alia.
import math
from sets import Set
from spambayes.Options import options
from spambayes.chi2 import chi2Q
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
LN2 = math.log(2) # used frequently by chi-combining
PICKLE_VERSION = 5
class WordInfo(object):
# Invariant: For use in a classifier database, at least one of
# spamcount and hamcount must be non-zero.
def __init__(self):
self.__setstate__((0, 0))
def __repr__(self):
return "WordInfo%r" % ((self.spamcount,
self.hamcount),)
def __getstate__(self):
return (self.spamcount,
self.hamcount)
def __setstate__(self, t):
(self.spamcount, self.hamcount) = t
class Classifier:
# Defining __slots__ here made Jeremy's life needlessly difficult when
# trying to hook this all up to ZODB as a persistent object. There's
# no space benefit worth getting from slots in this class; slots were
# used solely to help catch errors earlier, when this code was changing
# rapidly.
#__slots__ = ('wordinfo', # map word to WordInfo record
# 'nspam', # number of spam messages learn() has seen
# 'nham', # number of non-spam messages learn() has seen
# )
# allow a subclass to use a different class for WordInfo
WordInfoClass = WordInfo
def __init__(self):
self.wordinfo = {}
self.probcache = {}
self.nspam = self.nham = 0
def __getstate__(self):
return (PICKLE_VERSION, self.wordinfo, self.nspam, self.nham)
def __setstate__(self, t):
if t[0] != PICKLE_VERSION:
raise ValueError("Can't unpickle -- version %s unknown" % t[0])
(self.wordinfo, self.nspam, self.nham) = t[1:]
self.probcache = {}
# spamprob() implementations. One of the following is aliased to
# spamprob, depending on option settings.
def gary_spamprob(self, wordstream, evidence=False):
"""Return best-guess probability that wordstream is spam.
wordstream is an iterable object producing words.
The return value is a float in [0.0, 1.0].
If optional arg evidence is True, the return value is a pair
probability, evidence
where evidence is a list of (word, probability) pairs.
"""
from math import frexp
# This combination method is due to Gary Robinson; see
# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
# The real P = this P times 2**Pexp. Likewise for Q. We're
# simulating unbounded dynamic float range by hand. If this pans
# out, *maybe* we should store logarithms in the database instead
# and just add them here. But I like keeping raw counts in the
# database (they're easy to understand, manipulate and combine),
# and there's no evidence that this simulation is a significant
# expense.
P = Q = 1.0
Pexp = Qexp = 0
clues = self._getclues(wordstream)
for prob, word, record in clues:
P *= 1.0 - prob
Q *= prob
if P < 1e-200: # move back into range
P, e = frexp(P)
Pexp += e
if Q < 1e-200: # move back into range
Q, e = frexp(Q)
Qexp += e
P, e = frexp(P)
Pexp += e
Q, e = frexp(Q)
Qexp += e
num_clues = len(clues)
if num_clues:
#P = 1.0 - P**(1./num_clues)
#Q = 1.0 - Q**(1./num_clues)
#
# (x*2**e)**n = x**n * 2**(e*n)
n = 1.0 / num_clues
P = 1.0 - P**n * 2.0**(Pexp * n)
Q = 1.0 - Q**n * 2.0**(Qexp * n)
# (P-Q)/(P+Q) is in -1 .. 1; scaling into 0 .. 1 gives
# ((P-Q)/(P+Q)+1)/2 =
# ((P-Q+P+Q)/(P+Q))/2 =
# (2*P/(P+Q))/2 =
# P/(P+Q)
prob = P/(P+Q)
else:
prob = 0.5
if evidence:
clues = [(w, p) for p, w, r in clues]
clues.sort(lambda a, b: cmp(a[1], b[1]))
return prob, clues
else:
return prob
if options.use_gary_combining:
spamprob = gary_spamprob
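The frexp renormalization used above keeps a product of many tiny
probabilities from underflowing to 0.0; a standalone demonstration of the
trick:

```python
import math

def log_product(probs):
    # Track the product as mantissa * 2**exp, renormalizing with frexp
    # whenever the mantissa gets dangerously small, then convert to a log:
    # ln(x * 2**i) = ln(x) + i * ln(2).
    p, pexp = 1.0, 0
    for x in probs:
        p *= x
        if p < 1e-200:
            p, e = math.frexp(p)
            pexp += e
    return math.log(p) + pexp * math.log(2)

probs = [1e-3] * 500
# The naive product underflows to zero...
naive = 1.0
for x in probs:
    naive *= x
assert naive == 0.0
# ...but the frexp-tracked log matches the exact sum of the logs.
assert abs(log_product(probs) - 500 * math.log(1e-3)) < 1e-6
```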
# Across vectors of length n, containing random uniformly-distributed
# probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution
# with 2*n degrees of freedom. This has been proven (in some
# appropriate sense) to be the most sensitive possible test for
# rejecting the hypothesis that a vector of probabilities is uniformly
# distributed. Gary Robinson's original scheme was monotonic *with*
# this test, but skipped the details. Turns out that getting closer
# to the theoretical roots gives a much sharper classification, with
# a very small (in # of msgs), but also very broad (in range of scores),
# "middle ground", where most of the mistakes live. In particular,
# this scheme seems immune to all forms of "cancellation disease": if
# there are many strong ham *and* spam clues, this reliably scores
# close to 0.5. Most other schemes are extremely certain then -- and
# often wrong.
def chi2_spamprob(self, wordstream, evidence=False):
"""Return best-guess probability that wordstream is spam.
wordstream is an iterable object producing words.
The return value is a float in [0.0, 1.0].
If optional arg evidence is True, the return value is a pair
probability, evidence
where evidence is a list of (word, probability) pairs.
"""
from math import frexp, log as ln
# We compute two chi-squared statistics, one for ham and one for
# spam. The sum-of-the-logs business is more sensitive to probs
# near 0 than to probs near 1, so the spam measure uses 1-p (so
# that high-spamprob words have greatest effect), and the ham
# measure uses p directly (so that lo-spamprob words have greatest
# effect).
#
# For optimization, sum-of-logs == log-of-product, and f.p.
# multiplication is a lot cheaper than calling ln(). It's easy
# to underflow to 0.0, though, so we simulate unbounded dynamic
# range via frexp. The real product H = this H * 2**Hexp, and
# likewise the real product S = this S * 2**Sexp.
H = S = 1.0
Hexp = Sexp = 0
clues = self._getclues(wordstream)
for prob, word, record in clues:
S *= 1.0 - prob
H *= prob
if S < 1e-200: # prevent underflow
S, e = frexp(S)
Sexp += e
if H < 1e-200: # prevent underflow
H, e = frexp(H)
Hexp += e
# Compute the natural log of the product = sum of the logs:
# ln(x * 2**i) = ln(x) + i * ln(2).
S = ln(S) + Sexp * LN2
H = ln(H) + Hexp * LN2
n = len(clues)
if n:
S = 1.0 - chi2Q(-2.0 * S, 2*n)
H = 1.0 - chi2Q(-2.0 * H, 2*n)
# How to combine these into a single spam score? We originally
# used (S-H)/(S+H) scaled into [0., 1.], which equals S/(S+H). A
# systematic problem is that we could end up being near-certain
# a thing was (for example) spam, even if S was small, provided
# that H was much smaller.
# Rob Hooft stared at these problems and invented the measure
# we use now, the simpler S-H, scaled into [0., 1.].
prob = (S-H + 1.0) / 2.0
else:
prob = 0.5
if evidence:
clues = [(w, p) for p, w, r in clues]
clues.sort(lambda a, b: cmp(a[1], b[1]))
clues.insert(0, ('*S*', S))
clues.insert(0, ('*H*', H))
return prob, clues
else:
return prob
if options.use_chi_squared_combining:
spamprob = chi2_spamprob
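The claim above that chi-combining lands near 0.5 when strong ham and spam
clues cancel can be checked directly (standalone copies of chi2Q and the
combining arithmetic, without the frexp guard, which is fine for short clue
lists like these):

```python
import math

def chi2q(x2, v):
    # prob(chisq >= x2) for even v, as chi2Q in chi2.py
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi2_combine(probs):
    # S and H measures as in chi2_spamprob, then Rob Hooft's (S-H+1)/2.
    s = h = 0.0
    for p in probs:
        s += math.log(1.0 - p)
        h += math.log(p)
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * s, 2 * n)
    h = 1.0 - chi2q(-2.0 * h, 2 * n)
    return (s - h + 1.0) / 2.0

# Ten strong spam clues and ten equally strong ham clues cancel to 0.5.
assert abs(chi2_combine([0.99] * 10 + [0.01] * 10) - 0.5) < 1e-9
# All strong spam clues push the score toward 1.0.
assert chi2_combine([0.99] * 10) > 0.99
```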
def learn(self, wordstream, is_spam):
"""Teach the classifier by example.
wordstream is a word stream representing a message. If is_spam is
True, you're telling the classifier this message is definitely spam,
else that it's definitely not spam.
"""
self._add_msg(wordstream, is_spam)
def unlearn(self, wordstream, is_spam):
"""In case of pilot error, call unlearn ASAP after screwing up.
Pass the same arguments you passed to learn().
"""
self._remove_msg(wordstream, is_spam)
def probability(self, record):
"""Compute, store, and return prob(msg is spam | msg contains word).
This is the Graham calculation, but stripped of biases, and
stripped of clamping into 0.01 thru 0.99. The Bayesian
adjustment following keeps them in a sane range, and one
that naturally grows the more evidence there is to back up
a probability.
"""
spamcount = record.spamcount
hamcount = record.hamcount
# Try the cache first
try:
return self.probcache[spamcount][hamcount]
except KeyError:
pass
nham = float(self.nham or 1)
nspam = float(self.nspam or 1)
assert hamcount <= nham
hamratio = hamcount / nham
assert spamcount <= nspam
spamratio = spamcount / nspam
prob = spamratio / (hamratio + spamratio)
if options.experimental_ham_spam_imbalance_adjustment:
spam2ham = min(nspam / nham, 1.0)
ham2spam = min(nham / nspam, 1.0)
else:
spam2ham = ham2spam = 1.0
S = options.unknown_word_strength
StimesX = S * options.unknown_word_prob
# Now do Robinson's Bayesian adjustment.
#
# s*x + n*p(w)
# f(w) = --------------
# s + n
#
# I find this easier to reason about like so (equivalent when
# s != 0):
#
# x - p
# p + -------
# 1 + n/s
#
# IOW, it moves p a fraction of the distance from p to x, and
# less so the larger n is, or the smaller s is.
# Experimental:
# Picking a good value for n is interesting: how much empirical
# evidence do we really have? If nham == nspam,
# hamcount + spamcount makes a lot of sense, and the code here
# does that by default.
# But if, e.g., nham is much larger than nspam, p(w) can get a
# lot closer to 0.0 than it can get to 1.0. That in turn makes
# strong ham words (high hamcount) much stronger than strong
# spam words (high spamcount), and that makes the accidental
# appearance of a strong ham word in spam much more damaging than
# the accidental appearance of a strong spam word in ham.
# So we don't give hamcount full credit when nham > nspam (or
# spamcount when nspam > nham): instead we knock hamcount down
# to what it would have been had nham been equal to nspam. IOW,
# we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
# we don't "believe" any count to an extent more than
# min(nspam, nham) justifies.
n = hamcount * spam2ham + spamcount * ham2spam
prob = (StimesX + n * prob) / (S + n)
# Update the cache
try:
self.probcache[spamcount][hamcount] = prob
except KeyError:
self.probcache[spamcount] = {hamcount: prob}
return prob
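Worked numerically, the adjustment f(w) = (s*x + n*p(w)) / (s + n) moves the
raw probability toward x by an amount governed by n (the s and x values below
are illustrative, matching the usual option defaults):

```python
def robinson_adjust(prob, n, s=0.45, x=0.5):
    # f(w) = (s*x + n*p(w)) / (s + n): with no evidence (n == 0) the
    # result is exactly x; as n grows it converges to the raw prob.
    return (s * x + n * prob) / (s + n)

assert robinson_adjust(1.0, 0) == 0.5                    # unseen word -> x
assert abs(robinson_adjust(1.0, 1) - 1.225 / 1.45) < 1e-12
assert abs(robinson_adjust(1.0, 1000) - 1.0) < 1e-3      # evidence dominates
```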
# NOTE: Graham's scheme had a strange asymmetry: when a word appeared
# n>1 times in a single message, training added n to the word's hamcount
# or spamcount, but predicting scored words only once. Tests showed
# that adding only 1 in training, or scoring more than once when
# predicting, hurt under the Graham scheme.
# This isn't so under Robinson's scheme, though: results improve
# if training also counts a word only once. The mean ham score decreases
# significantly and consistently, ham score variance decreases likewise,
# mean spam score decreases (but less than mean ham score, so the spread
# increases), and spam score variance increases.
# I (Tim) speculate that adding n times under the Graham scheme helped
# because it acted against the various ham biases, giving frequently
# repeated spam words (like "Viagra") a quick ramp-up in spamprob; else,
# adding only once in training, a word like that was simply ignored until
# it appeared in 5 distinct training spams. Without the ham-favoring
# biases, though, and never ignoring words, counting n times introduces
# a subtle and unhelpful bias.
# There does appear to be some useful info in how many times a word
# appears in a msg, but distorting spamprob doesn't appear a correct way
# to exploit it.
def _add_msg(self, wordstream, is_spam):
self.probcache = {} # nuke the prob cache
if is_spam:
self.nspam += 1
else:
self.nham += 1
for word in Set(wordstream):
record = self._wordinfoget(word)
if record is None:
record = self.WordInfoClass()
if is_spam:
record.spamcount += 1
else:
record.hamcount += 1
self._wordinfoset(word, record)
def _remove_msg(self, wordstream, is_spam):
self.probcache = {} # nuke the prob cache
if is_spam:
if self.nspam <= 0:
raise ValueError("spam count would go negative!")
self.nspam -= 1
else:
if self.nham <= 0:
raise ValueError("non-spam count would go negative!")
self.nham -= 1
for word in Set(wordstream):
record = self._wordinfoget(word)
if record is not None:
if is_spam:
if record.spamcount > 0:
record.spamcount -= 1
else:
if record.hamcount > 0:
record.hamcount -= 1
if record.hamcount == 0 == record.spamcount:
self._wordinfodel(word)
else:
self._wordinfoset(word, record)
def _getclues(self, wordstream):
mindist = options.minimum_prob_strength
unknown = options.unknown_word_prob
clues = [] # (distance, prob, word, record) tuples
pushclue = clues.append
for word in Set(wordstream):
record = self._wordinfoget(word)
if record is None:
prob = unknown
else:
prob = self.probability(record)
distance = abs(prob - 0.5)
if distance >= mindist:
pushclue((distance, prob, word, record))
clues.sort()
if len(clues) > options.max_discriminators:
del clues[0 : -options.max_discriminators]
# Return (prob, word, record).
return [t[1:] for t in clues]
def _wordinfoget(self, word):
return self.wordinfo.get(word)
def _wordinfoset(self, word, record):
self.wordinfo[word] = record
def _wordinfodel(self, word):
del self.wordinfo[word]
Bayes = Classifier
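The clue-selection rule in _getclues above can be sketched standalone: keep
only words whose probability is at least minimum_prob_strength away from the
neutral 0.5, then retain the max_discriminators strongest clues. The word
probabilities and option values below are made-up stand-ins for illustration,
not real spambayes data or defaults.

```python
def strongest_clues(word_probs, max_discriminators=150,
                    minimum_prob_strength=0.1):
    """Mirror _getclues: filter weak words, keep the strongest clues."""
    clues = []  # (distance, prob, word) tuples
    for word, prob in word_probs.items():
        distance = abs(prob - 0.5)
        if distance >= minimum_prob_strength:
            clues.append((distance, prob, word))
    clues.sort()
    # Too many clues: drop the weakest (smallest-distance) ones, as in
    # the `del clues[0 : -options.max_discriminators]` line above.
    if len(clues) > max_discriminators:
        del clues[0:-max_discriminators]
    # Return (prob, word), weakest clue first.
    return [(prob, word) for distance, prob, word in clues]

# Hypothetical per-word spam probabilities.
probs = {"viagra": 0.99, "meeting": 0.02, "the": 0.51, "offer": 0.8}
clues = strongest_clues(probs, max_discriminators=3)
```

Note that "the" (probability 0.51, distance 0.01 from neutral) is discarded
before the max_discriminators cap is even considered.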
--- NEW FILE: dbmstorage.py ---
"""Wrapper to open an appropriate dbm storage type."""
from spambayes.Options import options
import sys
class error(Exception):
pass
def open_db3hash(*args):
"""Open a bsddb3 hash."""
import bsddb3
return bsddb3.hashopen(*args)
def open_dbhash(*args):
"""Open a bsddb hash. Don't use this on Windows."""
import bsddb
return bsddb.hashopen(*args)
def open_gdbm(*args):
"""Open a gdbm database."""
import gdbm
return gdbm.open(*args)
def open_dumbdbm(*args):
"""Open a dumbdbm database."""
import dumbdbm
return dumbdbm.open(*args)
def open_best(*args):
if sys.platform == "win32":
funcs = [open_db3hash, open_gdbm, open_dumbdbm]
else:
funcs = [open_db3hash, open_dbhash, open_gdbm, open_dumbdbm]
for f in funcs:
try:
return f(*args)
except ImportError:
pass
raise error("No dbm modules available!")
open_funcs = {
"best": open_best,
"db3hash": open_db3hash,
"dbhash": open_dbhash,
"gdbm": open_gdbm,
"dumbdbm": open_dumbdbm,
}
def open(*args):
dbm_type = options.dbm_type.lower()
f = open_funcs.get(dbm_type)
if not f:
raise error("Unknown dbm type in options file")
return f(*args)
--- NEW FILE: hammie.py ---
#! /usr/bin/env python
from spambayes import mboxutils
from spambayes import storage
from spambayes.Options import options
from spambayes.tokenizer import tokenize
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
class Hammie:
"""A spambayes mail filter.
This implements the basic functionality needed to score, filter, or
train.
"""
def __init__(self, bayes):
self.bayes = bayes
def _scoremsg(self, msg, evidence=False):
"""Score a Message.
msg can be a string, a file object, or a Message object.
Returns the probability the message is spam. If evidence is
true, returns a tuple: (probability, clues), where clues is a
list of the words which contributed to the score.
"""
return self.bayes.spamprob(tokenize(msg), evidence)
def formatclues(self, clues, sep="; "):
"""Format the clues into something readable."""
return sep.join(["%r: %.2f" % (word, prob)
for word, prob in clues
if (word[0] == '*' or
prob <= options.clue_mailheader_cutoff or
prob >= 1.0 - options.clue_mailheader_cutoff)])
def score(self, msg, evidence=False):
"""Score (judge) a message.
msg can be a string, a file object, or a Message object.
Returns the probability the message is spam. If evidence is
true, returns a tuple: (probability, clues), where clues is a
list of the words which contributed to the score.
"""
return self._scoremsg(msg, evidence)
def filter(self, msg, header=None, spam_cutoff=None,
ham_cutoff=None, debugheader=None,
debug=None):
"""Score (judge) a message and add a disposition header.
msg can be a string, a file object, or a Message object.
Optionally, set header to the name of the header to add, and/or
spam_cutoff/ham_cutoff to the probability values which must be met
or exceeded for a message to get a 'Spam' or 'Ham' classification.
An extra debugging header can be added if 'debug' is set to True.
The name of the debugging header is given as 'debugheader'.
All defaults for optional parameters come from the Options file.
Returns the same message with a new disposition header.
"""
if header is None:
header = options.hammie_header_name
if spam_cutoff is None:
spam_cutoff = options.spam_cutoff
if ham_cutoff is None:
ham_cutoff = options.ham_cutoff
if debugheader is None:
debugheader = options.hammie_debug_header_name
if debug is None:
debug = options.hammie_debug_header
msg = mboxutils.get_message(msg)
try:
del msg[header]
except KeyError:
pass
prob, clues = self._scoremsg(msg, True)
if prob < ham_cutoff:
disp = options.header_ham_string
elif prob > spam_cutoff:
disp = options.header_spam_string
else:
disp = options.header_unsure_string
disp += ("; %." + str(options.header_score_digits) + "f") % prob
if options.header_score_logarithm:
import math
if 0.0 < prob <= 0.005:
x = -math.log10(prob)
disp += " (%d)" % x
elif 0.995 <= prob < 1.0:
x = -math.log10(1.0 - prob)
disp += " (%d)" % x
msg.add_header(header, disp)
if debug:
disp = self.formatclues(clues)
msg.add_header(debugheader, disp)
return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))
def train(self, msg, is_spam):
"""Train bayes with a message.
msg can be a string, a file object, or a Message object.
is_spam should be 1 if the message is spam, 0 if not.
"""
self.bayes.learn(tokenize(msg), is_spam)
def untrain(self, msg, is_spam):
"""Untrain bayes with a message.
msg can be a string, a file object, or a Message object.
is_spam should be 1 if the message is spam, 0 if not.
"""
self.bayes.unlearn(tokenize(msg), is_spam)
def train_ham(self, msg):
"""Train bayes with ham.
msg can be a string, a file object, or a Message object.
"""
self.train(msg, False)
def train_spam(self, msg):
"""Train bayes with spam.
msg can be a string, a file object, or a Message object.
"""
self.train(msg, True)
def untrain_ham(self, msg):
"""Untrain bayes with ham.
msg can be a string, a file object, or a Message object.
"""
self.untrain(msg, False)
def untrain_spam(self, msg):
"""Untrain bayes with spam.
msg can be a string, a file object, or a Message object.
"""
self.untrain(msg, True)
def store(self):
"""Write out the persistent store.
This makes sure the persistent store reflects what is currently
in memory. You would want to do this after a write and before
exiting.
"""
self.bayes.store()
def open(filename, usedb=True, mode='r'):
"""Open a file, returning a Hammie instance.
If usedb is False, open as a pickle instead of a DBDict. mode is
used as the flag to open DBDict objects. 'c' for read-write (create
if needed), 'r' for read-only, 'w' for read-write.
"""
if usedb:
b = storage.DBDictClassifier(filename, mode)
else:
b = storage.PickledClassifier(filename)
return Hammie(b)
if __name__ == "__main__":
# Everybody's used to running hammie.py. Why mess with success? ;)
import hammiebulk
hammiebulk.main()
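The disposition logic in Hammie.filter can be sketched in isolation. The
cutoff values, digit count, and "Ham"/"Spam"/"Unsure" labels below are
assumed stand-ins for the Options defaults, not the configured values.

```python
import math

def disposition(prob, ham_cutoff=0.2, spam_cutoff=0.9, digits=2,
                logarithm=True):
    """Build the disposition header value for a given spam probability."""
    # Classify by the cutoffs, as Hammie.filter does.
    if prob < ham_cutoff:
        disp = "Ham"
    elif prob > spam_cutoff:
        disp = "Spam"
    else:
        disp = "Unsure"
    disp += ("; %." + str(digits) + "f") % prob
    # For extreme scores the rounded value is uninformative, so append
    # -log10 of the distance from 0 or 1: "0.00 (6)" means roughly 1e-6.
    if logarithm:
        if 0.0 < prob <= 0.005:
            disp += " (%d)" % -math.log10(prob)
        elif 0.995 <= prob < 1.0:
            disp += " (%d)" % -math.log10(1.0 - prob)
    return disp
```

A score of exactly 0.0 or 1.0 skips the logarithmic annotation, since the
log of zero distance is undefined.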
--- NEW FILE: hammiebulk.py ---
#! /usr/bin/env python
"""Usage: %(program)s [-D|-d] [options]
Where:
-h
show usage and exit
-d
use the DBM store. A DBM file is larger than the pickle and
creating it is slower, but loading it is much faster,
especially for large word databases. Recommended for use with
hammiefilter or any procmail-based filter.
-D
use the pickle store. A pickle is smaller and faster to create,
but much slower to load. Recommended for use with pop3proxy and
hammiesrv.
-p FILE
use FILE as the persistent store. Loads data from this file if it
exists, and saves data to this file at the end.
Default: %(DEFAULTDB)s
-f
run as a filter: read a single message from stdin, add a new
header, and write it to stdout. If you want to run from
procmail, this is your option.
-g PATH
mbox or directory of known good messages (non-spam) to train on.
Can be specified more than once, or use - for stdin.
-s PATH
mbox or directory of known spam messages to train on.
Can be specified more than once, or use - for stdin.
-u PATH
mbox of unknown messages. A ham/spam decision is reported for each.
Can be specified more than once.
-r
reverse the meaning of the check (report ham instead of spam).
Only meaningful with the -u option.
"""
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def bool(val):
return not not val
import sys
import os
import types
import getopt
import mailbox
import glob
import email
import errno
import cPickle as pickle
from spambayes.Options import options
from spambayes import classifier, mboxutils, hammie, Corpus
Corpus.Verbose = True
program = sys.argv[0] # For usage(); referenced by docstring above
# Default database name
DEFAULTDB = os.path.expanduser(options.hammiefilter_persistent_storage_file)
# Probability at which a message is considered spam
SPAM_THRESHOLD = options.spam_cutoff
HAM_THRESHOLD = options.ham_cutoff
def train(h, msgs, is_spam):
"""Train bayes with all messages from a mailbox."""
mbox = mboxutils.getmbox(msgs)
i = 0
for msg in mbox:
i += 1
sys.stdout.write("\r%6d" % i)
sys.stdout.flush()
h.train(msg, is_spam)
print
def score(h, msgs, reverse=0):
"""Score (judge) all messages from a mailbox."""
# XXX The reporting needs work!
mbox = mboxutils.getmbox(msgs)
i = 0
spams = hams = 0
for msg in mbox:
i += 1
prob, clues = h.score(msg, True)
if hasattr(msg, '_mh_msgno'):
msgno = msg._mh_msgno
else:
msgno = i
isspam = (prob >= SPAM_THRESHOLD)
if isspam:
spams += 1
if not reverse:
print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
print h.formatclues(clues)
else:
hams += 1
if reverse:
print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
print h.formatclues(clues)
return (spams, hams)
def usage(code, msg=''):
"""Print usage message and sys.exit(code)."""
if msg:
print >> sys.stderr, msg
print >> sys.stderr
print >> sys.stderr, __doc__ % globals()
sys.exit(code)
def main():
"""Main program; parse options and go."""
try:
opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r')
except getopt.error, msg:
usage(2, msg)
if not opts:
usage(2, "No options given")
pck = DEFAULTDB
good = []
spam = []
unknown = []
reverse = 0
do_filter = False
usedb = None
mode = 'r'
for opt, arg in opts:
if opt == '-h':
usage(0)
elif opt == '-g':
good.append(arg)
mode = 'c'
elif opt == '-s':
spam.append(arg)
mode = 'c'
elif opt == '-p':
pck = arg
elif opt == "-d":
usedb = True
elif opt == "-D":
usedb = False
elif opt == "-f":
do_filter = True
elif opt == '-u':
unknown.append(arg)
elif opt == '-r':
reverse = 1
if args:
usage(2, "Positional arguments not allowed")
if usedb is None:
usage(2, "Must specify one of -d or -D")
save = False
h = hammie.open(pck, usedb, mode)
for g in good:
print "Training ham (%s):" % g
train(h, g, False)
save = True
for s in spam:
print "Training spam (%s):" % s
train(h, s, True)
save = True
if save:
h.store()
if do_filter:
msg = sys.stdin.read()
filtered = h.filter(msg)
sys.stdout.write(filtered)
if unknown:
(spams, hams) = (0, 0)
for u in unknown:
if len(unknown) > 1:
print "Scoring", u
s, g = score(h, u, reverse)
spams += s
hams += g
print "Total %d spam, %d ham" % (spams, hams)
if __name__ == "__main__":
main()
--- NEW FILE: mboxutils.py ---
#! /usr/bin/env python
"""Utilities for dealing with various types of mailboxes.
This is mostly a wrapper around the various useful classes in the
standard mailbox module, to do some intelligent guessing of the
mailbox type given a mailbox argument.
+foo -- MH mailbox +foo
+foo,bar -- MH mailboxes +foo and +bar concatenated
+ALL -- a shortcut for *all* MH mailboxes
/foo/bar -- (existing file) a Unix-style mailbox
/foo/bar/ -- (existing directory) a directory full of .txt and .lorien
files
/foo/bar/ -- (existing directory with a cur/ subdirectory)
Maildir mailbox
/foo/Mail/bar/ -- (existing directory with /Mail/ in its path)
alternative way of spelling an MH mailbox
"""
from __future__ import generators
import os
import sys
import glob
import email
import mailbox
import email.Message
import re
class DirOfTxtFileMailbox:
"""Mailbox directory consisting of .txt and .lorien files."""
def __init__(self, dirname, factory):
self.names = (glob.glob(os.path.join(dirname, "*.txt")) +
glob.glob(os.path.join(dirname, "*.lorien")))
self.names.sort()
self.factory = factory
def __iter__(self):
for name in self.names:
try:
f = open(name)
except IOError:
continue
yield self.factory(f)
f.close()
def _cat(seqs):
for seq in seqs:
for item in seq:
yield item
def getmbox(name):
"""Return an mbox iterator given a file/directory/folder name."""
if name == "-":
return [get_message(sys.stdin)]
if name.startswith("+"):
# MH folder name: +folder, +f1,f2,f3, or +ALL
name = name[1:]
import mhlib
mh = mhlib.MH()
if name == "ALL":
names = mh.listfolders()
elif ',' in name:
names = name.split(',')
else:
names = [name]
mboxes = []
mhpath = mh.getpath()
for name in names:
filename = os.path.join(mhpath, name)
mbox = mailbox.MHMailbox(filename, get_message)
mboxes.append(mbox)
if len(mboxes) == 1:
return iter(mboxes[0])
else:
return _cat(mboxes)
if os.path.isdir(name):
# XXX Bogus: use a Maildir if /cur is a subdirectory, else a MHMailbox
# if the pathname contains /Mail/, else a DirOfTxtFileMailbox.
if os.path.exists(os.path.join(name, 'cur')):
mbox = mailbox.Maildir(name, get_message)
elif name.find("/Mail/") >= 0:
mbox = mailbox.MHMailbox(name, get_message)
else:
mbox = DirOfTxtFileMailbox(name, get_message)
else:
fp = open(name, "rb")
mbox = mailbox.PortableUnixMailbox(fp, get_message)
return iter(mbox)
def get_message(obj):
"""Return an email Message object.
The argument may be a Message object already, in which case it's
returned as-is.
If the argument is a string or file-like object (supports read()),
the email package is used to create a Message object from it. This
can fail if the message is malformed. In that case, the headers
(everything through the first blank line) are thrown out, and the
rest of the text is wrapped in a bare email.Message.Message.
"""
if isinstance(obj, email.Message.Message):
return obj
# Create an email Message object.
if hasattr(obj, "read"):
obj = obj.read()
try:
msg = email.message_from_string(obj)
except email.Errors.MessageParseError:
# Wrap the raw text in a bare Message object. Since the
# headers are most likely damaged, we can't use the email
# package to parse them, so just get rid of them first.
headers = extract_headers(obj)
obj = obj[len(headers):]
msg = email.Message.Message()
msg.set_payload(obj)
return msg
header_break_re = re.compile(r"\r?\n(\r?\n)")
def extract_headers(text):
"""Very simple-minded header extraction: prefix of text up to blank line.
A blank line is recognized via two adjacent line-ending sequences, where
a line-ending sequence is a newline optionally preceded by a carriage
return.
If no blank line is found, all of text is considered to be a potential
header section. If a blank line is found, the text up to (but not
including) the blank line is considered to be a potential header section.
The potential header section is returned, unless it doesn't contain a
colon, in which case an empty string is returned.
>>> extract_headers("abc")
''
>>> extract_headers("abc\\n\\n\\n") # no colon
''
>>> extract_headers("abc: xyz\\n\\n\\n")
'abc: xyz\\n'
>>> extract_headers("abc: xyz\\r\\n\\r\\n\\r\\n")
'abc: xyz\\r\\n'
>>> extract_headers("a: b\\ngibberish\\n\\nmore gibberish")
'a: b\\ngibberish\\n'
"""
m = header_break_re.search(text)
if m:
eol = m.start(1)
text = text[:eol]
if ':' not in text:
text = ""
return text
def _test():
import doctest, mboxutils
return doctest.testmod(mboxutils)
if __name__ == "__main__":
_test()
--- NEW FILE: msgs.py ---
from __future__ import generators
import os
import random
from spambayes.tokenizer import tokenize
HAMTEST = None
SPAMTEST = None
HAMTRAIN = None
SPAMTRAIN = None
SEED = random.randrange(2000000000)
class Msg(object):
__slots__ = 'tag', 'guts'
def __init__(self, dir, name):
path = dir + "/" + name
self.tag = path
f = open(path, 'rb')
self.guts = f.read()
f.close()
def __iter__(self):
return tokenize(self.guts)
# Compare msgs by their paths; this is appropriate for sets of msgs.
def __hash__(self):
return hash(self.tag)
def __eq__(self, other):
return self.tag == other.tag
def __str__(self):
return self.guts
# The iterator yields a stream of Msg objects, taken from a list of
# directories.
class MsgStream(object):
__slots__ = 'tag', 'directories', 'keep'
def __init__(self, tag, directories, keep=None):
self.tag = tag
self.directories = directories
self.keep = keep
def __str__(self):
return self.tag
def produce(self):
if self.keep is None:
for directory in self.directories:
for fname in os.listdir(directory):
yield Msg(directory, fname)
return
# We only want part of the msgs. Shuffle each directory list, but
# in such a way that we'll get the same result each time this is
# called on the same directory list.
for directory in self.directories:
all = os.listdir(directory)
random.seed(hash(max(all)) ^ SEED) # reproducible across calls
random.shuffle(all)
del all[self.keep:]
all.sort() # seems to speed access on Win98!
for fname in all:
yield Msg(directory, fname)
def __iter__(self):
return self.produce()
class HamStream(MsgStream):
def __init__(self, tag, directories, train=0):
if train:
MsgStream.__init__(self, tag, directories, HAMTRAIN)
else:
MsgStream.__init__(self, tag, directories, HAMTEST)
class SpamStream(MsgStream):
def __init__(self, tag, directories, train=0):
if train:
MsgStream.__init__(self, tag, directories, SPAMTRAIN)
else:
MsgStream.__init__(self, tag, directories, SPAMTEST)
def setparms(hamtrain, spamtrain, hamtest=None, spamtest=None, seed=None):
"""Set HAMTEST/TRAIN and SPAMTEST/TRAIN.
If seed is not None, also set SEED.
If (ham|spam)test are not set, set to the same as the (ham|spam)train
numbers (backwards compat option).
"""
global HAMTEST, SPAMTEST, HAMTRAIN, SPAMTRAIN, SEED
HAMTRAIN, SPAMTRAIN = hamtrain, spamtrain
if hamtest is None:
HAMTEST = HAMTRAIN
else:
HAMTEST = hamtest
if spamtest is None:
SPAMTEST = SPAMTRAIN
else:
SPAMTEST = spamtest
if seed is not None:
SEED = seed
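The reproducible-subset trick in MsgStream.produce can be isolated: seeding
the shuffle from the directory contents means the same "random" subset comes
back on every call over the same listing. The filenames and SEED value below
are made up for illustration.

```python
import random

SEED = 123456789  # stands in for the module-level random SEED above

def reproducible_subset(names, keep):
    """Shuffle deterministically, keep `keep` names, restore sort order."""
    names = list(names)
    # Same listing -> same seed -> same shuffle, reproducible across calls.
    random.seed(hash(max(names)) ^ SEED)
    random.shuffle(names)
    del names[keep:]   # discard everything past the first `keep` names
    names.sort()       # sorted access "seems to speed access on Win98!"
    return names

files = ["msg%03d" % i for i in range(10)]
first = reproducible_subset(files, 4)
second = reproducible_subset(files, 4)
```

Both calls return the identical four names, so train/test splits stay stable
between runs without storing the selection anywhere.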
--- NEW FILE: optimize.py ---
#
__version__ = '$Id: optimize.py,v 1.1.2.1 2003/01/10 10:41:08 anthonybaxter Exp $'
#
# Optimize any parametric function.
#
import copy
import Numeric
def SimplexMaximize(var, err, func, convcrit = 0.001, minerr = 0.001):
var = Numeric.array(var)
simplex = [var]
for i in range(len(var)):
var2 = copy.copy(var)
var2[i] = var[i] + err[i]
simplex.append(var2)
value = []
for i in range(len(simplex)):
value.append(func(simplex[i]))
while 1:
# Determine worst and best
wi = 0
bi = 0
for i in range(len(simplex)):
if value[wi] > value[i]:
wi = i
if value[bi] < value[i]:
bi = i
# Test for convergence
#print "worst, best are",wi,bi,"with",value[wi],value[bi]
if abs(value[bi] - value[wi]) <= convcrit:
return simplex[bi]
# Calculate average of non-worst
ave=Numeric.zeros(len(var), 'd')
for i in range(len(simplex)):
if i != wi:
ave = ave + simplex[i]
ave = ave / (len(simplex) - 1)
worst = Numeric.array(simplex[wi])
# Check for too-small simplex
simsize = Numeric.add.reduce(Numeric.absolute(ave - worst))
if simsize <= minerr:
#print "Size of simplex too small:",simsize
return simplex[bi]
# Invert worst
new = 2 * ave - simplex[wi]
newv = func(new)
if newv <= value[wi]:
# Even worse. Shrink instead
#print "Shrunk simplex"
#print "ave=",repr(ave)
#print "wi=",repr(worst)
new = 0.5 * ave + 0.5 * worst
newv = func(new)
elif newv > value[bi]:
# Better than the best. Expand
new2 = 3 * ave - 2 * worst
newv2 = func(new2)
if newv2 > newv:
# Accept
#print "Expanded simplex"
new = new2
newv = newv2
simplex[wi] = new
value[wi] = newv
def DoubleSimplexMaximize(var, err, func, convcrit=0.001, minerr=0.001):
err = Numeric.array(err)
var = SimplexMaximize(var, err, func, convcrit*5, minerr*5)
return SimplexMaximize(var, 0.4 * err, func, convcrit, minerr)
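The simplex loop above depends on the old Numeric package; the same
reflect/expand/shrink scheme can be sketched in pure Python with lists. This
is a simplified port under the stated assumptions (a maxiter guard is added
that the original lacks), not a drop-in replacement.

```python
def simplex_maximize(var, err, func, convcrit=0.001, minerr=0.001,
                     maxiter=10000):
    """Maximize func over len(var) parameters via a simplex search."""
    add = lambda a, b: [x + y for x, y in zip(a, b)]
    scale = lambda a, s: [x * s for x in a]
    # Build the initial simplex: var plus one point per perturbed axis.
    simplex = [list(var)]
    for i in range(len(var)):
        point = list(var)
        point[i] += err[i]
        simplex.append(point)
    value = [func(p) for p in simplex]
    for _ in range(maxiter):
        wi = min(range(len(simplex)), key=lambda i: value[i])  # worst
        bi = max(range(len(simplex)), key=lambda i: value[i])  # best
        if abs(value[bi] - value[wi]) <= convcrit:
            return simplex[bi]
        # Average of the non-worst points.
        ave = [0.0] * len(var)
        for i, p in enumerate(simplex):
            if i != wi:
                ave = add(ave, p)
        ave = scale(ave, 1.0 / (len(simplex) - 1))
        if sum(abs(a - w) for a, w in zip(ave, simplex[wi])) <= minerr:
            return simplex[bi]  # simplex has collapsed
        # Reflect the worst point through the average.
        new = add(scale(ave, 2.0), scale(simplex[wi], -1.0))
        newv = func(new)
        if newv <= value[wi]:
            # Even worse: shrink the worst point toward the average.
            new = add(scale(ave, 0.5), scale(simplex[wi], 0.5))
            newv = func(new)
        elif newv > value[bi]:
            # Better than the best: try expanding further.
            new2 = add(scale(ave, 3.0), scale(simplex[wi], -2.0))
            newv2 = func(new2)
            if newv2 > newv:
                new, newv = new2, newv2
        simplex[wi] = new
        value[wi] = newv
    return simplex[bi]

# A concave quadratic with its maximum at (3, -1).
best = simplex_maximize([0.0, 0.0], [1.0, 1.0],
                        lambda p: -(p[0] - 3.0) ** 2 - (p[1] + 1.0) ** 2)
```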
--- NEW FILE: storage.py ---
#! /usr/bin/env python
'''storage.py - Spambayes database management framework.
Classes:
PickledClassifier - Classifier that uses a pickle db
DBDictClassifier - Classifier that uses a shelve db
Trainer - Classifier training observer
SpamTrainer - Trainer for spam
HamTrainer - Trainer for ham
Abstract:
The *Classifier classes are subclasses of Classifier (classifier.Classifier)
that add automatic state store/restore functionality to the Classifier class.
PickledClassifier is a Classifier class that uses a cPickle
datastore. This database is relatively small, but slower than other
databases.
DBDictClassifier is a Classifier class that uses a database
store.
Trainer is a concrete class that observes a Corpus and trains a
Classifier object based upon movement of messages between corpora. When
an add message notification is received, the trainer trains the
database with the message, as spam or ham as appropriate given the
type of trainer (spam or ham). When a remove message notification
is received, the trainer untrains the database as appropriate.
SpamTrainer and HamTrainer are convenience subclasses of Trainer that
initialize as the appropriate type of Trainer.
To Do:
o ZODBClassifier
o Would Trainer.trainall really want to train with the whole corpus,
or just a random subset?
o Suggestions?
'''
# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.
__author__ = "Neale Pickett <neale at woozle.org>, \
Tim Stone <tim at fourstonesExpressions.com>"
__credits__ = "All the spambayes contributors."
try:
True, False
except NameError:
# Maintain compatibility with Python 2.2
True, False = 1, 0
def bool(val):
return not not val
from spambayes import classifier
from spambayes.Options import options
import cPickle as pickle
import errno
import shelve
from spambayes import dbmstorage
# Make shelve use binary pickles by default.
oldShelvePickler = shelve.Pickler
def binaryDefaultPickler(f, binary=1):
return oldShelvePickler(f, binary)
shelve.Pickler = binaryDefaultPickler
PICKLE_TYPE = 1
NO_UPDATEPROBS = False # Probabilities will not be autoupdated with training
UPDATEPROBS = True # Probabilities will be autoupdated with training
class PickledClassifier(classifier.Classifier):
'''Classifier object persisted in a pickle'''
def __init__(self, db_name):
classifier.Classifier.__init__(self)
self.db_name = db_name
self.load()
def load(self):
'''Load this instance from the pickle.'''
# This is a bit strange, because the loading process
# creates a temporary instance of PickledClassifier, from which
# this object's state is copied. This is a nuance of the way
# that pickle does its job
if options.verbose:
print 'Loading state from',self.db_name,'pickle'
tempbayes = None
try:
fp = open(self.db_name, 'rb')
except IOError, e:
if e.errno != errno.ENOENT: raise
else:
tempbayes = pickle.load(fp)
fp.close()
# XXX: why not self.__setstate__(tempbayes.__getstate__())?
if tempbayes:
self.wordinfo = tempbayes.wordinfo
self.nham = tempbayes.nham
self.nspam = tempbayes.nspam
if options.verbose:
print '%s is an existing pickle, with %d ham and %d spam' \
% (self.db_name, self.nham, self.nspam)
else:
# new pickle
if options.verbose:
print self.db_name,'is a new pickle'
self.wordinfo = {}
self.nham = 0
self.nspam = 0
def store(self):
'''Store self as a pickle'''
if options.verbose:
print 'Persisting',self.db_name,'as a pickle'
fp = open(self.db_name, 'wb')
pickle.dump(self, fp, PICKLE_TYPE)
fp.close()
class DBDictClassifier(classifier.Classifier):
'''Classifier object persisted in a caching database'''
def __init__(self, db_name, mode='c'):
'''Constructor(database name)'''
classifier.Classifier.__init__(self)
self.wordcache = {}
self.statekey = "saved state"
self.mode = mode
self.db_name = db_name
self.load()
def load(self):
'''Load state from database'''
if options.verbose:
print 'Loading state from',self.db_name,'database'
self.dbm = dbmstorage.open(self.db_name, self.mode)
self.db = shelve.Shelf(self.dbm)
if self.db.has_key(self.statekey):
t = self.db[self.statekey]
if t[0] != classifier.PICKLE_VERSION:
raise ValueError("Can't unpickle -- version %s unknown" % t[0])
(self.nspam, self.nham) = t[1:]
if options.verbose:
print '%s is an existing database, with %d spam and %d ham' \
% (self.db_name, self.nspam, self.nham)
else:
# new database
if options.verbose:
print self.db_name,'is a new database'
self.nspam = 0
self.nham = 0
self.wordinfo = {}
def store(self):
'''Place state into persistent store'''
if options.verbose:
print 'Persisting',self.db_name,'state in database'
# Must use .keys() since we modify the dict in the loop
for key in self.wordinfo.keys():
val = self.wordinfo[key]
if val is None:
del self.wordinfo[key]
try:
del self.db[key]
except KeyError:
pass
else:
self.db[key] = val.__getstate__()
self.db[self.statekey] = (classifier.PICKLE_VERSION,
self.nspam, self.nham)
self.db.sync()
def _wordinfoget(self, word):
ret = self.wordinfo.get(word)
if not ret:
r = self.db.get(word)
if r:
ret = self.WordInfoClass()
ret.__setstate__(r)
self.wordinfo[word] = ret
return ret
# _wordinfoset is the same
def _wordinfodel(self, word):
self.wordinfo[word] = None
class Trainer:
'''Associates a Classifier object and one or more Corpora, \
is an observer of the corpora'''
def __init__(self, bayes, is_spam, updateprobs=NO_UPDATEPROBS):
'''Constructor(Classifier, is_spam(True|False), updateprobs(True|False))'''
self.bayes = bayes
self.is_spam = is_spam
self.updateprobs = updateprobs
def onAddMessage(self, message):
'''A message is being added to an observed corpus.'''
self.train(message)
def train(self, message):
'''Train the database with the message'''
if options.verbose:
print 'training with',message.key()
self.bayes.learn(message.tokenize(), self.is_spam)
# self.updateprobs)
def onRemoveMessage(self, message):
'''A message is being removed from an observed corpus.'''
self.untrain(message)
def untrain(self, message):
'''Untrain the database with the message'''
if options.verbose:
print 'untraining with',message.key()
self.bayes.unlearn(message.tokenize(), self.is_spam)
# self.updateprobs)
# can raise ValueError if database is fouled. If this is the case,
# then retraining is the only recovery option.
def trainAll(self, corpus):
'''Train all the messages in the corpus'''
for msg in corpus:
self.train(msg)
def untrainAll(self, corpus):
'''Untrain all the messages in the corpus'''
for msg in corpus:
self.untrain(msg)
class SpamTrainer(Trainer):
'''Trainer for spam'''
def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
'''Constructor'''
Trainer.__init__(self, bayes, True, updateprobs)
class HamTrainer(Trainer):
'''Trainer for ham'''
def __init__(self, bayes, updateprobs=NO_UPDATEPROBS):
'''Constructor'''
Trainer.__init__(self, bayes, False, updateprobs)
if __name__ == '__main__':
import sys
print >>sys.stderr, __doc__
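The write-back word cache in DBDictClassifier (_wordinfoget populates an
in-memory dict, _wordinfodel marks entries None, store() flushes) can be
sketched generically. A plain dict stands in for the shelve-backed store,
and the class and method names here are illustrative, not spambayes API.

```python
class WriteBackCache:
    """In-memory overlay flushed to a backing mapping on store()."""
    def __init__(self, backing):
        self.backing = backing   # persistent store (a Shelf in the original)
        self.cache = {}          # in-memory overlay

    def get(self, key):
        # Like _wordinfoget: fall through to the backing store on a miss.
        val = self.cache.get(key)
        if val is None and key in self.backing:
            val = self.cache[key] = self.backing[key]
        return val

    def set(self, key, val):
        self.cache[key] = val

    def delete(self, key):
        # Like _wordinfodel: just mark deleted; reconciled on store().
        self.cache[key] = None

    def store(self):
        # Like store(): iterate a snapshot since we mutate while looping.
        for key in list(self.cache):
            val = self.cache[key]
            if val is None:
                self.cache.pop(key)
                self.backing.pop(key, None)
            else:
                self.backing[key] = val

backing = {"spam": 3}
cache = WriteBackCache(backing)
cache.set("ham", 1)
cache.delete("spam")
cache.store()
```

Deferring deletions as None markers keeps training fast: the slow backing
store is only touched once, when store() reconciles the overlay.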
--- NEW FILE: tokenizer.py ---
#! /usr/bin/env python
"""Module to tokenize email messages for spam filtering."""
from __future__ import generators
import email
import email.Message
import email.Header
import email.Utils
import email.Errors
import re
import math
import time
import os
from sets import Set
from spambayes.Options import options
from spambayes.mboxutils import get_message
[...1305 lines suppressed...]
text, tokens = cracker(text)
for t in tokens:
yield t
# Remove HTML/XML tags. Also &nbsp;.
text = text.replace('&nbsp;', ' ')
text = html_re.sub(' ', text)
# Tokenize everything in the body.
for w in text.split():
n = len(w)
# Make sure this range matches in tokenize_word().
if 3 <= n <= maxword:
yield w
elif n >= 3:
for t in tokenize_word(w):
yield t
tokenize = Tokenizer().tokenize
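The body-token length rule at the end of tokenize can be sketched on its
own: words of "reasonable" length pass through unchanged, while longer ones
go to tokenize_word for decomposition. The skip-token fallback below is a
simplified stand-in for what tokenize_word emits, and maxword's default is
an assumption, not the configured option value.

```python
def body_tokens(text, maxword=20):
    """Yield body tokens using the same length range as tokenize_word()."""
    for w in text.split():
        n = len(w)
        if 3 <= n <= maxword:
            yield w          # plausible word: emit as-is
        elif n >= 3:
            # The real code hands long words to tokenize_word() for
            # special-casing (addresses, URLs, gibberish); a bucketed
            # skip token is one of its fallbacks.
            yield "skip:%c %d" % (w[0], n // 10 * 10)

tokens = list(body_tokens("to read this absurdly long word: " + "x" * 25))
```

Words shorter than three characters (like "to") are dropped entirely, and
the 25-character run collapses to a single bucketed skip token.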