[Spambayes-checkins] spambayes/scripts sb_client.py, NONE, 1.1 sb_dbexpimp.py, NONE, 1.1 sb_filter.py, NONE, 1.1 sb_imapfilter.py, NONE, 1.1 sb_mailsort.py, NONE, 1.1 sb_mboxtrain.py, NONE, 1.1 sb_notesfilter.py, NONE, 1.1 sb_pop3dnd.py, NONE, 1.1 sb_server.py, NONE, 1.1 sb_smtpproxy.py, NONE, 1.1 sb_unheader.py, NONE, 1.1 sb_upload.py, NONE, 1.1 sb_xmlrpcserver.py, NONE, 1.1 sb-client.py, 1.1, NONE sb-dbexpimp.py, 1.1, NONE sb-filter.py, 1.1, NONE sb-imapfilter.py, 1.1, NONE sb-mailsort.py, 1.1, NONE sb-mboxtrain.py, 1.1, NONE sb-notesfilter.py, 1.1, NONE sb-pop3dnd.py, 1.1, NONE sb-server.py, 1.1, NONE sb-smtpproxy.py, 1.1, NONE sb-unheader.py, 1.1, NONE sb-upload.py, 1.1, NONE sb-xmlrpcserver.py, 1.1, NONE

Tony Meyer anadelonbrin at users.sourceforge.net
Thu Sep 4 19:16:48 EDT 2003


Update of /cvsroot/spambayes/spambayes/scripts
In directory sc8-pr-cvs1:/tmp/cvs-serv14316/scripts

Added Files:
	sb_client.py sb_dbexpimp.py sb_filter.py sb_imapfilter.py 
	sb_mailsort.py sb_mboxtrain.py sb_notesfilter.py sb_pop3dnd.py 
	sb_server.py sb_smtpproxy.py sb_unheader.py sb_upload.py 
	sb_xmlrpcserver.py 
Removed Files:
	sb-client.py sb-dbexpimp.py sb-filter.py sb-imapfilter.py 
	sb-mailsort.py sb-mboxtrain.py sb-notesfilter.py sb-pop3dnd.py 
	sb-server.py sb-smtpproxy.py sb-unheader.py sb-upload.py 
	sb-xmlrpcserver.py 
Log Message:
Crap!  We can't use "sb-" as a prefix, because then we can't import the scripts.
I guess that all the importable code could be moved into modules, but that seems
like a huge hassle.  Let's use "sb_" as a prefix instead.

Apologies for cluttering the attic...sigh.
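
The import problem described in the log message can be reproduced with a short sketch. It is written for Python 3 (the scripts themselves target Python 2), and the helper names `try_import_statement` and `demo` are made up for illustration. It writes two throwaway modules into a temporary directory and checks which of them an ordinary `import` statement accepts.

```python
import os
import sys
import tempfile

def try_import_statement(module_name):
    """Return True if the statement 'import <module_name>' compiles and runs."""
    try:
        exec(compile("import %s" % module_name, "<demo>", "exec"), {})
        return True
    except SyntaxError:
        # "sb-client" is parsed as the expression "sb - client",
        # which is not a valid module name in an import statement.
        return False

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Two files with identical content, differing only in the separator.
        for filename in ("sb-client.py", "sb_client.py"):
            with open(os.path.join(tmp, filename), "w") as fp:
                fp.write("RPCBASE = 'http://localhost:65000'\n")
        sys.path.insert(0, tmp)
        try:
            hyphen_ok = try_import_statement("sb-client")
            underscore_ok = try_import_statement("sb_client")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("sb_client", None)
    return hyphen_ok, underscore_ok

print(demo())  # (False, True)
```

A string-based import such as `__import__("sb-client")` would sidestep the syntax restriction, but every importer would then need that workaround, which is presumably why renaming the files was the simpler fix.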

--- NEW FILE: sb_client.py ---
#! /usr/bin/env python

"""A client for hammiesrv.

Just feed it your mail on stdin, and it spits out the same message
with the spambayes score in a new X-Spambayes-Disposition header.

"""

import xmlrpclib
import sys

RPCBASE="http://localhost:65000"

def main():
    msg = sys.stdin.read()
    try:
        x = xmlrpclib.ServerProxy(RPCBASE)
        m = xmlrpclib.Binary(msg)
        out = x.filter(m)
        print out.data
    except:
        if __debug__:
            import traceback
            traceback.print_exc()
        print msg

if __name__ == "__main__":
    main()
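
The client above wraps the raw message in `xmlrpclib.Binary` so that arbitrary bytes survive the XML transport. A minimal sketch of what that wrapping does, assuming Python 3 (where the module is named `xmlrpc.client`) and using no server at all: the helpers `encode_filter_call` and `decode_call` are hypothetical and only marshal and unmarshal the request body.

```python
import xmlrpc.client

def encode_filter_call(raw_message):
    """Build the XML-RPC request body for a call to filter(Binary(raw_message))."""
    params = (xmlrpc.client.Binary(raw_message),)
    return xmlrpc.client.dumps(params, methodname="filter")

def decode_call(request_body):
    """Recover the method name and the raw bytes from a request body."""
    params, method = xmlrpc.client.loads(request_body)
    # With the default settings the parameter comes back as a Binary
    # object; its .data attribute holds the original bytes.
    return method, params[0].data

# A message containing a byte that is not valid UTF-8.
raw = b"Subject: hi\r\n\r\nbody \xff\n"
request = encode_filter_call(raw)
method, payload = decode_call(request)
print(method, payload == raw)  # filter True
```

The base64 encoding that `Binary` applies is what lets the `\xff` byte through unchanged; passing the message as a plain string would depend on it being representable in XML.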

--- NEW FILE: sb_dbexpimp.py ---
#! /usr/bin/env python

"""dbExpImp.py - Bayes database export/import

Classes:


Abstract:

    This utility has the primary function of exporting and importing
    a spambayes database into/from a flat file.  This is useful in a number
    of scenarios.
    
    Platform portability of database - flat files can be exported and
    imported across platforms (winduhs and linux, for example)
    
    Database implementation changes - databases can survive database
    implementation upgrades or new database implementations.  For example,
    if a dbm implementation changes between python x.y and python x.y+1...
    
    Database reorganization - an export followed by an import reorgs an
    existing database, <theoretically> improving performance, at least in 
    some database implementations
    
    Database sharing - it is possible to distribute particular databases
    for research purposes, database sharing purposes, or for new users to
    have a 'seed' database to start with.
    
    Database merging - multiple databases can be merged into one quite easily
    by specifying -m on an import.  This adds the nham and nspam counts of the
    two databases together (assuming the two databases do not share corpora)
    and, for words present in both, adds the spamcount and hamcount together.
    
    Spambayes software release migration - an export can be executed before
    a release upgrade, as part of the installation script.  Then, after the
    new software is installed, an import can be executed, which will
    effectively preserve existing training.  This eliminates the need for
    retraining every time a release is installed.
    
    Others?  I'm sure I haven't thought of everything...
    
Usage:
    dbExpImp [options]

        options:
            -e     : export
            -i     : import
            -v     : verbose mode (some additional diagnostic messages)
            -f FN  : flat file to export to or import from
            -d FN  : name of pickled database file to use
            -D FN  : name of dbm database file to use
            -m     : merge import into an existing database file.  This is
                     meaningful only for import. If omitted, a new database
                     file will be created.  If specified, the imported
                     wordinfo will be merged into an existing database.
                     Run dbExpImp -h for more information.
            -h     : help

Examples:

    Export pickled mybayes.db into mybayes.db.export as a csv flat file
        dbExpImp -e -d mybayes.db -f mybayes.db.export
        
    Import mybayes.db.export into a new DBM mybayes.db
        dbExpImp -i -D mybayes.db -f mybayes.db.export
       
    Export, then import (reorganize) new pickled mybayes.db
        dbExpImp -e -i -d mybayes.db -f mybayes.db.export
        
    Convert a bayes database from pickle to DBM
        dbExpImp -e -d abayes.db -f abayes.export
        dbExpImp -i -D abayes.db -f abayes.export
        
    Create a new database (newbayes.db) from two
        databases (abayes.db, bbayes.db)
        dbExpImp -e -d abayes.db -f abayes.export
        dbExpImp -e -d bbayes.db -f bbayes.export
        dbExpImp -i -d newbayes.db -f abayes.export
        dbExpImp -i -m -d newbayes.db -f bbayes.export

To Do:
    o Suggestions?

"""

# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

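# A __future__ import must come before any statement other than the
# module docstring, so it precedes __author__.
from __future__ import generators

__author__ = "Tim Stone <tim at fourstonesExpressions.com>"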

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0
    
import spambayes.storage
from spambayes.Options import options
import sys, os, getopt, errno, re
import urllib

def runExport(dbFN, useDBM, outFN):

    if useDBM:
        bayes = spambayes.storage.DBDictClassifier(dbFN)
        words = bayes.db.keys()
        words.remove(bayes.statekey)
    else:
        bayes = spambayes.storage.PickledClassifier(dbFN)
        words = bayes.wordinfo.keys()

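    try:
        fp = open(outFN, 'w')
    except IOError, e:
        if e.errno != errno.ENOENT:
            raise
        # The target path does not exist, so there is nothing to write
        # to; report it rather than fail later on an unbound fp.
        print >>sys.stderr, "Cannot open %s for writing: %s" % (outFN, e)
        return

    nham = bayes.nham
    nspam = bayes.nspam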
    
    print "Exporting database %s to file %s" % (dbFN, outFN)
    print "Database has %s ham, %s spam, and %s words" \
            % (nham, nspam, len(words))
    
    fp.write("%s,%s,\n" % (nham, nspam))
    
    for word in words:
        wi = bayes._wordinfoget(word)
        hamcount = wi.hamcount
        spamcount = wi.spamcount
        word = urllib.quote(word)
        fp.write("%s`%s`%s`\n" % (word, hamcount, spamcount))
        
    fp.close()

def runImport(dbFN, useDBM, newDBM, inFN):

    if newDBM:
        # Remove any existing database files; a missing file is fine.
        for fn in (dbFN, dbFN+".dat", dbFN+".dir"):
            try:
                os.unlink(fn)
            except OSError, e:
                if e.errno != errno.ENOENT:
                    raise
                
    if useDBM:
        bayes = spambayes.storage.DBDictClassifier(dbFN)
    else:
        bayes = spambayes.storage.PickledClassifier(dbFN)

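    try:
        fp = open(inFN, 'r')
    except IOError, e:
        if e.errno != errno.ENOENT:
            raise
        # Nothing to import from; report it rather than fail on an unbound fp.
        print >>sys.stderr, "Cannot open %s for reading: %s" % (inFN, e)
        return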
    
    nline = fp.readline()
    (nham, nspam, junk) = re.split(',', nline)
 
    if newDBM:
        bayes.nham = int(nham)
        bayes.nspam = int(nspam)
    else:
        bayes.nham += int(nham)
        bayes.nspam += int(nspam)
    
    if newDBM:
        impType = "Importing"
    else:
        impType = "Merging"
  
    print "%s database %s using file %s" % (impType, dbFN, inFN)

    lines = fp.readlines()
    
    for line in lines:
        (word, hamcount, spamcount, junk) = re.split('`', line)
        word = urllib.unquote(word)
       
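        # Use the same accessor as runExport, so that a dbm-backed
        # classifier is consulted as well as the in-memory cache.
        wi = bayes._wordinfoget(word)
        if wi is None:
            wi = bayes.WordInfoClass()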

        wi.hamcount += int(hamcount)
        wi.spamcount += int(spamcount)
               
        bayes._wordinfoset(word, wi)

    fp.close()

    print "Storing database, please be patient.  Even moderately large"
    print "databases may take a very long time to store."
    bayes.store()
    print "Finished storing database"
    
    if useDBM:
        words = bayes.db.keys()
        words.remove(bayes.statekey)
    else:
        words = bayes.wordinfo.keys()
        
    print "Database has %s ham, %s spam, and %s words" \
           % (bayes.nham, bayes.nspam, len(words))




if __name__ == '__main__':

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'iehmvd:D:f:')
    except getopt.error, msg:
        print >>sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit()

    usePickle = False
    useDBM = False
    newDBM = True
    dbFN = None
    flatFN = None
    exp = False
    imp = False

    for opt, arg in opts:
        if opt == '-h':
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == '-d':
            useDBM = False
            dbFN = arg
        elif opt == '-D':
            useDBM = True
            dbFN = arg
        elif opt == '-f':
            flatFN = arg
        elif opt == '-e':
            exp = True
        elif opt == '-i':
            imp = True
        elif opt == '-m':
            newDBM = False
        elif opt == '-v':
            options["globals", "verbose"] = True

    if (dbFN and flatFN):
        if exp:
            runExport(dbFN, useDBM, flatFN)
        if imp:
            runImport(dbFN, useDBM, newDBM, flatFN)
    else:
        print >>sys.stderr, __doc__
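
The flat-file layout that runExport writes and runImport reads (a `nham,nspam,` header line, then one URL-quoted word per line with backtick-separated counts) can be sketched independently of the classifier classes. The following is a Python 3 illustration under that reading of the code above; `dump_flat`, `load_flat` and `merge_flat` are hypothetical helpers, and plain dicts stand in for the real database.

```python
from urllib.parse import quote, unquote

def dump_flat(nham, nspam, counts):
    """Serialise counts as a header line plus one line per word."""
    lines = ["%s,%s," % (nham, nspam)]
    for word, (hamcount, spamcount) in sorted(counts.items()):
        # Quoting keeps backticks and newlines inside a word from
        # colliding with the field and record separators.
        lines.append("%s`%s`%s`" % (quote(word), hamcount, spamcount))
    return "\n".join(lines) + "\n"

def load_flat(text):
    """Parse the output of dump_flat back into (nham, nspam, counts)."""
    lines = text.splitlines()
    nham, nspam, _ = lines[0].split(",")
    counts = {}
    for line in lines[1:]:
        word, hamcount, spamcount, _ = line.split("`")
        counts[unquote(word)] = (int(hamcount), int(spamcount))
    return int(nham), int(nspam), counts

def merge_flat(first, second):
    """Add message totals and per-word counts, as a merge import does."""
    merged = dict(first[2])
    for word, (hamcount, spamcount) in second[2].items():
        old_ham, old_spam = merged.get(word, (0, 0))
        merged[word] = (old_ham + hamcount, old_spam + spamcount)
    return first[0] + second[0], first[1] + second[1], merged

a = (10, 20, {"free money": (0, 7), "meeting": (5, 0)})
b = (1, 2, {"free money": (1, 3), "tick`mark": (2, 2)})
text_a = dump_flat(*a)
text_b = dump_flat(*b)
merged = merge_flat(load_flat(text_a), load_flat(text_b))
print(text_b, end="")
print(merged)
```

Note that the summed totals are only meaningful if the two databases were trained on different messages, as the docstring above points out.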
--- NEW FILE: sb_filter.py ---
#!/usr/bin/env python

## A hammie front-end to make the simple stuff simple.
##
##
## The intent is to call this from procmail and its ilk like so:
##
##   :0 fw
##   | hammiefilter.py
##
## Then, you can set up your MUA to pipe ham and spam to it, one at a
## time, by calling it with either the -g or -s options, respectively.
##
## Author: Neale Pickett <neale at woozle.org>
##

"""Usage: %(program)s [OPTION]...

[OPTION] is one of:
    -h
        show usage and exit
    -x
        show some usage examples and exit
    -d DBFILE
        use database in DBFILE
    -D PICKLEFILE
        use pickle (instead of database) in PICKLEFILE
    -n
        create a new database
*+  -f
        filter (default if no processing options are given)
*+  -t
        [EXPERIMENTAL] filter and train based on the result (you must
        make sure to untrain all mistakes later)
*   -g
        [EXPERIMENTAL] (re)train as a good (ham) message
*   -s
        [EXPERIMENTAL] (re)train as a bad (spam) message
*   -G
        [EXPERIMENTAL] untrain ham (only use if you've already trained
        this message)
*   -S
        [EXPERIMENTAL] untrain spam (only use if you've already trained
        this message)

All options marked with '*' operate on stdin.  Only those processing options
marked with '+' send a modified message to stdout.
"""

import os
import sys
import getopt
from spambayes import hammie, Options, mboxutils

# See Options.py for explanations of these properties
program = sys.argv[0]

example_doc = """_Examples_

filter a message on disk:
    %(program)s < message

(re)train a message as ham:
    %(program)s -g < message

(re)train a message as spam:
    %(program)s -s < message


procmail recipe to filter and train in one step:
    :0 fw
    | %(program)s -t


mutt configuration.  This binds the 'H' key to retrain the message as
ham, and prompt for a folder to move it to.  The 'S' key retrains as
spam, and moves to a 'spam' folder.
    XXX: add this

"""

def examples():
    print example_doc % globals()
    sys.exit(0)

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

class HammieFilter(object):
    def __init__(self):
        options = Options.options
        # This is a bit of a hack to counter the default for
        # persistent_storage_file changing from ~/.hammiedb to hammie.db
        # This will work unless a user:
        #   * had hammie.db as their value for persistent_storage_file, and
        #   * their config file was loaded by Options.py.
        if options["Storage", "persistent_storage_file"] == \
           options.default("Storage", "persistent_storage_file"):
            options["Storage", "persistent_storage_file"] = \
                                    "~/.hammiedb"
        options.merge_files(['/etc/hammierc',
                            os.path.expanduser('~/.hammierc')])
        self.dbname = options["Storage", "persistent_storage_file"]
        self.dbname = os.path.expanduser(self.dbname)
        self.usedb = options["Storage", "persistent_use_database"]

    def newdb(self):
        h = hammie.open(self.dbname, self.usedb, 'n')
        h.store()
        print >> sys.stderr, "Created new database in", self.dbname

    def filter(self, msg):
        h = hammie.open(self.dbname, self.usedb, 'r')
        return h.filter(msg)

    def filter_train(self, msg):
        h = hammie.open(self.dbname, self.usedb, 'c')
        result = h.filter(msg, train=True)
        # Persist the training, as the other training methods do.
        h.store()
        return result

    def train_ham(self, msg):
        h = hammie.open(self.dbname, self.usedb, 'c')
        h.train_ham(msg, True)
        h.store()

    def train_spam(self, msg):
        h = hammie.open(self.dbname, self.usedb, 'c')
        h.train_spam(msg, True)
        h.store()

    def untrain_ham(self, msg):
        h = hammie.open(self.dbname, self.usedb, 'c')
        h.untrain_ham(msg)
        h.store()

    def untrain_spam(self, msg):
        h = hammie.open(self.dbname, self.usedb, 'c')
        h.untrain_spam(msg)
        h.store()

def main():
    h = HammieFilter()
    actions = []
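    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hxd:D:nfgstGS',
                                   ['help', 'examples'])
    except getopt.error, msg:
        # Report unknown options with the usage text instead of a traceback.
        usage(2, msg)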
    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage(0)
        elif opt in ('-x', '--examples'):
            examples()
        elif opt == '-d':
            h.usedb = True
            h.dbname = arg
        elif opt == '-D':
            h.usedb = False
            h.dbname = arg
        elif opt == '-f':
            actions.append(h.filter)
        elif opt == '-g':
            actions.append(h.train_ham)
        elif opt == '-s':
            actions.append(h.train_spam)
        elif opt == '-t':
            actions.append(h.filter_train)
        elif opt == '-G':
            actions.append(h.untrain_ham)
        elif opt == '-S':
            actions.append(h.untrain_spam)
        elif opt == "-n":
            h.newdb()
            sys.exit(0)

    if not actions:
        actions = [h.filter]

    msg = mboxutils.get_message(sys.stdin)
    for action in actions:
        action(msg)
    sys.stdout.write(msg.as_string(unixfrom=(msg.get_unixfrom() is not None)))

if __name__ == "__main__":
    main()
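
The main() above collects one bound method per processing option and then applies each of them, in command-line order, to the single message read from stdin. A minimal Python 3 sketch of that dispatch pattern follows. `RecordingFilter`, `select_actions` and `run` are hypothetical names, only three of the options are modelled, and a lookup table replaces the if/elif chain purely for brevity.

```python
import getopt

class RecordingFilter:
    """Stand-in for HammieFilter that records calls instead of using a database."""

    def __init__(self):
        self.calls = []

    def filter(self, msg):
        self.calls.append(("filter", msg))

    def train_ham(self, msg):
        self.calls.append(("train_ham", msg))

    def train_spam(self, msg):
        self.calls.append(("train_spam", msg))

def select_actions(h, argv):
    """Map each processing option to a bound method, keeping argv order."""
    opts, args = getopt.getopt(argv, "fgs")
    table = {"-f": h.filter, "-g": h.train_ham, "-s": h.train_spam}
    actions = [table[opt] for opt, arg in opts]
    # Filtering is the default when no processing option is given.
    return actions or [h.filter]

def run(h, argv, msg):
    """Apply every selected action to the same message."""
    for action in select_actions(h, argv):
        action(msg)
    return h.calls

print(run(RecordingFilter(), ["-g", "-f"], "message"))
print(run(RecordingFilter(), [], "message"))
```

Because the actions are applied in the order given, `-g -f` trains first and then filters, while `-f -g` would score the message before the training takes effect.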

--- NEW FILE: sb_imapfilter.py ---
#!/usr/bin/env python

"""An IMAP filter.  An IMAP message box is scanned and all non-scored
messages are scored and (where necessary) filtered.

The original filter design owed much to isbg by Roger Binns
(http://www.rogerbinns.com/isbg).

Usage:
    imapfilter [options]

        note: option values with spaces in them must be enclosed
              in double quotes

        options:
            -d  dbname  : pickled training database filename
            -D  dbname  : dbm training database filename
            -t          : train contents of spam folder and ham folder
            -c          : classify inbox
            -h          : help
            -v          : verbose mode
            -p          : security option to prompt for imap password,
                          rather than look in options["imap", "password"]
            -e y/n      : expunge/purge messages on exit (y) or not (n)
            -i debuglvl : a somewhat mysterious imaplib debugging level
            -l minutes  : period of time between filtering operations
            -b          : Launch a web browser showing the user interface.

Examples:

    Classify inbox, with dbm database
        imapfilter -c -D bayes.db
        
    Train Spam and Ham, then classify inbox, with dbm database
        imapfilter -t -c -D bayes.db

    Train Spam and Ham only, with pickled database
        imapfilter -t -d bayes.db

Warnings:
    o This is alpha software!  The filter is currently being developed and
      tested.  We do *not* recommend using it on a production system unless
      you are confident that you can get your mail back if you lose it.  On
      the other hand, we do recommend that you test it for us and let us
      know if anything does go wrong.
    o By default, the filter does *not* delete, modify or move any of your
      mail.  Due to quirks in how imap works, new versions of your mail are
      modified and placed in new folders, but the originals are still
      available.  These are flagged with the \\Deleted flag so that you know
      that they can be removed.  Your mailer may not show these messages
      by default, but there should be an option to do so.  *However*, if
      your mailer automatically purges/expunges (i.e. permanently deletes)
      mail flagged as such, *or* if you set the imap_expunge option to
      True, then this mail will be irretrievably lost.
    
To Do:
    o IMAPMessage and IMAPFolder currently carry out very simple checks
      of responses received from IMAP commands, but if the response is not
      "OK", then the filter terminates.  Handling of these errors could be
      much nicer.
    o IMAP over SSL is untested.
    o Develop a test script, like spambayes/test/test_pop3proxy.py that
      runs through some tests (perhaps with a *real* imap server, rather
      than a dummy one).  This would make it easier to carry out the tests
      against each server whenever a change is made.
    o IMAP supports authentication via other methods than the plain-text
      password method that we are using at the moment.  Neither of the
      servers I have access to offer any alternative method, however.  If
      someone's does, then it would be nice to offer this.
    o Usernames should be able to be literals as well as quoted strings.
      This might help if the username/password has special characters like
      accented characters.
    o Suggestions?
"""

# This module is part of the spambayes project, which is Copyright 2002-3
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

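# A __future__ import must come before any statement other than the
# module docstring, so it precedes __author__ and __credits__.
from __future__ import generators

__author__ = "Tony Meyer <ta-meyer at ihug.co.nz>, Tim Stone"
__credits__ = "All the Spambayes folk."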

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0

import socket
import os
import re
import time
import sys
import getopt
import types
import email
import email.Parser
from getpass import getpass
from email.Utils import parsedate

from spambayes.Options import options
from spambayes import tokenizer, storage, message, Dibbler
from spambayes.UserInterface import UserInterfaceServer
from spambayes.ImapUI import IMAPUserInterface
from spambayes.Version import get_version_string

from imaplib import IMAP4
from imaplib import Time2Internaldate
try:
    if options["imap", "use_ssl"]:
        from imaplib import IMAP4_SSL as BaseIMAP
    else:
        from imaplib import IMAP4 as BaseIMAP
except ImportError:
    from imaplib import IMAP4 as BaseIMAP

# global IMAPlib object
imap = None

# A flag can have any character in the ascii range 32-126
# except for (){ %*"\
FLAG_CHARS = ""
for i in range(32, 127):
    if chr(i) not in ['(', ')', '{', ' ', '%', '*', '"', '\\']:
        FLAG_CHARS += chr(i)
FLAG = r"\\?[" + re.escape(FLAG_CHARS) + r"]+"
# The empty flag set "()" doesn't match, so that extract returns
# data["FLAGS"] == None
FLAGS_RE = re.compile(r"(FLAGS) (\((" + FLAG + r" )*(" + FLAG + r")\))")
INTERNALDATE_RE = re.compile(r"(INTERNALDATE) (\"\d{1,2}\-[A-Za-z]{3,3}\-" +
                             r"\d{2,4} \d{2,2}\:\d{2,2}\:\d{2,2} " +
                             r"[\+\-]\d{4,4}\")")
RFC822_RE = re.compile(r"(RFC822) (\{[\d]+\})")
RFC822_HEADER_RE = re.compile(r"(RFC822.HEADER) (\{[\d]+\})")
UID_RE = re.compile(r"(UID) ([\d]+)")
FETCH_RESPONSE_RE = re.compile(r"([0-9]+) \(([" + \
                               re.escape(FLAG_CHARS) + r"\"\{\}\(\)\\ ]*)\)?")
LITERAL_RE = re.compile(r"^\{[\d]+\}$")

def _extract_fetch_data(response):
    '''Extract data from the response given to an IMAP FETCH command.'''
    # Response might be a tuple containing literal data
    # At the moment, we only handle one literal per response.  This
    # may need to be improved if our code ever asks for something
    # more complicated (like RFC822.Header and RFC822.Body)
    if type(response) == types.TupleType:
        literal = response[1]
        response = response[0]
    else:
        literal = None
    # the first item will always be the message number
    mo = FETCH_RESPONSE_RE.match(response)
    data = {}
    if mo is None:
        print """IMAP server gave strange fetch response.  Please
        report this as a bug."""
        print response
    else:
        data["message_number"] = mo.group(1)
        response = mo.group(2)
    # We support the following FETCH items:
    #  FLAGS
    #  INTERNALDATE
    #  RFC822
    #  UID
    #  RFC822.HEADER
    # All others are ignored.
    for r in [FLAGS_RE, INTERNALDATE_RE, RFC822_RE, UID_RE,
              RFC822_HEADER_RE]:
        mo = r.search(response)
        if mo is not None:
            if LITERAL_RE.match(mo.group(2)):
                data[mo.group(1)] = literal
            else:
                data[mo.group(1)] = mo.group(2)
    return data

class IMAPSession(BaseIMAP):
    '''A class extending the IMAP4 class, with a few optimizations'''
    
    def __init__(self, server, port, debug=0, do_expunge=False):
        try:
            BaseIMAP.__init__(self, server, port)
        except:
            # A more specific except would be good here, but I get
            # (in Python 2.2) a generic 'error' and a 'gaierror'
            # if I pass a valid domain that isn't an IMAP server
            # or invalid domain (respectively)
            print "Invalid server or port, please check these settings."
            sys.exit(-1)
        self.debug = debug
        # For efficiency, we remember which folder we are currently
        # in, and only send a select command to the IMAP server if
        # we want to *change* folders.  This attribute is used by
        # both IMAPMessage and IMAPFolder (via SelectFolder).
        self.current_folder = None
        self.do_expunge = do_expunge

    def login(self, username, pwd):
        try:
            BaseIMAP.login(self, username, pwd)  # superclass login
        except BaseIMAP.error, e:
            if str(e) == "permission denied":
                print "There was an error logging in to the IMAP server."
                print "The userid and/or password may be incorrect."
                sys.exit()
            else:
                raise
    
    def logout(self):
        # sign off
        if self.do_expunge:
            self.expunge()
        BaseIMAP.logout(self)  # superclass logout
        
    def SelectFolder(self, folder):
        '''A method to point ensuing imap operations at a target folder'''
        if self.current_folder != folder:
            if self.current_folder is not None:
                if self.do_expunge:
                    # It is faster to do close() than a single
                    # expunge when we log out (because expunge returns
                    # a list of all the deleted messages, that we don't do
                    # anything with)
                    self.close()
            # We *always* use SELECT and not EXAMINE, because this
            # speeds things up considerably.
            response = self.select(folder, None)
            if response[0] != "OK":
                print "Invalid response to select %s:\n%s" % (folder,
                                                              response)
                sys.exit(-1)
            self.current_folder = folder
            return response

    def folder_list(self):
        '''Return an alphabetical list of all folders available on the
        server'''
        response = self.list()
        if response[0] != "OK":
            return []
        all_folders = response[1]
        folders = []
        for fol in all_folders:
            # Sigh.  Some servers may give us back the folder name as a
            # literal, so we need to crunch this out.
            if isinstance(fol, types.TupleType):
                r = re.compile(r"{\d+}")
                m = r.search(fol[0])
                if not m:
                    # Something is wrong here!  Skip this folder
                    continue
                fol = '%s"%s"' % (fol[0][:m.start()], fol[1])
            r = re.compile(r"\(([\w\\ ]*)\) ")
            m = r.search(fol)
            if not m:
                # Something is not good with this folder, so skip it.
                continue
            name_attributes = fol[:m.end()-1]
            # IMAP is a truly odd protocol.  The delimiter is
            # only the delimiter for this particular folder - each
            # folder *may* have a different delimiter
            self.folder_delimiter = fol[m.end()+1:m.end()+2]
            # a bit of a hack, but we really need to know if this is
            # the case
            if self.folder_delimiter == ',':
                print """WARNING: Your imap server uses commas as the folder
                delimiter.  This may cause unpredictable errors."""
            folders.append(fol[m.end()+5:-1])
        folders.sort()
        return folders

    def FindMessage(self, id):
        '''A (potentially very expensive) method to find a message with
        a given spambayes id (header), and return a message object (no
        substance).'''
        # If efficiency becomes a concern, what we could do is store a
        # dict of key-to-folder, and look in that folder first.  (It might
        # have moved independently of us, so we would still have to search
        # if we didn't find it).  For the moment, we do a search through
        # all folders, alphabetically.
        for folder_name in self.folder_list():
            fol = IMAPFolder(folder_name)
            for msg in fol:
                if msg.id == id:
                    return msg
        return None

class IMAPMessage(message.SBHeaderMessage):
    def __init__(self):
        message.Message.__init__(self)
        self.folder = None
        self.previous_folder = None
        self.rfc822_command = "RFC822.PEEK"
        self.got_substance = False

    def setFolder(self, folder):
        self.folder = folder

    def _check(self, response, command):
        if response[0] != "OK":
            print "Invalid response to %s:\n%s" % (command, response)
            sys.exit(-1)

    def extractTime(self):
        # When we create a new copy of a message, we need to specify
        # a timestamp for the message.  If the message has a valid date
        # header we use that.  Otherwise, we use the current time.
        message_date = self["Date"]
        if message_date is not None:
            parsed_date = parsedate(message_date)
            if parsed_date is not None:
                return Time2Internaldate(time.mktime(parsed_date))
        # Either there is no Date header or it could not be parsed.
        return Time2Internaldate(time.time())

    def get_substance(self):
        '''Retrieve the RFC822 message from the IMAP server and set as the
        substance of this message.'''
        if self.got_substance:
            return
        if not self.uid or not self.id:
            print "Cannot get substance of message without an id and a UID"
            return
        imap.SelectFolder(self.folder.name)
        # We really want to use RFC822.PEEK here, as that doesn't affect
        # the status of the message.  Unfortunately, it appears that not
        # all IMAP servers support this, even though it is in RFC1730
        # Actually, it's not: we should be using BODY.PEEK
        try:
            response = imap.uid("FETCH", self.uid, self.rfc822_command)
        except IMAP4.error:
            self.rfc822_command = "RFC822"
            response = imap.uid("FETCH", self.uid, self.rfc822_command)
        if response[0] != "OK":
            self.rfc822_command = "RFC822"
            response = imap.uid("FETCH", self.uid, self.rfc822_command)
        self._check(response, "uid fetch")
        data = _extract_fetch_data(response[1][0])
        # Annoyingly, we can't just pass over the RFC822 message to an
        # existing message object (like self) and have it parse it. So
        # we go through the hoops of creating a new message, and then
        # copying over all its internals.
        new_msg = email.Parser.Parser().parsestr(data["RFC822"])
        self._headers = new_msg._headers
        self._unixfrom = new_msg._unixfrom
        self._payload = new_msg._payload
        self._charset = new_msg._charset
        self.preamble = new_msg.preamble
        self.epilogue = new_msg.epilogue
        self._default_type = new_msg._default_type
        if not self.has_key(options["Headers", "mailid_header_name"]):
            self[options["Headers", "mailid_header_name"]] = self.id
        self.got_substance = True
        if options["globals", "verbose"]:
            sys.stdout.write(chr(8) + "*")

    def MoveTo(self, dest):
        '''Note that message should move to another folder.  No move is
        carried out until Save() is called, for efficiency.'''
        if self.previous_folder is None:
            self.previous_folder = self.folder
        self.folder = dest

    def Save(self):
        '''Save message to imap server.'''
        # we can't actually update the message with IMAP
        # so what we do is create a new message and delete the old one
        if self.folder is None:
            raise RuntimeError, """Can't save a message that doesn't
            have a folder."""
        if not self.id:
            raise RuntimeError, """Can't save a message that doesn't have
            an id."""
        response = imap.uid("FETCH", self.uid, "(FLAGS INTERNALDATE)")
        self._check(response, 'fetch (flags internaldate)')
        data = _extract_fetch_data(response[1][0])
        if data.has_key("INTERNALDATE"):
            msg_time = data["INTERNALDATE"]
        else:
            msg_time = self.extractTime()
        if data.has_key("FLAGS"):
            flags = data["FLAGS"]
            # The \Recent flag can be fetched, but cannot be stored
            # We must remove it from the list if it is there.
            flags = re.sub(r"\\Recent ?|\\ ?Recent", "", flags)
        else:
            flags = None

        response = imap.append(self.folder.name, flags,
                               msg_time, self.as_string())
        if response[0] == "NO":
            # This may be because we have tried to set an invalid flag.
            # Try again, losing all the flag information, but warn the
            # user that this has happened.
            response = imap.append(self.folder.name, None, msg_time,
                                   self.as_string())
            if response[0] == "OK":
                print "WARNING: Could not append flags: %s" % (flags,)
        self._check(response, 'append')

        if self.previous_folder is None:
            imap.SelectFolder(self.folder.name)
        else:
            imap.SelectFolder(self.previous_folder.name)
            self.previous_folder = None
        response = imap.uid("STORE", self.uid, "+FLAGS.SILENT", "(\\Deleted)")
        self._check(response, 'store')

        # We need to update the uid, as it will have changed.
        # Although we don't use the UID to keep track of messages, we do
        # have to use it for IMAP operations.
        imap.SelectFolder(self.folder.name)
        response = imap.uid("SEARCH", "(UNDELETED HEADER " + \
                            options["Headers", "mailid_header_name"] + \
                            " " + self.id + ")")
        self._check(response, 'search')
        new_id = response[1][0]
        # Let's hope it doesn't, but, just in case, if the search
        # turns up empty, we make the assumption that the new
        # message is the last one with a recent flag
        if new_id == "":
            response = imap.uid("SEARCH", "RECENT")
            new_id = response[1][0]
            if new_id.find(' ') > -1:
                ids = new_id.split(' ')
                new_id = ids[-1]
            # Ok, now we're in trouble if we still haven't found it.
            # We make a huge assumption that the new message is the one
            # with the highest UID (they are sequential, so this will be
            # ok as long as another message hasn't also arrived)
            if new_id == "":
                response = imap.uid("SEARCH", "ALL")
                new_id = response[1][0]
                if new_id.find(' ') > -1:
                    ids = new_id.split(' ')
                    new_id = ids[-1]
        self.uid = new_id

# This performs a similar function to email.message_from_string()
def imapmessage_from_string(s, _class=IMAPMessage, strict=False):
    return email.message_from_string(s, _class, strict)


class IMAPFolder(object):
    def __init__(self, folder_name):
        self.name = folder_name
        # Unique names for cached messages - see _generate_id below.
        self.lastBaseMessageName = ''
        self.uniquifier = 2

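    def __cmp__(self, obj):
        '''Two folders are equal if their names are equal'''
        if obj is None:
            # A folder never equals None.  __cmp__ must return a nonzero
            # value here, because zero (False) means "equal".
            return 1
        return cmp(self.name, obj.name)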

    def _check(self, response, command):
        if response[0] != "OK":
            print "Invalid response to %s:\n%s" % (command, response)
            sys.exit(-1)

    def __iter__(self):
        '''IMAPFolder is iterable'''
        for key in self.keys():
            try:
                yield self[key]
            except KeyError:
                pass

    def recent_uids(self):
        '''Returns uids for all the messages in the folder that
        are flagged as recent, but not flagged as deleted.'''
        imap.SelectFolder(self.name, True)
        response = imap.uid("SEARCH", "RECENT UNDELETED")
        self._check(response, "SEARCH RECENT UNDELETED")
        # An empty search result must give an empty list, as in keys().
        if response[1][0] == "":
            return []
        return response[1][0].split(' ')

    def keys(self):
        '''Returns *uids* for all the messages in the folder not
        marked as deleted.'''
        imap.SelectFolder(self.name)
        response = imap.uid("SEARCH", "UNDELETED")
        self._check(response, "SEARCH UNDELETED")
        if response[1][0] == "":
            return []
        return response[1][0].split(' ')

    def __getitem__(self, key):
        '''Return message (no substance) matching the given *uid*.'''
        # We don't retrieve the substances of the message here - you need
        # to call msg.get_substance() to do that.
        imap.SelectFolder(self.name)
        # Using RFC822.HEADER.LINES would be better here, but it seems
        # that not all servers accept it, even though it is in the RFC
        response = imap.uid("FETCH", key, "RFC822.HEADER")
        self._check(response, "uid fetch header")
        data = _extract_fetch_data(response[1][0])

        msg = IMAPMessage()
        msg.setFolder(self)
        msg.uid = key
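        # The id is a timestamp, optionally followed by "-" and a
        # uniquifier of one or more digits (see _generate_id below).
        r = re.compile(re.escape(options["Headers",
                                         "mailid_header_name"]) + \
                       r":\s*(\d+(-\d+)?)")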
        mo = r.search(data["RFC822.HEADER"])
        if mo is None:
            msg.setId(self._generate_id())
            # Unfortunately, we now have to re-save this message, so that
            # our id is stored on the IMAP server.  Before anyone suggests
            # it, we can't store it as a flag, because user-defined flags
            # aren't supported by all IMAP servers.
            # This will need to be done once per message.
            msg.get_substance()
            msg.Save()
        else:
            msg.setId(mo.group(1))

        if options["globals", "verbose"]:
            sys.stdout.write(".")
        return msg

    # Lifted straight from pop3proxy.py (under the name getNewMessageName)
    def _generate_id(self):
        # The message id is the time it arrived, with a uniquifier
        # appended if two arrive within one clock tick of each other.
        messageName = "%10.10d" % long(time.time())
        if messageName == self.lastBaseMessageName:
            messageName = "%s-%d" % (messageName, self.uniquifier)
            self.uniquifier += 1
        else:
            self.lastBaseMessageName = messageName
            self.uniquifier = 2
        return messageName

    def Train(self, classifier, isSpam):
        '''Train folder as spam/ham'''
        num_trained = 0
        for msg in self:
            if msg.GetTrained() == (not isSpam):
                msg.get_substance()
                msg.delSBHeaders()
                classifier.unlearn(msg.asTokens(), not isSpam)
                # Once the message has been untrained, its training memory
                # should reflect that, in case the training that follows
                # breaks, which happens on occasion (the tokenizer is not
                # yet perfect).
                msg.RememberTrained(None)

            if msg.GetTrained() is None:
                msg.get_substance()
                msg.delSBHeaders()
                classifier.learn(msg.asTokens(), isSpam)
                num_trained += 1
                msg.RememberTrained(isSpam)
                if isSpam:
                    move_opt_name = "move_trained_spam_to_folder"
                else:
                    move_opt_name = "move_trained_ham_to_folder"
                if options["imap", move_opt_name] != "":
                    msg.MoveTo(IMAPFolder(options["imap",
                                                  move_opt_name]))
                    msg.Save()
        return num_trained

    def Filter(self, classifier, spamfolder, unsurefolder):
        count = {}
        count["ham"] = 0
        count["spam"] = 0
        count["unsure"] = 0
        for msg in self:
            if msg.GetClassification() is None:
                msg.get_substance()
                (prob, clues) = classifier.spamprob(msg.asTokens(),
                                                    evidence=True)
                # add headers and remember classification
                msg.addSBHeaders(prob, clues)

                cls = msg.GetClassification()
                if cls == options["Hammie", "header_ham_string"]:
                    # we leave ham alone
                    count["ham"] += 1
                elif cls == options["Hammie", "header_spam_string"]:
                    msg.MoveTo(spamfolder)
                    count["spam"] += 1
                else:
                    msg.MoveTo(unsurefolder)
                    count["unsure"] += 1
                msg.Save()
        return count


class IMAPFilter(object):
    def __init__(self, classifier):
        self.spam_folder = IMAPFolder(options["imap", "spam_folder"])
        self.unsure_folder = IMAPFolder(options["imap", "unsure_folder"])
        self.classifier = classifier
        
    def Train(self):
        if options["globals", "verbose"]:
            t = time.time()
            
        total_ham_trained = 0
        total_spam_trained = 0

        if options["imap", "ham_train_folders"] != "":
            ham_training_folders = options["imap", "ham_train_folders"]
            for fol in ham_training_folders:
                # Select the folder to make sure it exists
                imap.SelectFolder(fol)
                if options['globals', 'verbose']:
                    print "   Training ham folder %s" % (fol)
                folder = IMAPFolder(fol)
                num_ham_trained = folder.Train(self.classifier, False)
                total_ham_trained += num_ham_trained
                if options['globals', 'verbose']:
                    print "       %s trained." % (num_ham_trained)

        if options["imap", "spam_train_folders"] != "":
            spam_training_folders = options["imap", "spam_train_folders"]
            for fol in spam_training_folders:
                # Select the folder to make sure it exists
                imap.SelectFolder(fol)
                if options['globals', 'verbose']:
                    print "   Training spam folder %s" % (fol)
                folder = IMAPFolder(fol)
                num_spam_trained = folder.Train(self.classifier, True)
                total_spam_trained += num_spam_trained
                if options['globals', 'verbose']:
                    print "       %s trained." % (num_spam_trained)

        if total_ham_trained or total_spam_trained:
            self.classifier.store()
        
        if options["globals", "verbose"]:
            print "Training took %s seconds, %s messages were trained" \
                  % (time.time() - t, total_ham_trained + total_spam_trained)

    def Filter(self):
        if options["globals", "verbose"]:
            t = time.time()
        # Running totals across every filtered folder, not just the last.
        count = {"ham": 0, "spam": 0, "unsure": 0}

        # Select the spam folder and unsure folder to make sure they exist
        imap.SelectFolder(self.spam_folder.name)
        imap.SelectFolder(self.unsure_folder.name)

        for filter_folder in options["imap", "filter_folders"]:
            # Select the folder to make sure it exists
            imap.SelectFolder(filter_folder)
            folder = IMAPFolder(filter_folder)
            folder_count = folder.Filter(self.classifier, self.spam_folder,
                                         self.unsure_folder)
            for key in count.keys():
                count[key] += folder_count.get(key, 0)

        if options["globals", "verbose"]:
            print "\nClassified %s ham, %s spam, and %s unsure." % \
                  (count["ham"], count["spam"], count["unsure"])
            print "Classifying took", time.time() - t, "seconds."

 
def run():
    global imap
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hbtcvpl:e:i:d:D:')
    except getopt.error, msg:
        print >>sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit()

    bdbname = options["Storage", "persistent_storage_file"]
    useDBM = options["Storage", "persistent_use_database"]
    doTrain = False
    doClassify = False
    doExpunge = options["imap", "expunge"]
    imapDebug = 0
    sleepTime = 0
    promptForPass = False
    launchUI = False
    server = ""
    username = ""

    for opt, arg in opts:
        if opt == '-h':
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == '-d':
            useDBM = False
            bdbname = arg
        elif opt == '-D':
            useDBM = True
            bdbname = arg
        elif opt == "-b":
            launchUI = True
        elif opt == '-t':
            doTrain = True
        elif opt == '-p':
            promptForPass = True
        elif opt == '-c':
            doClassify = True
        elif opt == '-v':
            options["globals", "verbose"] = True
        elif opt == '-e':
            if arg == 'y':
                doExpunge = True
            else:
                doExpunge = False
        elif opt == '-i':
            imapDebug = int(arg)
        elif opt == '-l':
            sleepTime = int(arg) * 60

    # Let the user know what they are using...
    print get_version_string("IMAP Filter")
    print "and engine %s.\n" % (get_version_string(),)

    if not (doClassify or doTrain or launchUI):
        print "-b, -c, or -t operands must be specified."
        print "Please use the -h operand for help."
        sys.exit()

    if (launchUI and (doClassify or doTrain)):
        print """
-b option is exclusive with -c and -t options.
The user interface will be launched, but no classification
or training will be performed."""

    bdbname = os.path.expanduser(bdbname)
    
    if options["globals", "verbose"]:
        print "Loading database %s..." % (bdbname),

    classifier = storage.open_storage(bdbname, useDBM)

    if options["globals", "verbose"]:
        print "Done."            

    if options["imap", "server"]:
        # The options class is ahead of us here:
        #   it knows that imap:server will eventually be able to have
        #   multiple values, but for the moment, we just use the first one
        server = options["imap", "server"]
        if len(server) > 0:
            server = server[0]
        username = options["imap", "username"]
        if len(username) > 0:
            username = username[0]
        if not promptForPass:
            pwd = options["imap", "password"]
            if len(pwd) > 0:
                pwd = pwd[0]
    else:
        pwd = None
        if not launchUI:
            print "You need to specify both a server and a username."
            sys.exit()

    if promptForPass:
        pwd = getpass()

    if server.find(':') > -1:
        server, port = server.split(':', 1)
        port = int(port)
    else:
        if options["imap", "use_ssl"]:
            port = 993
        else:
            port = 143

    imap_filter = IMAPFilter(classifier)

    # Web interface
    if launchUI:
        if server != "":
            imap = IMAPSession(server, port, imapDebug, doExpunge)
        httpServer = UserInterfaceServer(options["html_ui", "port"])
        httpServer.register(IMAPUserInterface(classifier, imap, pwd))
        Dibbler.run(launchBrowser=launchUI)
    else:
        while True:
            imap = IMAPSession(server, port, imapDebug, doExpunge)
            imap.login(username, pwd)

            if doTrain:
                if options["globals", "verbose"]:
                    print "Training"
                imap_filter.Train()
            if doClassify:
                if options["globals", "verbose"]:
                    print "Classifying"
                imap_filter.Filter()

            imap.logout()
            
            if sleepTime:
                time.sleep(sleepTime)
            else:
                break

if __name__ == '__main__':
    run()

--- NEW FILE: sb_mailsort.py ---
#! /usr/bin/env python
"""\
To train:
    %(program)s -t ham.mbox spam.mbox

To filter mail (using .forward or .qmail):
    |%(program)s Maildir/ Mail/Spam/

To print the score and top evidence for a message or messages:
    %(program)s -s message [message ...]
"""

SPAM_CUTOFF = 0.57

SIZE_LIMIT = 5000000 # messages larger are not analyzed
BLOCK_SIZE = 10000
RC_DIR = "~/.spambayes"
DB_FILE = RC_DIR + "/wordprobs.cdb"
CONFIG_FILE = RC_DIR + "/bayescustomize.ini"

import sys
import os
import getopt
import email
import time
import signal
import socket
import errno

DB_FILE = os.path.expanduser(DB_FILE)

def import_spambayes():
    global mboxutils, CdbClassifier, tokenize
    if not os.environ.has_key('BAYESCUSTOMIZE'):
        os.environ['BAYESCUSTOMIZE'] = os.path.expanduser(CONFIG_FILE)
    from spambayes import mboxutils
    from spambayes.cdb_classifier import CdbClassifier
    from spambayes.tokenizer import tokenize


try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0


program = sys.argv[0] # For usage(); referenced by docstring above

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def maketmp(dir):
    hostname = socket.gethostname()
    pid = os.getpid()
    fd = -1
    for x in xrange(200):
        filename = "%d.%d.%s" % (time.time(), pid, hostname)
        pathname = "%s/tmp/%s" % (dir, filename)
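        try:
            fd = os.open(pathname, os.O_WRONLY|os.O_CREAT|os.O_EXCL, 0600)
        except OSError, exc:
            # os.open() raises OSError.  Retry only if the call was
            # interrupted or the name already exists.
            if exc.errno not in (errno.EINTR, errno.EEXIST):
                raise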
        else:
            break
        time.sleep(2)
    if fd == -1:
        raise SystemExit, "could not create a mail file"
    return (os.fdopen(fd, "wb"), pathname, filename)

def train(bayes, msgs, is_spam):
    """Train bayes with all messages from a mailbox."""
    mbox = mboxutils.getmbox(msgs)
    for msg in mbox:
        bayes.learn(tokenize(msg), is_spam)

def train_messages(ham_name, spam_name):
    """Create database using messages."""

    rc_dir = os.path.expanduser(RC_DIR)
    if not os.path.exists(rc_dir):
        print "Creating", RC_DIR, "directory..."
        os.mkdir(rc_dir)
    bayes = CdbClassifier()
    print 'Training with ham...'
    train(bayes, ham_name, False)
    print 'Training with spam...'
    train(bayes, spam_name, True)
    print 'Updating probabilities and writing DB...'
    db = open(DB_FILE, "wb")
    bayes.save_wordinfo(db)
    db.close()
    print 'done'

def filter_message(hamdir, spamdir):
    # Signal handlers are called with two arguments: (signum, frame).
    signal.signal(signal.SIGALRM, lambda signum, frame: sys.exit(1))
    signal.alarm(24 * 60 * 60)

    # write message to temporary file (must be on same partition)
    tmpfile, pathname, filename = maketmp(hamdir)
    try:
        tmpfile.write(os.environ.get("DTLINE", "")) # delivered-to line
        bytes = 0
        blocks = []
        while 1:
            block = sys.stdin.read(BLOCK_SIZE)
            if not block:
                break
            bytes += len(block)
            if bytes < SIZE_LIMIT:
                blocks.append(block)
            tmpfile.write(block)
        tmpfile.close()

        if bytes < SIZE_LIMIT:
            msgdata = ''.join(blocks)
            del blocks
            msg = email.message_from_string(msgdata)
            del msgdata
            bayes = CdbClassifier(open(DB_FILE, 'rb'))
            prob = bayes.spamprob(tokenize(msg))
        else:
            prob = 0.0

        if prob > SPAM_CUTOFF:
            os.rename(pathname, "%s/new/%s" % (spamdir, filename))
        else:
            os.rename(pathname, "%s/new/%s" % (hamdir, filename))
    except:
        os.unlink(pathname)
        raise

def print_message_score(msg_name, msg_fp):
    msg = email.message_from_file(msg_fp)
    bayes = CdbClassifier(open(DB_FILE, 'rb'))
    prob, evidence = bayes.spamprob(tokenize(msg), evidence=True)
    print msg_name, prob
    for word, prob in evidence:
        print '  ', `word`, prob

def main():
    global DB_FILE, CONFIG_FILE

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'tsd:c:')
    except getopt.error, msg:
        usage(2, msg)

    mode = 'sort'
    for opt, val in opts:
        if opt == '-t':
            mode = 'train'
        elif opt == '-s':
            mode = 'score'
        elif opt == '-d':
            DB_FILE = val
        elif opt == '-c':
            CONFIG_FILE = val
        else:
            assert 0, 'invalid option'

    import_spambayes()

    if mode == 'sort':
        if len(args) != 2:
            usage(2, 'wrong number of arguments')
        filter_message(args[0], args[1])
    elif mode == 'train':
        if len(args) != 2:
            usage(2, 'wrong number of arguments')
        train_messages(args[0], args[1])
    elif mode == 'score':
        if args:
            for msg in args:
                print_message_score(msg, open(msg))
        else:
            print_message_score('<stdin>', sys.stdin)


if __name__ == "__main__":
    main()

--- NEW FILE: sb_mboxtrain.py ---
#! /usr/bin/env python

### Train spambayes on all previously-untrained messages in a mailbox.
###
### This keeps track of messages it's already trained by adding an
### X-Spambayes-Trained: header to each one.  Then, if you move one to
### another folder, it will retrain that message.  You would want to run
### this from a cron job on your server.

"""Usage: %(program)s [OPTIONS] ...

Where OPTIONS is one or more of:
    -h
        show usage and exit
    -d DBNAME
        use the DBM store.  A DBM file is larger than the pickle and
        creating it is slower, but loading it is much faster,
        especially for large word databases.  Recommended for use with
        hammiefilter or any procmail-based filter.
    -D DBNAME
        use the pickle store.  A pickle is smaller and faster to create,
        but much slower to load.  Recommended for use with pop3proxy and
        hammiesrv.
    -g PATH
        mbox or directory of known good messages (non-spam) to train on.
        Can be specified more than once.
    -s PATH
        mbox or directory of known spam messages to train on.
        Can be specified more than once.
    -f
        force training, ignoring the trained header.  Use this if you
        need to rebuild your database from scratch.
    -q
        quiet mode; no output
        
    -n
        train mail residing in "new" directory, in addition to "cur"
        directory, which is always trained (Maildir only)
    -r
        remove mail which was trained on (Maildir only)
"""

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0

import sys, os, getopt
from spambayes import hammie, mboxutils
from spambayes.Options import options

program = sys.argv[0]
TRAINED_HDR = "X-Spambayes-Trained"
loud = True

def msg_train(h, msg, is_spam, force):
    """Train bayes with a single message."""

    # XXX: big hack -- why is email.Message unable to represent
    # multipart/alternative?
    try:
        msg.as_string()
    except TypeError:
        # We'll be unable to represent this as text :(
        return False

    if is_spam:
        spamtxt = options["Headers", "header_spam_string"]
    else:
        spamtxt = options["Headers", "header_ham_string"]
    oldtxt = msg.get(TRAINED_HDR)
    if force:
        # Train no matter what.
        if oldtxt != None:
            del msg[TRAINED_HDR]
    elif oldtxt == spamtxt:
        # Skip this one, we've already trained with it.
        return False
    elif oldtxt != None:
        # It's been trained, but as something else.  Untrain.
        del msg[TRAINED_HDR]
        h.untrain(msg, not is_spam)
    h.train(msg, is_spam)
    msg.add_header(TRAINED_HDR, spamtxt)

    return True

def maildir_train(h, path, is_spam, force, removetrained):
    """Train bayes with all messages from a maildir."""

    if loud: print "  Reading %s as Maildir" % (path,)

    import time
    import socket

    pid = os.getpid()
    host = socket.gethostname()
    counter = 0
    trained = 0

    for fn in os.listdir(path):
        cfn = os.path.join(path, fn)
        tfn = os.path.normpath(os.path.join(path, "..", "tmp",
                           "%d.%d_%d.%s" % (time.time(), pid,
                                            counter, host)))
        if (os.path.isdir(cfn)):
            continue
        counter += 1
        if loud:
            sys.stdout.write("  %s        \r" % fn)
            sys.stdout.flush()
        f = file(cfn, "rb")
        msg = mboxutils.get_message(f)
        f.close()
        if not msg_train(h, msg, is_spam, force):
            continue
        trained += 1
        f = file(tfn, "wb")
        f.write(msg.as_string())
        f.close()
        # XXX: This will raise an exception on Windows.  Do any Windows
        # people actually use Maildirs?
        os.rename(tfn, cfn)
        if (removetrained):
            os.unlink(cfn)

    if loud:
        print ("  Trained %d out of %d messages                " %
               (trained, counter))

def mbox_train(h, path, is_spam, force):
    """Train bayes with a Unix mbox"""

    if loud: print "  Reading as Unix mbox"

    import mailbox
    import fcntl
    import tempfile

    # Open and lock the mailbox.  Some systems require it be opened for
    # writes in order to assert an exclusive lock.
    f = file(path, "r+b")
    fcntl.flock(f, fcntl.LOCK_EX)
    mbox = mailbox.PortableUnixMailbox(f, mboxutils.get_message)

    outf = os.tmpfile()
    counter = 0
    trained = 0

    for msg in mbox:
        counter += 1
        if loud:
            sys.stdout.write("  %s\r" % counter)
            sys.stdout.flush()
        if msg_train(h, msg, is_spam, force):
            trained += 1
        # Write it out with the Unix "From " line
        outf.write(msg.as_string(True))

    outf.seek(0)
    try:
        os.ftruncate(f.fileno(), 0)
        f.seek(0)
    except:
        # If anything goes wrong, don't try to write
        print "Problem truncating mbox--nothing written"
        raise
    try:
        for line in outf.xreadlines():
            f.write(line)
    except:
        print >> sys.stderr, ("Problem writing mbox!  Sorry, "
                              "I tried my best, but your mail "
                              "may be corrupted.")
        raise
    # Release the lock with the same call family that acquired it.
    fcntl.flock(f, fcntl.LOCK_UN)
    f.close()
    if loud:
        print ("  Trained %d out of %d messages                " %
               (trained, counter))

def mhdir_train(h, path, is_spam, force):
    """Train bayes with an mh directory"""

    if loud: print "  Reading as MH mailbox"

    import glob

    counter = 0
    trained = 0

    for fn in glob.glob(os.path.join(path, "[0-9]*")):
        counter += 1

        cfn = fn
        tfn = os.path.join(path, "spambayes.tmp")
        if loud:
            sys.stdout.write("  %s        \r" % fn)
            sys.stdout.flush()
        f = file(fn, "rb")
        msg = mboxutils.get_message(f)
        f.close()
        # Count and rewrite only the messages that were actually trained.
        if not msg_train(h, msg, is_spam, force):
            continue
        trained += 1
        f = file(tfn, "wb")
        f.write(msg.as_string())
        f.close()

        # XXX: This will raise an exception on Windows.  Do any Windows
        # people actually use MH directories?
        os.rename(tfn, cfn)

    if loud:
        print ("  Trained %d out of %d messages                " %
               (trained, counter))

def train(h, path, is_spam, force, trainnew, removetrained):
    if not os.path.exists(path):
        raise ValueError("Nonexistent path: %s" % path)
    elif os.path.isfile(path):
        mbox_train(h, path, is_spam, force)
    elif os.path.isdir(os.path.join(path, "cur")):
        maildir_train(h, os.path.join(path, "cur"), is_spam, force, removetrained)
        if trainnew:
            maildir_train(h, os.path.join(path, "new"), is_spam, force, removetrained)
    elif os.path.isdir(path):
        mhdir_train(h, path, is_spam, force)
    else:
        raise ValueError("Unable to determine mailbox type: " + path)


def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def main():
    """Main program; parse options and go."""

    global loud

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hfqnrd:D:g:s:')
    except getopt.error, msg:
        usage(2, msg)

    if not opts:
        usage(2, "No options given")

    pck = None
    usedb = None
    force = False
    trainnew = False
    removetrained = False
    good = []
    spam = []
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == "-f":
            force = True
        elif opt == "-n":
            trainnew = True
        elif opt == "-q":
            loud = False
        elif opt == '-g':
            good.append(arg)
        elif opt == '-s':
            spam.append(arg)
        elif opt == "-r":
            removetrained = True
        elif opt == "-d":
            usedb = True
            pck = arg
        elif opt == "-D":
            usedb = False
            pck = arg
    if args:
        usage(2, "Positional arguments not allowed")

    if usedb == None:
        usage(2, "Must specify one of -d or -D")

    h = hammie.open(pck, usedb, "c")
    # Only store the database if at least one mailbox was trained.
    save = False

    for g in good:
        if loud: print "Training ham (%s):" % g
        train(h, g, False, force, trainnew, removetrained)
        save = True

    for s in spam:
        if loud: print "Training spam (%s):" % s
        train(h, s, True, force, trainnew, removetrained)
        save = True

    if save:
        h.store()


if __name__ == "__main__":
    main()

--- NEW FILE: sb_notesfilter.py ---
#! /usr/bin/env python

'''notesfilter.py - Lotus Notes Spambayes interface.

Classes:

Abstract:

    This module uses Spambayes as a filter against a Lotus Notes mail
    database.  The Notes client must be running when this process is
    executed.
    
    It requires a Notes folder, named as a parameter, with four
    subfolders:
        Spam
        Ham
        Train as Spam
        Train as Ham

    Depending on the execution parameters, it will do any or all of the
    following steps, in the order given.

    1. Train Spam from the Train as Spam folder (-t option)
    2. Train Ham from the Train as Ham folder (-t option)
    3. Replicate (-r option)
    4. Classify the inbox (-c option)
        
    Mail that is to be trained as spam should be manually moved to
    that folder by the user. Likewise mail that is to be trained as
    ham.  After training, spam is moved to the Spam folder and ham is
    moved to the Ham folder.

    Replication takes place if a remote server has been specified.
    This step may take a long time, depending on replication
    parameters and how much information there is to download, as well
    as line speed and server load.  Please be patient if you run with
    replication.  There is currently no progress bar or anything like
    that to tell you that it's working, but it is and will complete
    eventually.  There is also no mechanism for notifying you that the
    replication failed.  If it did, there is no harm done, and the program
    will continue execution.

    Mail that is classified as Spam is moved from the inbox to the
    Train as Spam folder.  You should occasionally review your Spam
    folder for Ham that has mistakenly been classified as Spam.  If
    there is any there, move it to the Train as Ham folder, so
    Spambayes will be less likely to make this mistake again.

    Mail that is classified as Ham or Unsure is left in the inbox.
    There is currently no means of telling if a mail was classified as
    Ham or Unsure.

    You should occasionally select some Ham and move it to the Train
    as Ham folder, so Spambayes can tell the difference between Spam
    and Ham. The goal is to maintain a relative balance between the
    number of Spam and the number of Ham that have been trained into
    the database. These numbers are reported every time this program
    executes.  However, if the amount of Spam you receive far exceeds
    the amount of Ham you receive, it may be very difficult to
    maintain this balance.  This is not a matter of great concern.
    Spambayes will still make very few mistakes in this circumstance.
    But, if this is the case, you should review your Spam folder for
    falsely classified Ham, and retrain those that you find, on a
    regular basis.  This will prevent statistical error accumulation,
    which if allowed to continue, would cause Spambayes to tend to
    classify everything as Spam.
    
    Because there is no programmatic way to determine if a particular
    mail has been previously processed by this classification program,
    it keeps a pickled dictionary of notes mail ids, so that once a
    mail has been classified, it will not be classified again.  The
    non-existence of this index file, named <local database>.sbindex,
    indicates to the system that this is an initialization execution.
    Rather than classify the inbox in this case, the contents of the
    inbox are placed in the index to note the 'starting point' of the
    system.  After that, any new messages in the inbox are eligible
    for classification.
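
    As a rough illustration of the index behaviour described above (a
    sketch only, not the code this script runs; load_index and
    unseen_ids are hypothetical names introduced for the example):

```python
import os
import pickle

def load_index(path):
    # Hypothetical helper: a missing index file marks an
    # initialization run, so report that alongside an empty index.
    if not os.path.exists(path):
        return {}, True
    fp = open(path, 'rb')
    try:
        return pickle.load(fp), False
    finally:
        fp.close()

def unseen_ids(inbox_ids, index, first_run):
    # On the first run every id is only recorded as the starting
    # point; nothing is returned for classification.
    if first_run:
        for nid in inbox_ids:
            index[nid] = 'never classified'
        return []
    # Afterwards only ids missing from the index are eligible.
    return [nid for nid in inbox_ids if nid not in index]
```

    The point of the two-step shape is that mail already in the inbox
    on the first run is never classified, only remembered.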

Usage:
    notesfilter [options]

        note: option values with spaces in them must be enclosed
              in double quotes

        options:
            -d  dbname  : pickled training database filename
            -D  dbname  : dbm training database filename
            -l  dbname  : database filename of local mail replica
                            e.g. localmail.nsf
            -r  server  : server address of the server mail database
                            e.g. d27ml602/27/M/IBM
                          if specified, will initiate a replication
            -f  folder  : Name of spambayes folder
                            must have subfolders: Spam
                                                  Ham
                                                  Train as Spam
                                                  Train as Ham
            -t          : train contents of Train as Spam and Train as Ham
            -c          : classify inbox
            -h          : help
            -p          : prompt "Press Enter to end" before ending
                          This is useful for automated executions where the
                          statistics output would otherwise be lost when the
                          window closes.

Examples:

    Replicate and classify inbox
        notesfilter -c -d notesbayes -r mynoteserv -l mail.nsf -f Spambayes
        
    Train Spam and Ham, then classify inbox
        notesfilter -t -c -d notesbayes -l mail.nsf -f Spambayes
    
    Replicate, then classify inbox      
        notesfilter -c -d test7 -l mail.nsf -r mynoteserv -f Spambayes
 
To Do:
    o Dump/purge notesindex file
    o Create correct folders if they do not exist
    o Options for some of this stuff?
    o pop3proxy style training/configuration interface?
    o parameter to retrain?
    o Suggestions?
    '''

# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

# __future__ imports must come before any other statement in the module.
from __future__ import generators

__author__ = "Tim Stone <tim at fourstonesExpressions.com>"
__credits__ = "Mark Hammond, for his remarkable win32 modules."

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0
    def bool(val):
        return not not val

import sys
from spambayes import tokenizer, storage
from spambayes.Options import options
import cPickle as pickle
import errno
import win32com.client
import pywintypes
import getopt


def classifyInbox(v, vmoveto, bayes, ldbname, notesindex):

    # the notesindex hash ensures that a message is looked at only once

    if len(notesindex.keys()) == 0:
        firsttime = 1
    else:
        firsttime = 0
        
    docstomove = []
    numham = 0
    numspam = 0
    numuns = 0
    numdocs = 0
    
    doc = v.GetFirstDocument()
    while doc:
        nid = doc.NOTEID
        if firsttime:
            notesindex[nid] = 'never classified'
        else:
            if not notesindex.has_key(nid):

                numdocs += 1

                # Notes returns strings in unicode, and the Python
                # uni-decoder has trouble with these strings when
                # you try to print them.  So don't...

                # The com interface returns basic data types as tuples
                # only, thus the subscript on GetItemValue
                
                try:
                    subj = doc.GetItemValue('Subject')[0]
                except:
                    subj = 'No Subject'

                try:
                    body  = doc.GetItemValue('Body')[0]
                except:
                    body = 'No Body'

                message = "Subject: %s\r\n\r\n%s" % (subj, body)

                # generate_long_skips = True blows up on occasion,
                # probably due to this unicode problem.
                options["Tokenizer", "generate_long_skips"] = False
                tokens = tokenizer.tokenize(message)
                prob, clues = bayes.spamprob(tokens, evidence=True)

                if prob < options["Categorization", "ham_cutoff"]:
                    disposition = options["Hammie", "header_ham_string"]
                    numham += 1
                elif prob > options["Categorization", "spam_cutoff"]:
                    disposition = options["Hammie", "header_spam_string"]
                    docstomove += [doc]
                    numspam += 1
                else:
                    disposition = options["Hammie", "header_unsure_string"]
                    numuns += 1

                notesindex[nid] = 'classified'
                try:
                    print "%s spamprob is %s" % (subj[:30], prob)
                except UnicodeError:
                    print "<subject not printed> spamprob is %s" % (prob)

        doc = v.GetNextDocument(doc)

    # docstomove list is built because moving documents in the middle of
    # the classification loop loses the iterator position
    for doc in docstomove:
        doc.RemoveFromFolder(v.Name)
        doc.PutInFolder(vmoveto.Name)

    print "%s documents processed" % (numdocs)
    print "   %s classified as spam" % (numspam)
    print "   %s classified as ham" % (numham)
    print "   %s classified as unsure" % (numuns)
    

def processAndTrain(v, vmoveto, bayes, is_spam, notesindex):

    if is_spam:
        str = options["Hammie", "header_spam_string"]
    else:
        str = options["Hammie", "header_ham_string"]

    print "Training %s" % (str)
    
    docstomove = []
    doc = v.GetFirstDocument()
    while doc:
        try:
            subj = doc.GetItemValue('Subject')[0]
        except:
            subj = 'No Subject'

        try:
            body  = doc.GetItemValue('Body')[0]
        except:
            body = 'No Body'
            
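        # Use the same header/body separator as classifyInbox, so that
        # training and classification tokenize a message identically.
        message = "Subject: %s\r\n\r\n%s" % (subj, body)

        options["Tokenizer", "generate_long_skips"] = False
        # Materialize the tokens: a retrained message is passed to both
        # unlearn() and learn(), and a generator can be consumed only once.
        tokens = list(tokenizer.tokenize(message))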

        nid = doc.NOTEID
        if notesindex.has_key(nid):
            trainedas = notesindex[nid]
            if trainedas == options["Hammie", "header_spam_string"] and \
               not is_spam:
                # msg is trained as spam, is to be retrained as ham
                bayes.unlearn(tokens, True)
            elif trainedas == options["Hammie", "header_ham_string"] and \
                 is_spam:
                # msg is trained as ham, is to be retrained as spam
                bayes.unlearn(tokens, False)
  
        bayes.learn(tokens, is_spam)

        notesindex[nid] = str
        docstomove += [doc]
        doc = v.GetNextDocument(doc)

    for doc in docstomove:
        doc.RemoveFromFolder(v.Name)
        doc.PutInFolder(vmoveto.Name)

    print "%s documents trained" % (len(docstomove))
    

def run(bdbname, useDBM, ldbname, rdbname, foldname, doTrain, doClassify):

    if useDBM:
        bayes = storage.DBDictClassifier(bdbname)
    else:
        bayes = storage.PickledClassifier(bdbname)

    try:
        fp = open("%s.sbindex" % (ldbname), 'rb')
    except IOError, e:
        if e.errno != errno.ENOENT: raise
        notesindex = {}
        print "%s.sbindex file not found, this is a first time run" \
              % (ldbname)
        print "No classification will be performed"
    else:
        notesindex = pickle.load(fp)
        fp.close()
     
    sess = win32com.client.Dispatch("Lotus.NotesSession")
    try:
        sess.initialize()
    except pywintypes.com_error:
        print "Session aborted"
        sys.exit()
        
    db = sess.GetDatabase("",ldbname)
    
    vinbox = db.getView('($Inbox)')
    vspam = db.getView("%s\\Spam" % (foldname))
    vham = db.getView("%s\\Ham" % (foldname))
    vtrainspam = db.getView("%s\\Train as Spam" % (foldname))
    vtrainham = db.getView("%s\\Train as Ham" % (foldname))
    
    if doTrain:
        processAndTrain(vtrainspam, vspam, bayes, True, notesindex)
        # for some reason, using inbox as a target here loses the mail
        processAndTrain(vtrainham, vham, bayes, False, notesindex)
        
    if rdbname:
        print "Replicating..."
        db.Replicate(rdbname)
        print "Done"
        
    if doClassify:
        classifyInbox(vinbox, vtrainspam, bayes, ldbname, notesindex)

    print "The Spambayes database currently has %s Spam and %s Ham" \
        % (bayes.nspam, bayes.nham)

    bayes.store()

    fp = open("%s.sbindex" % (ldbname), 'wb')
    pickle.dump(notesindex, fp)
    fp.close()
    

if __name__ == '__main__':

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'htcpd:D:l:r:f:')
    except getopt.error, msg:
        print >>sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit()

    bdbname = None  # bayes database name
    ldbname = None  # local notes database name
    rdbname = None  # remote notes database location
    sbfname = None  # spambayes folder name
    doTrain = False
    doClassify = False
    doPrompt = False

    for opt, arg in opts:
        if opt == '-h':
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == '-d':
            useDBM = False
            bdbname = arg
        elif opt == '-D':
            useDBM = True
            bdbname = arg
        elif opt == '-l':
            ldbname = arg
        elif opt == '-r':
            rdbname = arg
        elif opt == '-f':
            sbfname = arg
        elif opt == '-t':
            doTrain = True
        elif opt == '-c':
            doClassify = True
        elif opt == '-p':
            doPrompt = True

    if (bdbname and ldbname and sbfname and (doTrain or doClassify)):
        run(bdbname, useDBM, ldbname, rdbname, \
            sbfname, doTrain, doClassify)

        if doPrompt:
            # raw_input returns the typed line as-is; input() would try
            # to evaluate it as a Python expression.
            raw_input("Press Enter to end")
    else:
        print >>sys.stderr, __doc__
--- NEW FILE: sb_pop3dnd.py ---
#!/usr/bin/env python

"""
Overkill (someone *please* come up with something to call this script!)

This application is a twisted cross between a POP3 proxy and an IMAP
server.  It sits between your mail client and your POP3 server (like any
other POP3 proxy).  While messages classified as ham are simply passed
through the proxy, messages that are classified as spam or unsure are
intercepted and passed to the IMAP server.  The IMAP server offers three
folders - one where messages classified as spam end up, one for messages
it is unsure about, and one for training ham.

In other words, to use this application, set up your mail client to
connect to localhost, rather than directly to your POP3 server.
Additionally, add a new IMAP account, also connecting to localhost.  Set
up the application via the web interface, and you are ready to go.  Good
messages will appear as per normal, but you will also have two new
incoming folders, one for spam and one for unsure messages.

To train SpamBayes, use the 'spam' folder and the 'train_as_ham' folder.
Any messages in these folders will be trained appropriately.  This means
that all messages that SpamBayes classifies as spam will also be trained
as such.  If you receive any 'false positives' (ham classified as spam),
you *must* copy the message into the 'train_as_ham' folder to correct the
training.  You may also place any saved spam messages you have into the
'spam' folder.

So that SpamBayes knows about ham as well as spam, you will also need to
move or copy mail into the 'train_as_ham' folder.  These may come from
the unsure folder, or from any other mail you have saved.  It is a good
idea to leave messages in the 'train_as_ham' and 'spam' folders, so that
you can retrain from scratch if required.  (However, you should always
clear out your unsure folder, preferably moving or copying the messages
into the appropriate training folder).

This SpamBayes application is designed to work with Outlook Express, and
provide the same sort of ease of use as the Outlook plugin.  Although the
majority of development and testing has been done with Outlook Express,
any mail client that supports both IMAP and POP3 should be able to use this
application - if the client enables the user to work with an IMAP account
and POP3 account side-by-side (and move messages between them), then it
should work as well as it does with Outlook Express.

This module includes the following classes:
 o IMAPFileMessage
 o IMAPFileMessageFactory
 o IMAPMailbox
 o SpambayesMailbox
 o Trainer
 o SpambayesAccount
 o SpambayesIMAPServer
 o OneParameterFactory
 o MyBayesProxy
 o MyBayesProxyListener
 o IMAPState
"""

# The __future__ import follows the docstring so that the text above is
# still the module's __doc__.
from __future__ import generators

todo = """
 o Message flags are currently not persisted, but should be.  The
   IMAPFileMessage class should be extended to do this.  The same
   goes for the 'internaldate' of the message.
 o The RECENT flag should be unset at some point, but when?  The
   RFC says that a message is recent if this is the first session
   to be notified about the message.  Perhaps this can be done
   simply by *not* persisting this flag - i.e. the flag is always
   loaded as not recent, and only new messages are recent.  The
   RFC says that if it is not possible to determine, then all
   messages should be recent, and this is what we currently do.
 o The Mailbox should be calling the appropriate listener
   functions (currently only newMessages is called on addMessage).
   flagsChanged should also be called on store, addMessage, or ???
 o We cannot currently get part of a message via the BODY calls
   (with the <> operands), or get a part of a MIME message (by
   prepending a number).  This should be added!
 o If the user clicks the 'save and shutdown' button on the web
   interface, this will only kill the POP3 proxy and web interface
   threads, and not the IMAP server.  We need to monitor the thread
   that we kick off, and if it dies, we should die too.  Need to figure
   out how to do this in twisted.
 o Suggestions?
"""

# This module is part of the spambayes project, which is Copyright 2002-3
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

__author__ = "Tony Meyer <ta-meyer at ihug.co.nz>"
__credits__ = "All the Spambayes folk."

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0

import os
import re
import sys
import md5
import time
import errno
import types
import thread
import getopt
import imaplib
import operator
import StringIO
import email.Utils

from twisted import cred
from twisted.internet import defer
from twisted.internet import reactor
from twisted.internet.app import Application
from twisted.internet.defer import maybeDeferred
from twisted.internet.protocol import ServerFactory
from twisted.protocols.imap4 import parseNestedParens, parseIdList
from twisted.protocols.imap4 import IllegalClientResponse, IAccount
from twisted.protocols.imap4 import collapseNestedLists, MessageSet
from twisted.protocols.imap4 import IMAP4Server, MemoryAccount, IMailbox
from twisted.protocols.imap4 import IMailboxListener

# Provide for those that don't have spambayes on their PYTHONPATH
sys.path.insert(-1, os.path.dirname(os.getcwd()))

from spambayes.Options import options
from spambayes.message import Message
from spambayes.tokenizer import tokenize
from spambayes import FileCorpus, Dibbler
from spambayes.Version import get_version_string
from spambayes.ServerUI import ServerUserInterface
from spambayes.UserInterface import UserInterfaceServer
from pop3proxy import POP3ProxyBase, State, _addressPortStr, _recreateState

def ensureDir(dirname):
    """Ensure that the given directory exists - in other words, if it
    does not exist, attempt to create it."""
    try:
        os.mkdir(dirname)
        if options["globals", "verbose"]:
            print "Creating directory", dirname
    except OSError, e:
        if e.errno != errno.EEXIST:
            raise


class IMAPFileMessage(FileCorpus.FileMessage):
    '''IMAP Message that persists as a file system artifact.'''

    def __init__(self, file_name, directory):
        """Constructor(message file name, corpus directory name)."""
        FileCorpus.FileMessage.__init__(self, file_name, directory)
        self.id = file_name
        self.directory = directory
        self.date = imaplib.Time2Internaldate(time.time())[1:-1]
        self.clear_flags()

    # IMessage implementation
    def getHeaders(self, negate, names):
        """Retrieve a group of message headers."""
        headers = {}
        if not isinstance(names, tuple):
            names = (names,)
        for header, value in self.items():
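            # With negate set, return the headers that are *not* listed.
            listed = header.upper() in names
            if names == () or (listed and not negate) or \
               (negate and not listed):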
                headers[header.upper()] = value
        return headers

    def getFlags(self):
        """Retrieve the flags associated with this message."""
        return self._flags_iter()

    def _flags_iter(self):    
        if self.deleted:
            yield "\\DELETED"
        if self.answered:
            yield "\\ANSWERED"
        if self.flagged:
            yield "\\FLAGGED"
        if self.seen:
            yield "\\SEEN"
        if self.draft:
            yield "\\DRAFT"
        if self.recent:
            yield "\\RECENT"

    def getInternalDate(self):
        """Retrieve the date internally associated with this message."""
        return self.date

    def getBodyFile(self):
        """Retrieve a file object containing the body of this message."""
        # Note only body, not headers!
        s = StringIO.StringIO()
        s.write(self.body())
        s.seek(0)
        return s
        #return file(os.path.join(self.directory, self.id), "r")

    def getSize(self):
        """Retrieve the total size, in octets, of this message."""
        return len(self.as_string())

    def getUID(self):
        """Retrieve the unique identifier associated with this message."""
        return self.id

    def getSubPart(self, part):
        """Retrieve a MIME sub-message
        
        @type part: C{int}
        @param part: The number of the part to retrieve, indexed from 0.
        
        @rtype: Any object implementing C{IMessage}.
        @return: The specified sub-part.
        """

    # IMessage implementation ends

    def clear_flags(self):
        """Set all message flags to false."""
        self.deleted = False
        self.answered = False
        self.flagged = False
        self.seen = False
        self.draft = False
        self.recent = False

    def set_flag(self, flag, value):
        # invalid flags are ignored
        flag = flag.upper()
        if flag == "\\DELETED":
            self.deleted = value
        elif flag == "\\ANSWERED":
            self.answered = value
        elif flag == "\\FLAGGED":
            self.flagged = value
        elif flag == "\\SEEN":
            self.seen = value
        elif flag == "\\DRAFT":
            self.draft = value
        else:
            print "Tried to set invalid flag", flag, "to", value
            
    def flags(self):
        """Return the message flags."""
        all_flags = []
        if self.deleted:
            all_flags.append("\\DELETED")
        if self.answered:
            all_flags.append("\\ANSWERED")
        if self.flagged:
            all_flags.append("\\FLAGGED")
        if self.seen:
            all_flags.append("\\SEEN")
        if self.draft:
            all_flags.append("\\DRAFT")
        if self.recent:
            all_flags.append("\\RECENT")
        return all_flags

    def train(self, classifier, isSpam):
        if self.GetTrained() == (not isSpam):
            classifier.unlearn(self.asTokens(), not isSpam)
            self.RememberTrained(None)
        if self.GetTrained() is None:
            classifier.learn(self.asTokens(), isSpam)
            self.RememberTrained(isSpam)
        classifier.store()

    def structure(self, ext=False):
        """Body structure data describes the MIME-IMB
        format of a message and consists of a sequence of mime type, mime
        subtype, parameters, content id, description, encoding, and size. 
        The fields following the size field are variable: if the mime
        type/subtype is message/rfc822, the contained message's envelope
        information, body structure data, and number of lines of text; if
        the mime type is text, the number of lines of text.  Extension fields
        may also be included; if present, they are: the MD5 hash of the body,
        body disposition, body language."""
        s = []
        for part in self.walk():
            if part.get_content_charset() is not None:
                charset = ("charset", part.get_content_charset())
            else:
                charset = None
            part_s = [part.get_main_type(), part.get_subtype(),
                      charset,
                      part.get('Content-Id'),
                      part.get('Content-Description'),
                      part.get('Content-Transfer-Encoding'),
                      str(len(part.as_string()))]
            #if part.get_type() == "message/rfc822":
            #    part_s.extend([envelope, body_structure_data,
            #                  part.as_string().count("\n")])
            #elif part.get_main_type() == "text":
            if part.get_main_type() == "text":
                part_s.append(str(part.as_string().count("\n")))
            if ext:
                part_s.extend([md5.new(part.as_string()).digest(),
                               part.get('Content-Disposition'),
                               part.get('Content-Language')])
            s.append(part_s)
        if len(s) == 1:
            return s[0]
        return s

    def body(self):    
        rfc822 = self.as_string()
        bodyRE = re.compile(r"\r?\n(\r?\n)(.*)",
                            re.DOTALL + re.MULTILINE)
        bmatch = bodyRE.search(rfc822)
        return bmatch.group(2)

    def headers(self):
        rfc822 = self.as_string()
        bodyRE = re.compile(r"\r?\n(\r?\n)(.*)",
                            re.DOTALL + re.MULTILINE)
        bmatch = bodyRE.search(rfc822)
        return rfc822[:bmatch.start(2)]

    def on(self, date1, date2):
        "contained within the date"
        raise NotImplementedError
    def before(self, date1, date2):
        "before the date"
        raise NotImplementedError
    def since(self, date1, date2):
        "within or after the date"
        raise NotImplementedError

    def string_contains(self, whole, sub):
        # A missing header (None) cannot contain anything.
        return whole is not None and whole.find(sub) != -1
        
    def matches(self, criteria):
        """Return True iff the message matches the specified IMAP
        criteria."""
        match_tests = {"ALL" : [(True, True)],
                       "ANSWERED" : [(self.answered, True)],
                       "DELETED" : [(self.deleted, True)],
                       "DRAFT" : [(self.draft, True)],
                       "FLAGGED" : [(self.flagged, True)],
                       "NEW" : [(self.recent, True), (self.seen, False)],
                       "RECENT" : [(self.recent, True)],
                       "SEEN" : [(self.seen, True)],
                       "UNANSWERED" : [(self.answered, False)],
                       "UNDELETED" : [(self.deleted, False)],
                       "UNDRAFT" : [(self.draft, False)],
                       "UNFLAGGED" : [(self.flagged, False)],
                       "UNSEEN" : [(self.seen, False)],
                       "OLD" : [(self.recent, False)],
                       }
        # SMALLER and LARGER take their operand as a string, so convert it
        # to a number before comparing it with the message size.
        complex_tests = {"BCC" : (self.string_contains, self.get("Bcc")),
                         "SUBJECT" : (self.string_contains, self.get("Subject")),
                         "CC" : (self.string_contains, self.get("Cc")),
                         "BODY" : (self.string_contains, self.body()),
                         "TO" : (self.string_contains, self.get("To")),
                         "TEXT" : (self.string_contains, self.as_string()),
                         "FROM" : (self.string_contains, self.get("From")),
                         "SMALLER" : (lambda size, n: size < long(n),
                                      len(self.as_string())),
                         "LARGER" : (lambda size, n: size > long(n),
                                     len(self.as_string())),
                         "BEFORE" : (self.before, self.date),
                         "ON" : (self.on, self.date),
                         "SENTBEFORE" : (self.before, self.get("Date")),
                         "SENTON" : (self.on, self.get("Date")),
                         "SENTSINCE" : (self.since, self.get("Date")),
                         "SINCE" : (self.since, self.date),
                         }
                       
        result = True
        test = None
        header = None
        header_field = None
        for c in criteria:
            if match_tests.has_key(c) and test is None and header is None:
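                # Use separate names here: rebinding 'test' and 'result'
                # would discard the accumulated result and leave 'test' set.
                for (actual, expected) in match_tests[c]:
                    result = result and (actual == expected)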
            elif complex_tests.has_key(c) and test is None and header is None:
                test = complex_tests[c]
            elif test is not None and header is None:
                result = result and test[0](test[1], c)
                test = None
            elif c == "HEADER" and test is None:
                # the only criterion that uses the next _two_ elements
                header = c
            elif test is None and header is not None and header_field is None:
                header_field = c
            elif header is not None and header_field is not None and test is None:
                result = result and self.string_contains(self.get(header_field), c)
                header = None
                header_field = None
        return result
"""
Still to do:
      <message set>  Messages with message sequence numbers
                     corresponding to the specified message sequence
                     number set
      UID <message set>
                     Messages with unique identifiers corresponding to
                     the specified unique identifier set.

      KEYWORD <flag> Messages with the specified keyword set.
      UNKEYWORD <flag>
                     Messages that do not have the specified keyword
                     set.

      NOT <search-key>
                     Messages that do not match the specified search
                     key.

      OR <search-key1> <search-key2>
                     Messages that match either search key.
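
One possible shape for the NOT and OR keys, given only as a hedged sketch:
it assumes search keys arrive as nested tuples such as ("NOT", key) and
("OR", key1, key2), which is an assumption made for this example and not
necessarily how twisted hands them over; evaluate and match_one are
hypothetical names.

```python
def evaluate(key, match_one):
    # key is either a plain search key (a string) or a tuple whose
    # first element is "NOT" or "OR" (assumed representation).
    if isinstance(key, tuple) and key[0] == "NOT":
        return not evaluate(key[1], match_one)
    if isinstance(key, tuple) and key[0] == "OR":
        return evaluate(key[1], match_one) or evaluate(key[2], match_one)
    # Plain keys are delegated to the caller-supplied predicate.
    return match_one(key)

# Stand-in for a message's flag tests.
flags = {"SEEN": True, "DELETED": False}

def has_flag(key):
    return flags.get(key, False)
```

Recursing on the sub-keys lets NOT and OR nest to any depth without
special cases in the main matching loop.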
"""


class IMAPFileMessageFactory(FileCorpus.FileMessageFactory):
    '''MessageFactory for IMAPFileMessage objects'''
    def create(self, key, directory):
        '''Create a message object from a filename in a directory'''
        return IMAPFileMessage(key, directory)


class IMAPMailbox(cred.perspective.Perspective):
    __implements__ = (IMailbox,)

    def __init__(self, name, identity_name, id):
        cred.perspective.Perspective.__init__(self, name, identity_name)
        self.UID_validity = id
        self.listeners = []

    def getUIDValidity(self):
        """Return the unique validity identifier for this mailbox."""
        return self.UID_validity

    def addListener(self, listener):
        """Add a mailbox change listener."""
        self.listeners.append(listener)
    
    def removeListener(self, listener):
        """Remove a mailbox change listener."""
        self.listeners.remove(listener)


class SpambayesMailbox(IMAPMailbox):
    def __init__(self, name, id, directory):
        IMAPMailbox.__init__(self, name, "spambayes", id)
        self.UID_validity = id
        ensureDir(directory)
        self.storage = FileCorpus.FileCorpus(IMAPFileMessageFactory(),
                                             directory, r"[0123456789]*")
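        # UIDs are required to be strictly ascending, so continue from the
        # highest stored UID rather than relying on the order of keys().
        if len(self.storage.keys()) == 0:
            self.nextUID = 0
        else:
            self.nextUID = max([long(k) for k in self.storage.keys()]) + 1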
        # Calculate initial recent and unseen counts
        # XXX Note that this will always end up with zero counts
        # XXX until the flags are persisted.
        self.unseen_count = 0
        self.recent_count = 0
        for msg in self.storage:
            if not msg.seen:
                self.unseen_count += 1
            if msg.recent:
                self.recent_count += 1
    
    def getUIDNext(self, increase=False):
        """Return the likely UID for the next message added to this
        mailbox."""
        reply = str(self.nextUID)
        if increase:
            self.nextUID += 1
        return reply

    def getUID(self, message):
        """Return the UID of a message in the mailbox."""
        # Note that IMAP messages are 1-based, our messages are 0-based
        d = self.storage
        return long(d.keys()[message - 1])

    def getFlags(self):
        """Return the flags defined in this mailbox."""
        return ["\\Answered", "\\Flagged", "\\Deleted", "\\Seen",
                "\\Draft"]

    def getMessageCount(self):
        """Return the number of messages in this mailbox."""
        return len(self.storage.keys())

    def getRecentCount(self):
        """Return the number of messages with the 'Recent' flag."""
        return self.recent_count

    def getUnseenCount(self):
        """Return the number of messages without the 'Seen' flag."""
        return self.unseen_count
        
    def isWriteable(self):
        """Get the read/write status of the mailbox."""
        return True

    def destroy(self):
        """Called before this mailbox is deleted, permanently."""
        # Our mailboxes cannot be deleted
        raise NotImplementedError

    def getHierarchicalDelimiter(self):
        """Get the character which delimits namespaces in this
        mailbox."""
        return '.'

    def requestStatus(self, names):
        """Return status information about this mailbox."""
        answer = {}
        for request in names:
            request = request.upper()
            if request == "MESSAGES":
                answer[request] = self.getMessageCount()
            elif request == "RECENT":
                answer[request] = self.getRecentCount()
            elif request == "UIDNEXT":
                answer[request] = self.getUIDNext()
            elif request == "UIDVALIDITY":
                answer[request] = self.getUIDValidity()
            elif request == "UNSEEN":
                answer[request] = self.getUnseenCount()
        return answer

    def addMessage(self, message, flags=(), date=None):
        """Add the given message to this mailbox."""
        msg = self.storage.makeMessage(self.getUIDNext(True))
        if date is not None:
            # Keep the default internal date when none is supplied.
            msg.date = date
        msg.setPayload(message.read())
        self.storage.addMessage(msg)
        self.store(MessageSet(long(msg.id), long(msg.id)), flags, 1, True)
        msg.recent = True
        msg.store()
        self.recent_count += 1
        self.unseen_count += 1

        for listener in self.listeners:
            listener.newMessages(self.getMessageCount(),
                                 self.getRecentCount())
        d = defer.Deferred()
        reactor.callLater(0, d.callback, self.storage.keys().index(msg.id))
        return d

    def expunge(self):
        """Remove all messages flagged \\Deleted."""
        deleted_messages = []
        for msg in self.storage:
            if msg.deleted:
                if not msg.seen:
                    self.unseen_count -= 1
                if msg.recent:
                    self.recent_count -= 1
                deleted_messages.append(long(msg.id))
                self.storage.removeMessage(msg)
        if deleted_messages != []:
            for listener in self.listeners:
                listener.newMessages(self.getMessageCount(),
                                     self.getRecentCount())
        return deleted_messages

    def search(self, query, uid):
        """Search for messages that meet the given query criteria.

        @type query: C{list}
        @param query: The search criteria

        @rtype: C{list}
        @return: A list of message sequence numbers or message UIDs which
        match the search criteria.
        """
        if self.getMessageCount() == 0:
            return []
        all_msgs = MessageSet(long(self.storage.keys()[0]),
                              long(self.storage.keys()[-1]))
        matches = []
        for id, msg in self._messagesIter(all_msgs, uid):
            for q in query:
                if msg.matches(q):
                    matches.append(id)
                    break
        return matches

    def _messagesIter(self, messages, uid):
        if uid:
            messages.last = long(self.storage.keys()[-1])
        else:
            messages.last = self.getMessageCount()
        for id in messages:
            if uid:
                msg = self.storage.get(str(id))
            else:
                msg = self.storage.get(str(self.getUID(id)))
            if msg is None:
                # Non-existent message.
                continue
            msg.load()
            yield (id, msg)

    def fetch(self, messages, uid):
        """Retrieve one or more messages."""
        return self._messagesIter(messages, uid)

    def store(self, messages, flags, mode, uid):
        """Set the flags of one or more messages."""
        stored_messages = {}
        for id, msg in self._messagesIter(messages, uid):
            if mode == 0:
                msg.clear_flags()
                value = True
            elif mode == -1:
                value = False
            elif mode == 1:
                value = True
            for flag in flags:
                if flag == '(' or flag == ')':
                    continue
                if flag == "SEEN" and value == True and msg.seen == False:
                    self.unseen_count -= 1
                if flag == "SEEN" and value == False and msg.seen == True:
                    self.unseen_count += 1
                msg.set_flag(flag, value)
            stored_messages[id] = msg.flags()
        return stored_messages
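
# A minimal sketch of how store() above interprets its `mode` argument:
# 0 replaces the flag set, 1 adds flags, -1 removes them.  The function
# name `apply_store_mode` and the use of plain sets are assumptions made
# for illustration; the real code works on message objects.  Unlike
# store(), this sketch raises ValueError for any other mode.

```python
def apply_store_mode(current_flags, flags, mode):
    """Return the flag set that results from a STORE-style update.

    mode 0 replaces the flags, 1 adds to them, -1 removes from them.
    """
    # The parser may hand over literal parentheses; skip them as store() does.
    requested = set(f for f in flags if f not in ('(', ')'))
    current = set(current_flags)
    if mode == 0:
        return requested
    elif mode == 1:
        return current | requested
    elif mode == -1:
        return current - requested
    raise ValueError("unknown STORE mode: %r" % (mode,))


if __name__ == "__main__":
    print(sorted(apply_store_mode(["SEEN"], ["(", "DELETED", ")"], 1)))
```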


class Trainer(object):
    """Listens to a given mailbox and trains new messages as spam or
    ham."""
    __implements__ = (IMailboxListener,)

    def __init__(self, mailbox, asSpam):
        self.mailbox = mailbox
        self.asSpam = asSpam

    def modeChanged(self, writeable):
        # We don't care
        pass
    
    def flagsChanged(self, newFlags):
        # We don't care
        pass

    def newMessages(self, exists, recent):
        # We don't get passed the actual message, or the id of
        # the message, or even the message number.  We just get
        # the total number of new/recent messages.
        # However, this function should be called _every_ time
        # that a new message appears, so we should be able to
        # assume that the last message is the new one.
        # (We ignore the recent count)
        if exists is not None:
            id = self.mailbox.getUID(exists)
            msg = self.mailbox.storage[str(id)]
            msg.train(state.bayes, self.asSpam)


class SpambayesAccount(MemoryAccount):
    """Account for Spambayes server."""

    def __init__(self, id, ham, spam, unsure):
        MemoryAccount.__init__(self, id)
        self.mailboxes = {"SPAM" : spam,
                          "UNSURE" : unsure,
                          "TRAIN_AS_HAM" : ham}

    def select(self, name, readwrite=1):
        # 'INBOX' is a special case-insensitive name meaning the
        # primary mailbox for the user; we interpret this as an alias
        # for 'spam'
        if name.upper() == "INBOX":
            name = "SPAM"
        return MemoryAccount.select(self, name, readwrite)


class SpambayesIMAPServer(IMAP4Server):
    IDENT = "Spambayes IMAP Server IMAP4rev1 Ready"

    def __init__(self, user_account):
        IMAP4Server.__init__(self)
        self.account = user_account

    def authenticateLogin(self, user, passwd):
        """Lookup the account associated with the given parameters."""
        if user == options["imapserver", "username"] and \
           passwd == options["imapserver", "password"]:
            return (IAccount, self.account, None)
        raise cred.error.UnauthorizedLogin()

    def connectionMade(self):
        state.activeIMAPSessions += 1
        state.totalIMAPSessions += 1
        IMAP4Server.connectionMade(self)

    def connectionLost(self, reason):
        state.activeIMAPSessions -= 1
        IMAP4Server.connectionLost(self, reason)

    def do_CREATE(self, tag, args):
        """Creating new folders on the server is not permitted."""
        self.sendNegativeResponse(tag, \
                                  "Creation of new folders is not permitted")
    auth_CREATE = (do_CREATE, IMAP4Server.arg_astring)
    select_CREATE = auth_CREATE

    def do_DELETE(self, tag, args):
        """Deleting folders on the server is not permitted."""
        self.sendNegativeResponse(tag, \
                                  "Deletion of folders is not permitted")
    auth_DELETE = (do_DELETE, IMAP4Server.arg_astring)
    select_DELETE = auth_DELETE


class OneParameterFactory(ServerFactory):
    """A factory that allows a single parameter to be passed to the created
    protocol."""
    def buildProtocol(self, addr):
        """Create an instance of a subclass of Protocol, passing a single
        parameter."""
        if self.parameter is not None:
            p = self.protocol(self.parameter)
        else:
            p = self.protocol()
        p.factory = self
        return p


class MyBayesProxy(POP3ProxyBase):
    """Proxies between an email client and a POP3 server, redirecting
    mail to the imap server as necessary.  It acts on the following
    POP3 commands:

     o RETR:
        o Classifies the message.  Ham is passed through unchanged;
          spam and unsure messages are moved to the matching IMAP
          folder and replaced with a short intercept notice.
    """

    intercept_message = 'From: "Spambayes" <no-reply at localhost>\n' \
                        'Subject: Spambayes Intercept\n\nA message ' \
                        'was intercepted by Spambayes (it scored %s).\n' \
                        '\nYou may find it in the Spam or Unsure ' \
                        'folder.\n\n.\n'

    def __init__(self, clientSocket, serverName, serverPort, spam, unsure):
        POP3ProxyBase.__init__(self, clientSocket, serverName, serverPort)
        self.handlers = {'RETR': self.onRetr}
        state.totalSessions += 1
        state.activeSessions += 1
        self.isClosed = False
        self.spam_folder = spam
        self.unsure_folder = unsure

    def send(self, data):
        """Logs the data to the log file."""
        if options["globals", "verbose"]:
            state.logFile.write(data)
            state.logFile.flush()
        try:
            return POP3ProxyBase.send(self, data)
        except socket.error:
            self.close()

    def recv(self, size):
        """Logs the data to the log file."""
        data = POP3ProxyBase.recv(self, size)
        if options["globals", "verbose"]:
            state.logFile.write(data)
            state.logFile.flush()
        return data

    def close(self):
        # This can be called multiple times by asyncore.
        if not self.isClosed:
            self.isClosed = True
            state.activeSessions -= 1
            POP3ProxyBase.close(self)

    def onTransaction(self, command, args, response):
        """Takes the raw request and response, and returns the
        (possibly processed) response to pass back to the email client.
        """
        handler = self.handlers.get(command, self.onUnknown)
        return handler(command, args, response)

    def onRetr(self, command, args, response):
        """Classifies the message.  If the result is ham, then simply
        pass it through.  If the result is an unsure or spam, move it
        to the appropriate IMAP folder."""
        # Use '\n\r?\n' to detect the end of the headers in case of
        # broken emails that don't use the proper line separators.
        if re.search(r'\n\r?\n', response):
            # Break off the first line, which will be '+OK'.
            ok, messageText = response.split('\n', 1)

            prob = state.bayes.spamprob(tokenize(messageText))
            if prob < options["Categorization", "ham_cutoff"]:
                # Ham: return the +OK and the message unchanged.
                state.numHams += 1
                return ok + "\n" + messageText
            elif prob > options["Categorization", "spam_cutoff"]:
                dest_folder = self.spam_folder
                state.numSpams += 1
            else:
                dest_folder = self.unsure_folder
                state.numUnsure += 1
            msg = StringIO.StringIO(messageText)
            date = imaplib.Time2Internaldate(time.time())[1:-1]
            dest_folder.addMessage(msg, (), date)
            
            # We have to return something, because the client is expecting
            # us to.  We return a short message indicating that a message
            # was intercepted.
            return ok + "\n" + self.intercept_message % (prob,)
        else:
            # Must be an error response.
            return response

    def onUnknown(self, command, args, response):
        """Default handler; returns the server's response verbatim."""
        return response
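
# A minimal sketch of the routing rule used by MyBayesProxy.onRetr above:
# a score below the ham cutoff passes through, a score above the spam
# cutoff goes to the spam folder, and anything in between is unsure.
# The name `route_message` and the default cutoffs are placeholders for
# illustration; the real values come from the "Categorization" options.

```python
def route_message(prob, ham_cutoff=0.2, spam_cutoff=0.9):
    """Return 'ham', 'spam' or 'unsure' for a spam probability."""
    if prob < ham_cutoff:
        return "ham"
    elif prob > spam_cutoff:
        return "spam"
    # Scores equal to either cutoff fall through to unsure, as in onRetr.
    return "unsure"


if __name__ == "__main__":
    for score in (0.05, 0.5, 0.97):
        print("%.2f -> %s" % (score, route_message(score)))
```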


class MyBayesProxyListener(Dibbler.Listener):
    """Listens for incoming email client connections and spins off
    MyBayesProxy objects to serve them.
    """

    def __init__(self, serverName, serverPort, proxyPort, spam, unsure):
        proxyArgs = (serverName, serverPort, spam, unsure)
        Dibbler.Listener.__init__(self, proxyPort, MyBayesProxy, proxyArgs)
        print 'Listener on port %s is proxying %s:%d' % \
               (_addressPortStr(proxyPort), serverName, serverPort)


class IMAPState(State):
    def __init__(self):
        State.__init__(self)

        # Set up the extra statistics.
        self.totalIMAPSessions = 0
        self.activeIMAPSessions = 0

    def buildServerStrings(self):
        """After the server details have been set up, this creates string
        versions of the details, for display in the Status panel."""
        self.serverPortString = str(self.imap_port)
        # Also build proxy strings
        State.buildServerStrings(self)

state = IMAPState()

# ===================================================================
# __main__ driver.
# ===================================================================

def setup():
    # Setup app, boxes, trainers and account
    proxyListeners = []
    app = Application("SpambayesIMAPServer")

    spam_box = SpambayesMailbox("Spam", 0, options["imapserver",
                                                   "spam_directory"])
    unsure_box = SpambayesMailbox("Unsure", 1, options["imapserver",
                                                       "unsure_directory"])
    ham_train_box = SpambayesMailbox("TrainAsHam", 2,
                                     options["imapserver", "ham_directory"])

    spam_trainer = Trainer(spam_box, True)
    ham_trainer = Trainer(ham_train_box, False)
    spam_box.addListener(spam_trainer)
    ham_train_box.addListener(ham_trainer)

    user_account = SpambayesAccount(options["imapserver", "username"],
                                    ham_train_box, spam_box, unsure_box)

    # add IMAP4 server
    f = OneParameterFactory()
    f.protocol = SpambayesIMAPServer
    f.parameter = user_account
    state.imap_port = options["imapserver", "port"]
    app.listenTCP(state.imap_port, f)

    # add POP3 proxy
    state.createWorkers()
    for (server, serverPort), proxyPort in zip(state.servers,
                                               state.proxyPorts):
        listener = MyBayesProxyListener(server, serverPort, proxyPort,
                                        spam_box, unsure_box)
        proxyListeners.append(listener)
    state.buildServerStrings()

    # add web interface
    httpServer = UserInterfaceServer(state.uiPort)
    serverUI = ServerUserInterface(state, _recreateState)
    httpServer.register(serverUI)

    return app

def run():
    # Read the arguments.
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hbd:D:u:')
    except getopt.error, msg:
        print >>sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit()

    launchUI = False
    for opt, arg in opts:
        if opt == '-h':
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == '-b':
            launchUI = True
        elif opt == '-d':   # dbm file
            state.useDB = True
            options["Storage", "persistent_storage_file"] = arg
        elif opt == '-D':   # pickle file
            state.useDB = False
            options["Storage", "persistent_storage_file"] = arg
        elif opt == '-u':
            state.uiPort = int(arg)

    # Let the user know what they are using...
    print get_version_string("IMAP Server")
    print "and engine %s.\n" % (get_version_string(),)

    # setup everything
    app = setup()

    # kick things off
    thread.start_new_thread(Dibbler.run, (launchUI,))
    app.run(save=False)

if __name__ == "__main__":
    run()

--- NEW FILE: sb_server.py ---
#!/usr/bin/env python

"""A POP3 proxy that works with classifier.py, and adds a simple
X-Spambayes-Classification header (ham/spam/unsure) to each incoming
email.  You point pop3proxy at your POP3 server, and configure your
email client to collect mail from the proxy then filter on the added
header.  Usage:

    pop3proxy.py [options] [<server> [<server port>]]
        <server> is the name of your real POP3 server
        <port>   is the port number of your real POP3 server, which
                 defaults to 110.

        options:
            -h      : Displays this help message.
            -d FILE : use the named DBM database file
            -D FILE : use the named Pickle database file
            -l port : proxy listens on this port number (default 110)
            -u port : User interface listens on this port number
                      (default 8880; Browse http://localhost:8880/)
            -b      : Launch a web browser showing the user interface.

        All command line arguments and switches take their default
        values from the [pop3proxy] and [html_ui] sections of
        bayescustomize.ini.

For safety, and to help debugging, the whole POP3 conversation is
written out to _pop3proxy.log for each run, if
options["globals", "verbose"] is True.

To make rebuilding the database easier, uploaded messages are appended
to _pop3proxyham.mbox and _pop3proxyspam.mbox.
"""

# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

__author__ = "Richie Hindle <richie at entrian.com>"
__credits__ = "Tim Peters, Neale Pickett, Tim Stone, all the Spambayes folk."

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0


todo = """

Web training interface:

User interface improvements:

 o Once the pieces are on separate pages, make the paste box bigger.
 o Deployment: Windows executable?  atlaxwin and ctypes?  Or just
   webbrowser?
 o Save the stats (num classified, etc.) between sessions.
 o "Reload database" button.


New features:

 o Online manual.
 o Links to project homepage, mailing list, etc.
 o List of words with stats (it would have to be paged!) a la SpamSieve.


Code quality:

 o Cope with the email client timing out and closing the connection.


Info:

 o Slightly-wordy index page; intro paragraph for each page.
 o In both stats and training results, report nham and nspam - warn if
   they're very different (for some value of 'very').
 o "Links" section (on homepage?) to project homepage, mailing list,
   etc.


Gimmicks:

 o Classify a web page given a URL.
 o Graphs.  Of something.  Who cares what?
 o NNTP proxy.
 o Zoe...!
"""

import os, sys, re, errno, getopt, time, traceback, socket, cStringIO
from thread import start_new_thread
from email.Header import Header

import spambayes.message
from spambayes import Dibbler
from spambayes import storage
from spambayes.FileCorpus import FileCorpus, ExpiryFileCorpus
from spambayes.FileCorpus import FileMessageFactory, GzipFileMessageFactory
from spambayes.Options import options
from spambayes.UserInterface import UserInterfaceServer
from spambayes.ProxyUI import ProxyUserInterface
from spambayes.Version import get_version_string

# Increase the stack size on MacOS X.  Stolen from Lib/test/regrtest.py
if sys.platform == 'darwin':
    try:
        import resource
    except ImportError:
        pass
    else:
        soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        newsoft = min(hard, max(soft, 1024*2048))
        resource.setrlimit(resource.RLIMIT_STACK, (newsoft, hard))

# number to add to STAT length for each msg to fudge for spambayes headers
HEADER_SIZE_FUDGE_FACTOR = 512

class ServerLineReader(Dibbler.BrighterAsyncChat):
    """An async socket that reads lines from a remote server and
    simply calls a callback with the data.  The BayesProxy object
    can't connect to the real POP3 server and talk to it
    synchronously, because that would block the process."""

    lineCallback = None

    def __init__(self, serverName, serverPort, lineCallback):
        Dibbler.BrighterAsyncChat.__init__(self)
        self.lineCallback = lineCallback
        self.request = ''
        self.set_terminator('\r\n')
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            self.connect((serverName, serverPort))
        except socket.error, e:
            error = "Can't connect to %s:%d: %s" % (serverName, serverPort, e)
            print >>sys.stderr, error
            self.lineCallback('-ERR %s\r\n' % error)
            self.lineCallback('')   # "The socket's been closed."
            self.close()

    def collect_incoming_data(self, data):
        self.request = self.request + data

    def found_terminator(self):
        self.lineCallback(self.request + '\r\n')
        self.request = ''

    def handle_close(self):
        self.lineCallback('')
        self.close()


class POP3ProxyBase(Dibbler.BrighterAsyncChat):
    """An async dispatcher that understands POP3 and proxies to a POP3
    server, calling `self.onTransaction(command, args, response)` for each
    transaction. Responses are not un-byte-stuffed before reaching
    self.onTransaction() (they probably should be for a totally generic
    POP3ProxyBase class, but BayesProxy doesn't need it and it would
    mean re-stuffing them afterwards).  self.onTransaction() should
    return the response to pass back to the email client - the response
    can be the verbatim response or a processed version of it.  The
    special command 'KILL' kills it (passing a 'QUIT' command to the
    server).
    """

    def __init__(self, clientSocket, serverName, serverPort):
        Dibbler.BrighterAsyncChat.__init__(self, clientSocket)
        self.request = ''
        self.response = ''
        self.set_terminator('\r\n')
        self.command = ''           # The POP3 command being processed...
        self.args = []              # ...and its arguments
        self.isClosing = False      # Has the server closed the socket?
        self.seenAllHeaders = False # For the current RETR or TOP
        self.startTime = 0          # (ditto)
        self.serverSocket = ServerLineReader(serverName, serverPort,
                                             self.onServerLine)

    def onTransaction(self, command, args, response):
        """Override this.  Takes the raw request and the response, and
        returns the (possibly processed) response to pass back to the
        email client.
        """
        raise NotImplementedError

    def onServerLine(self, line):
        """A line of response has been received from the POP3 server."""
        isFirstLine = not self.response
        self.response = self.response + line

        # Is this the line that terminates a set of headers?
        self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']

        # Has the server closed its end of the socket?
        if not line:
            self.isClosing = True

        # If we're not processing a command, just echo the response.
        if not self.command:
            self.push(self.response)
            self.response = ''

        # Time out after 30 seconds for message-retrieval commands if
        # all the headers are down.  The rest of the message will proxy
        # straight through.
        if self.command in ['TOP', 'RETR'] and \
           self.seenAllHeaders and time.time() > self.startTime + 30:
            self.onResponse()
            self.response = ''
        # If that's a complete response, handle it.
        elif not self.isMultiline() or line == '.\r\n' or \
           (isFirstLine and line.startswith('-ERR')):
            self.onResponse()
            self.response = ''

    def isMultiline(self):
        """Returns True if the request should get a multiline
        response (assuming the response is positive).
        """
        if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
                            'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
            return False
        elif self.command in ['RETR', 'TOP', 'CAPA']:
            return True
        elif self.command in ['LIST', 'UIDL']:
            return len(self.args) == 0
        else:
            # Assume that an unknown command will get a single-line
            # response.  This should work for errors and for POP-AUTH,
            # and is harmless even for multiline responses - the first
            # line will be passed to onTransaction and ignored, then the
            # rest will be proxied straight through.
            return False

    def collect_incoming_data(self, data):
        """Asynchat override."""
        self.request = self.request + data

    def found_terminator(self):
        """Asynchat override."""
        verb = self.request.strip().upper()
        if verb == 'KILL':
            self.socket.shutdown(2)
            self.close()
            raise SystemExit
        elif verb == 'CRASH':
            # For testing
            x = 0
            y = 1/x

        self.serverSocket.push(self.request + '\r\n')
        if self.request.strip() == '':
            # Someone just hit the Enter key.
            self.command = ''
            self.args = []
        else:
            # A proper command.
            splitCommand = self.request.strip().split()
            self.command = splitCommand[0].upper()
            self.args = splitCommand[1:]
            self.startTime = time.time()

        self.request = ''

    def onResponse(self):
        # We don't support pipelining, so if the command is CAPA and the
        # response includes PIPELINING, hack out that line of the response.
        if self.command == 'CAPA':
            pipelineRE = r'(?im)^PIPELINING[^\n]*\n'
            self.response = re.sub(pipelineRE, '', self.response)

        # Pass the request and the raw response to the subclass and
        # send back the cooked response.
        if self.response:
            cooked = self.onTransaction(self.command, self.args, self.response)
            self.push(cooked)

        # If onServerLine() decided that the server has closed its
        # socket, close this one when the response has been sent.
        if self.isClosing:
            self.close_when_done()

        # Reset.
        self.command = ''
        self.args = []
        self.isClosing = False
        self.seenAllHeaders = False
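
# A minimal sketch of the CAPA filtering done in onResponse() above: the
# proxy does not support pipelining, so the PIPELINING line is removed
# from the capability list before it reaches the client.  The helper
# name `strip_pipelining` and the sample response are assumptions for
# illustration; the regular expression is the one used above.

```python
import re

PIPELINE_RE = r'(?im)^PIPELINING[^\n]*\n'


def strip_pipelining(response):
    """Remove any PIPELINING capability line from a CAPA response."""
    return re.sub(PIPELINE_RE, '', response)


if __name__ == "__main__":
    capa = ("+OK Capability list follows\r\n"
            "TOP\r\nPIPELINING\r\nUIDL\r\n.\r\n")
    print(repr(strip_pipelining(capa)))
```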


class BayesProxyListener(Dibbler.Listener):
    """Listens for incoming email client connections and spins off
    BayesProxy objects to serve them.
    """

    def __init__(self, serverName, serverPort, proxyPort):
        proxyArgs = (serverName, serverPort)
        Dibbler.Listener.__init__(self, proxyPort, BayesProxy, proxyArgs)
        print 'Listener on port %s is proxying %s:%d' % \
               (_addressPortStr(proxyPort), serverName, serverPort)


class BayesProxy(POP3ProxyBase):
    """Proxies between an email client and a POP3 server, inserting
    judgement headers.  It acts on the following POP3 commands:

     o STAT:
        o Adds the size of all the judgement headers to the maildrop
          size.

     o LIST:
        o With no message number: adds the size of a judgement header
          to the message size for each message in the scan listing.
        o With a message number: adds the size of a judgement header
          to the message size.

     o RETR:
        o Adds the judgement header based on the raw headers and body
          of the message.

     o TOP:
        o Adds the judgement header based on the raw headers and as
          much of the body as the TOP command retrieves.  This can
          mean that the header might have a different value for
          different calls to TOP, or for calls to TOP vs. calls to
          RETR.  I'm assuming that the email client will either not
          make multiple calls, or will cope with the headers being
          different.

     o USER:
        o Does no processing based on the USER command itself, but
          expires any old messages in the three caches.
    """

    def __init__(self, clientSocket, serverName, serverPort):
        POP3ProxyBase.__init__(self, clientSocket, serverName, serverPort)
        self.handlers = {'STAT': self.onStat, 'LIST': self.onList,
                         'RETR': self.onRetr, 'TOP': self.onTop,
                         'USER': self.onUser}
        state.totalSessions += 1
        state.activeSessions += 1
        self.isClosed = False

    def send(self, data):
        """Logs the data to the log file."""
        if options["globals", "verbose"]:
            state.logFile.write(data)
            state.logFile.flush()
        try:
            return POP3ProxyBase.send(self, data)
        except socket.error:
            # The email client has closed the connection - 40tude Dialog
            # does this immediately after issuing a QUIT command,
            # without waiting for the response.
            self.close()

    def recv(self, size):
        """Logs the data to the log file."""
        data = POP3ProxyBase.recv(self, size)
        if options["globals", "verbose"]:
            state.logFile.write(data)
            state.logFile.flush()
        return data

    def close(self):
        # This can be called multiple times by asyncore.
        if not self.isClosed:
            self.isClosed = True
            state.activeSessions -= 1
            POP3ProxyBase.close(self)

    def onTransaction(self, command, args, response):
        """Takes the raw request and response, and returns the
        (possibly processed) response to pass back to the email client.
        """
        handler = self.handlers.get(command, self.onUnknown)
        return handler(command, args, response)

    def onStat(self, command, args, response):
        """Adds the size of all the judgement headers to the maildrop
        size."""
        match = re.search(r'^\+OK\s+(\d+)\s+(\d+)(.*)\r\n', response)
        if match:
            count = int(match.group(1))
            size = int(match.group(2)) + HEADER_SIZE_FUDGE_FACTOR * count
            return '+OK %d %d%s\r\n' % (count, size, match.group(3))
        else:
            return response

    def onList(self, command, args, response):
        """Adds the size of a judgement header to the message
        size(s)."""
        if response.count('\r\n') > 1:
            # Multiline: all lines but the first contain a message size.
            lines = response.split('\r\n')
            outputLines = [lines[0]]
            for line in lines[1:]:
                match = re.search(r'^(\d+)\s+(\d+)', line)
                if match:
                    number = int(match.group(1))
                    size = int(match.group(2)) + HEADER_SIZE_FUDGE_FACTOR
                    line = "%d %d" % (number, size)
                outputLines.append(line)
            return '\r\n'.join(outputLines)
        else:
            # Single line.
            match = re.search(r'^\+OK\s+(\d+)\s+(\d+)(.*)\r\n', response)
            if match:
                messageNumber = match.group(1)
                size = int(match.group(2)) + HEADER_SIZE_FUDGE_FACTOR
                trailer = match.group(3)
                return "+OK %s %s%s\r\n" % (messageNumber, size, trailer)
            else:
                return response
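
# A worked example of the size adjustment made by onStat() above, as a
# standalone function: each message is assumed to grow by
# HEADER_SIZE_FUDGE_FACTOR octets, so a maildrop of 2 messages and 320
# octets is reported as 320 + 2 * 512 = 1344 octets.  The name
# `adjust_stat` is hypothetical; the regular expression and the
# arithmetic follow onStat().

```python
import re

HEADER_SIZE_FUDGE_FACTOR = 512


def adjust_stat(response):
    """Inflate the maildrop size in a POP3 STAT response."""
    match = re.search(r'^\+OK\s+(\d+)\s+(\d+)(.*)\r\n', response)
    if not match:
        # Error or unexpected response: pass it through untouched.
        return response
    count = int(match.group(1))
    size = int(match.group(2)) + HEADER_SIZE_FUDGE_FACTOR * count
    return '+OK %d %d%s\r\n' % (count, size, match.group(3))


if __name__ == "__main__":
    print(repr(adjust_stat('+OK 2 320\r\n')))
```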

    def onRetr(self, command, args, response):
        """Adds the judgement header based on the raw headers and body
        of the message."""
        # Use '\n\r?\n' to detect the end of the headers in case of
        # broken emails that don't use the proper line separators.
        if re.search(r'\n\r?\n', response):
            # Remove the trailing .\r\n before passing to the email parser.
            # Thanks to Scott Schlesier for this fix.
            terminatingDotPresent = (response[-4:] == '\n.\r\n')
            if terminatingDotPresent:
                response = response[:-3]

            # Break off the first line, which will be '+OK'.
            ok, messageText = response.split('\n', 1)

            try:
                msg = spambayes.message.SBHeaderMessage()
                msg.setPayload(messageText)
                msg.setId(state.getNewMessageName())
                # Now find the spam disposition and add the header.
                (prob, clues) = state.bayes.spamprob(msg.asTokens(),\
                                 evidence=True)

                msg.addSBHeaders(prob, clues)

                # Check for "RETR" or "TOP N 99999999" - fetchmail without
                # the 'fetchall' option uses the latter to retrieve messages.
                if (command == 'RETR' or
                    (command == 'TOP' and
                     len(args) == 2 and args[1] == '99999999')):
                    cls = msg.GetClassification()
                    if cls == options["Headers", "header_ham_string"]:
                        state.numHams += 1
                    elif cls == options["Headers", "header_spam_string"]:
                        state.numSpams += 1
                    else:
                        state.numUnsure += 1

                    # Suppress caching of "Precedence: bulk" or
                    # "Precedence: list" ham if the options say so.
                    isSuppressedBulkHam = \
                        (cls == options["Headers", "header_ham_string"] and
                         options["pop3proxy", "no_cache_bulk_ham"] and
                         msg.get('precedence') in ['bulk', 'list'])

                    # Suppress large messages if the options say so.
                    size_limit = options["pop3proxy",
                                         "no_cache_large_messages"]
                    isTooBig = size_limit > 0 and \
                               len(messageText) > size_limit

                    # Cache the message.  Don't pollute the cache with test
                    # messages or suppressed bulk ham.
                    if (not state.isTest and
                        options["pop3proxy", "cache_messages"] and
                        not isSuppressedBulkHam and not isTooBig):
                        # Write the message into the Unknown cache.
                        message = state.unknownCorpus.makeMessage(msg.getId())
                        message.setSubstance(msg.as_string())
                        state.unknownCorpus.addMessage(message)

                # We'll return the message with the headers added.  We take
                # all the headers from the SBHeaderMessage, but take the body
                # directly from the POP3 conversation, because the
                # SBHeaderMessage might have "fixed" a partial message by
                # appending a closing boundary separator.  Remember we can
                # be dealing with partial message here because of the timeout
                # code in onServerLine.
                headers = []
                for name, value in msg.items():
                    header = "%s: %s" % (name, value)
                    headers.append(re.sub(r'\r?\n', '\r\n', header))
                body = re.split(r'\n\r?\n', messageText, 1)[1]
                messageText = "\r\n".join(headers) + "\r\n\r\n" + body

            except:
                # Something nasty happened while parsing or classifying -
                # report the exception in a hand-appended header and recover.
                # This is one case where an unqualified 'except' is OK, 'cos
                # anything's better than destroying people's email...
                stream = cStringIO.StringIO()
                traceback.print_exc(None, stream)
                details = stream.getvalue()
                
                # Build the header.  This will strip leading whitespace from
                # the lines, so we add a leading dot to maintain indentation.
                detailLines = details.strip().split('\n')
                dottedDetails = '\n.'.join(detailLines)
                headerName = 'X-Spambayes-Exception'
                header = Header(dottedDetails, header_name=headerName)
                
                # Insert the header, converting email.Header's '\n' line
                # breaks to POP3's '\r\n'.
                headers, body = re.split(r'\n\r?\n', messageText, 1)
                header = re.sub(r'\r?\n', '\r\n', str(header))
                headers += "\n%s: %s\r\n\r\n" % (headerName, header)
                messageText = headers + body

                # Print the exception and a traceback.
                print >>sys.stderr, details

            # Restore the +OK and the POP3 .\r\n terminator if there was one.
            retval = ok + "\n" + messageText
            if terminatingDotPresent:
                retval += '.\r\n'
            return retval

        else:
            # Must be an error response.
            return response

    def onTop(self, command, args, response):
        """Adds the judgement header based on the raw headers and as
        much of the body as the TOP command retrieves."""
        # Easy (but see the caveat in BayesProxy.__doc__).
        return self.onRetr(command, args, response)

    def onUser(self, command, args, response):
        """Spins off three separate threads that expire any old messages
        in the three caches, but does not do any processing of the USER
        command itself."""
        start_new_thread(state.spamCorpus.removeExpiredMessages, ())
        start_new_thread(state.hamCorpus.removeExpiredMessages, ())
        start_new_thread(state.unknownCorpus.removeExpiredMessages, ())
        return response

    def onUnknown(self, command, args, response):
        """Default handler; returns the server's response verbatim."""
        return response


# This keeps the global state of the module - the command-line options,
# statistics like how many mails have been classified, the handle of the
# log file, the Classifier and FileCorpus objects, and so on.
class State:
    def __init__(self):
        """Initialises the State object that holds the state of the app.
        The default settings are read from Options.py and bayescustomize.ini
        and are then overridden by the command-line processing code in the
        __main__ code below."""
        # Open the log file.
        if options["globals", "verbose"]:
            self.logFile = open('_pop3proxy.log', 'wb', 0)

        self.servers = []
        self.proxyPorts = []
        if options["pop3proxy", "remote_servers"]:
            for server in options["pop3proxy", "remote_servers"]:
                server = server.strip()
                if server.find(':') > -1:
                    server, port = server.split(':', 1)
                else:
                    port = '110'
                self.servers.append((server, int(port)))

        if options["pop3proxy", "listen_ports"]:
            splitPorts = options["pop3proxy", "listen_ports"]
            self.proxyPorts = map(_addressAndPort, splitPorts)

        if len(self.servers) != len(self.proxyPorts):
            print "remote_servers & listen_ports are different lengths!"
            sys.exit()

        # Load up the other settings from Options.py / bayescustomize.ini
        self.useDB = options["Storage", "persistent_use_database"]
        self.uiPort = options["html_ui", "port"]
        self.launchUI = options["html_ui", "launch_browser"]
        self.gzipCache = options["pop3proxy", "cache_use_gzip"]
        self.cacheExpiryDays = options["pop3proxy", "cache_expiry_days"]
        self.runTestServer = False
        self.isTest = False

        # Set up the statistics.
        self.totalSessions = 0
        self.activeSessions = 0
        self.numSpams = 0
        self.numHams = 0
        self.numUnsure = 0

        # Unique names for cached messages - see `getNewMessageName()` below.
        self.lastBaseMessageName = ''
        self.uniquifier = 2

    def buildServerStrings(self):
        """After the server details have been set up, this creates string
        versions of the details, for display in the Status panel."""
        serverStrings = ["%s:%s" % (s, p) for s, p in self.servers]
        self.serversString = ', '.join(serverStrings)
        self.proxyPortsString = ', '.join(map(_addressPortStr, self.proxyPorts))

    def createWorkers(self):
        """Using the options that were initialised in __init__ and then
        possibly overridden by the driver code, create the Bayes object,
        the Corpuses, the Trainers and so on."""
        print "Loading database...",
        if self.isTest:
            self.useDB = True
            options["Storage", "persistent_storage_file"] = \
                        '_pop3proxy_test.pickle'   # This is never saved.
        filename = options["Storage", "persistent_storage_file"]
        filename = os.path.expanduser(filename)
        self.bayes = storage.open_storage(filename, self.useDB)

        # Don't set up the caches and training objects when running the self-test,
        # so as not to clutter the filesystem.
        if not self.isTest:
            def ensureDir(dirname):
                try:
                    os.mkdir(dirname)
                except OSError, e:
                    if e.errno != errno.EEXIST:
                        raise

            # Create/open the Corpuses.  Use small cache sizes to avoid hogging
            # lots of memory.
            map(ensureDir, [options["pop3proxy", "spam_cache"],
                            options["pop3proxy", "ham_cache"],
                            options["pop3proxy", "unknown_cache"]])
            if self.gzipCache:
                factory = GzipFileMessageFactory()
            else:
                factory = FileMessageFactory()
            age = options["pop3proxy", "cache_expiry_days"]*24*60*60
            self.spamCorpus = ExpiryFileCorpus(age, factory,
                                               options["pop3proxy",
                                                       "spam_cache"],
                                               '[0123456789\-]*',
                                               cacheSize=20)
            self.hamCorpus = ExpiryFileCorpus(age, factory,
                                              options["pop3proxy",
                                                      "ham_cache"],
                                              '[0123456789\-]*',
                                              cacheSize=20)
            self.unknownCorpus = ExpiryFileCorpus(age, factory,
                                                  options["pop3proxy",
                                                          "unknown_cache"],
                                                  '[0123456789\-]*',
                                                  cacheSize=20)

            # Given that (hopefully) users will get to the stage
            # where they do not need to do any more regular training to
            # be satisfied with spambayes' performance, we expire old
            # messages from not only the trained corpora, but the unknown
            # as well.
            self.spamCorpus.removeExpiredMessages()
            self.hamCorpus.removeExpiredMessages()
            self.unknownCorpus.removeExpiredMessages()

            # Create the Trainers.
            self.spamTrainer = storage.SpamTrainer(self.bayes)
            self.hamTrainer = storage.HamTrainer(self.bayes)
            self.spamCorpus.addObserver(self.spamTrainer)
            self.hamCorpus.addObserver(self.hamTrainer)

    def getNewMessageName(self):
        # The message name is the time it arrived, with a uniquifier
        # appended if two arrive within one clock tick of each other.
        messageName = "%10.10d" % long(time.time())
        if messageName == self.lastBaseMessageName:
            messageName = "%s-%d" % (messageName, self.uniquifier)
            self.uniquifier += 1
        else:
            self.lastBaseMessageName = messageName
            self.uniquifier = 2
        return messageName


# Option-parsing helper functions
def _addressAndPort(s):
    """Decode a string representing a port to bind to, with optional address."""
    s = s.strip()
    if ':' in s:
        addr, port = s.split(':')
        return addr, int(port)
    else:
        return '', int(s)

def _addressPortStr((addr, port)):
    """Encode a string representing a port to bind to, with optional address."""
    if not addr:
        return str(port)
    else:
        return '%s:%d' % (addr, port)


state = State()
proxyListeners = []
def _createProxies(servers, proxyPorts):
    """Create BayesProxyListeners for all the given servers."""
    for (server, serverPort), proxyPort in zip(servers, proxyPorts):
        listener = BayesProxyListener(server, serverPort, proxyPort)
        proxyListeners.append(listener)

def _recreateState():
    global state
    state = State()

    # Close the existing listeners and create new ones.  This won't
    # affect any running proxies - once a listener has created a proxy,
    # that proxy is then independent of it.
    for proxy in proxyListeners:
        proxy.close()
    del proxyListeners[:]

    prepare(state)
    _createProxies(state.servers, state.proxyPorts)
    
    return state

def main(servers, proxyPorts, uiPort, launchUI):
    """Runs the proxy forever or until a 'KILL' command is received or
    someone hits Ctrl+Break."""
    _createProxies(servers, proxyPorts)
    httpServer = UserInterfaceServer(uiPort)
    proxyUI = ProxyUserInterface(state, _recreateState)
    httpServer.register(proxyUI)
    Dibbler.run(launchBrowser=launchUI)

def prepare(state):
    # Do whatever we've been asked to do...
    state.createWorkers()

    # Launch any SMTP proxies.  Note that if the user hasn't specified any
    # SMTP proxy information in their configuration, then nothing will
    # happen.
    import sb_smtpproxy
    servers, proxyPorts = sb_smtpproxy.LoadServerInfo()
    proxyListeners.extend(sb_smtpproxy.CreateProxies(servers, proxyPorts,
                                                     state))

    # setup info for the web interface
    state.buildServerStrings()

def start(state):
    # kick everything off    
    main(state.servers, state.proxyPorts, state.uiPort, state.launchUI)

def stop(state):
    # Shutdown as though through the web UI.  This will save the DB, allow
    # any open proxy connections to complete, etc.
    from urllib2 import urlopen
    from urllib import urlencode
    urlopen('http://localhost:%d/save' % state.uiPort,
            urlencode({'how': 'Save & shutdown'})).read()


# ===================================================================
# __main__ driver.
# ===================================================================

def run():
    # Read the arguments.
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hbpsd:D:l:u:')
    except getopt.error, msg:
        print >>sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit()

    runSelfTest = False
    for opt, arg in opts:
        if opt == '-h':
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == '-b':
            state.launchUI = True
        elif opt == '-d':   # dbm file
            state.useDB = True
            options["Storage", "persistent_storage_file"] = arg
        elif opt == '-D':   # pickle file
            state.useDB = False
            options["Storage", "persistent_storage_file"] = arg
        elif opt == '-p':   # dead option
            print >>sys.stderr, "-p option is no longer supported, use -D\n"
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == '-l':
            state.proxyPorts = [_addressAndPort(arg)]
        elif opt == '-u':
            state.uiPort = int(arg)

    # Let the user know what they are using...
    print get_version_string("POP3 Proxy")
    print "and engine %s.\n" % (get_version_string(),)

    prepare(state=state)

    if 0 <= len(args) <= 2:
        # Normal usage, with optional server name and port number.
        if len(args) == 1:
            state.servers = [(args[0], 110)]
        elif len(args) == 2:
            state.servers = [(args[0], int(args[1]))]
        
        # Default to listening on port 110 for command-line-specified servers.
        if len(args) > 0 and state.proxyPorts == []:
            state.proxyPorts = [('', 110)]

        start(state=state)

    else:
        print >>sys.stderr, __doc__

if __name__ == '__main__':
    run()

--- NEW FILE: sb_smtpproxy.py ---
#!/usr/bin/env python

"""A SMTP proxy to train a Spambayes database.

You point SMTP Proxy at your SMTP server(s) and configure your email
client(s) to send mail through the proxy (i.e. usually this means you use
localhost as the outgoing server).

To set up, enter appropriate values in your Spambayes configuration file in
the "SMTP Proxy" section (in particular: "remote_servers", "listen_ports",
and "use_cached_message").  This configuration can also be carried out via
the web user interface offered by POP3 Proxy and IMAP Filter.

To use, simply forward/bounce mail that you wish to train to the
appropriate address (defaults to spambayes_spam at localhost and
spambayes_ham at localhost).  All other mail is sent normally.
(Note that IMAP Filter and POP3 Proxy users should not execute this script;
launching of SMTP Proxy will be taken care of by those applications).

There are two main forms of operation.  With both, mail to two
(user-configurable) email addresses is intercepted by the proxy (and is
*not* sent to the SMTP server) and used as training data for a Spambayes
database.  All other mail is simply relayed to the SMTP server.

If the "use_cached_message" option is False, the proxy uses the message
sent as training data.  This option is suitable for those not using
POP3 Proxy or IMAP Filter, or for those that are confident that their
mailer will forward/bounce messages in an unaltered form.

If the "use_cached_message" option is True, the proxy examines the message
for a unique spambayes identification number.  It then tries to find this
message in the pop3proxy caches and on the imap servers.  It then retrieves
the message from the cache/server and uses *this* as the training data.
This method is suitable for those using POP3 Proxy and/or IMAP Filter, and
avoids any potential problems with the mailer altering messages before
forwarding/bouncing them.

Usage:
    smtpproxy [options]

	note: option values with spaces must be enclosed in double quotes

        options:
            -d  dbname  : pickled training database filename
            -D  dbname  : dbm training database filename
            -h          : help
            -v          : verbose mode
"""

# This module is part of the spambayes project, which is Copyright 2002-3
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

__author__ = "Tony Meyer <ta-meyer at ihug.co.nz>"
__credits__ = "Tim Stone, all the Spambayes folk."

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0


todo = """
 o It would be nice if spam/ham could be bulk forwarded to the proxy,
   rather than one by one.  This would require separating the different
   messages and extracting the correct ids.  Simply changing to find
   *all* the ids in a message, rather than stopping after one *might*
   work, but I don't really know.  Richie Hindle suggested something along
   these lines back in September '02.
   
 o Suggestions?

Testing:

 o Test with as many clients as possible to check that the
   id is correctly extracted from the forwarded/bounced message.

MUA information:
A '*' in the Header column signifies that the smtpproxy can extract
the id from the headers only.  A '*' in the Body column signifies that
the smtpproxy can extract the id from the body of the message, if it
is there.
                                                        Header Body
*** Windows 2000 MUAs ***
Eudora 5.2 Forward                                         *     *
Eudora 5.2 Redirect                                              *
Netscape Messenger (4.7) Forward (inline)                  *     *
Netscape Messenger (4.7) Forward (quoted) Plain                  *
Netscape Messenger (4.7) Forward (quoted) HTML                   *
Netscape Messenger (4.7) Forward (quoted) Plain & HTML           *
Netscape Messenger (4.7) Forward (attachment) Plain        *     *
Netscape Messenger (4.7) Forward (attachment) HTML         *     *
Netscape Messenger (4.7) Forward (attachment) Plain & HTML *     *
Outlook Express 6 Forward HTML (Base64)                          *
Outlook Express 6 Forward HTML (None)                            *
Outlook Express 6 Forward HTML (QP)                              *
Outlook Express 6 Forward Plain (Base64)                         *
Outlook Express 6 Forward Plain (None)                           *
Outlook Express 6 Forward Plain (QP)                             *
Outlook Express 6 Forward Plain (uuencoded)                      *
http://www.endymion.com/products/mailman Forward                 *
M2 (Opera Mailer 7.01) Forward                                   *
M2 (Opera Mailer 7.01) Redirect                            *     *
The Bat! 1.62i Forward (RFC Headers not visible)                 *
The Bat! 1.62i Forward (RFC Headers visible)               *     *
The Bat! 1.62i Redirect                                          *
The Bat! 1.62i Alternative Forward                         *     *
The Bat! 1.62i Custom Template                             *     *
AllegroMail 2.5.0.2 Forward                                      *
AllegroMail 2.5.0.2 Redirect                                     *
PocoMail 2.6.3 Bounce                                            *
PocoMail 2.6.3 Bounce                                            *
Pegasus Mail 4.02 Forward (all headers option set)         *     *
Pegasus Mail 4.02 Forward (all headers option not set)           *
Calypso 3 Forward                                                *
Calypso 3 Redirect                                         *     *
Becky! 2.05.10 Forward                                           *
Becky! 2.05.10 Redirect                                          *
Becky! 2.05.10 Redirect as attachment                      *     *
Mozilla Mail 1.2.1 Forward (attachment)                    *     *
Mozilla Mail 1.2.1 Forward (inline, plain)                 *1    *
Mozilla Mail 1.2.1 Forward (inline, plain & html)          *1    *
Mozilla Mail 1.2.1 Forward (inline, html)                  *1    *

*1 The header method will only work if auto-include original message
is set, and if view all headers is true.
"""

import string
import re
import socket
import asyncore
import asynchat
import getopt
import sys
import os
# Needed by SMTPTrainer.extractSpambayesID().
from email import message_from_string

from spambayes import Dibbler
from spambayes import storage
from spambayes.message import sbheadermessage_from_string
from spambayes.tokenizer import textparts
from spambayes.tokenizer import try_to_repair_damaged_base64
from spambayes.Options import options
from sb_server import _addressPortStr, ServerLineReader
from sb_server import _addressAndPort

class SMTPProxyBase(Dibbler.BrighterAsyncChat):
    """An async dispatcher that understands SMTP and proxies to a SMTP
    server, calling `self.onTransaction(command, args)` for each
    transaction.

    self.onTransaction() should return the command to pass to
    the proxied server - the command can be the verbatim command or a
    processed version of it.  The special command 'KILL' kills it (passing
    a 'QUIT' command to the server).
    """

    def __init__(self, clientSocket, serverName, serverPort):
        Dibbler.BrighterAsyncChat.__init__(self, clientSocket)
        self.request = ''
        self.set_terminator('\r\n')
        self.command = ''           # The SMTP command being processed...
        self.args = ''              # ...and its arguments
        self.isClosing = False      # Has the server closed the socket?
        self.inData = False
        self.data = ""
        self.blockData = False
        self.serverSocket = ServerLineReader(serverName, serverPort,
                                             self.onServerLine)

    def onTransaction(self, command, args):
        """Override this.  Takes the raw command and returns the (possibly
        processed) command to pass to the SMTP server."""
        raise NotImplementedError

    def onProcessData(self, data):
        """Override this.  Takes the raw data and returns the (possibly
        processed) data to pass on to the SMTP server."""
        raise NotImplementedError

    def onServerLine(self, line):
        """A line of response has been received from the SMTP server."""
        # Has the server closed its end of the socket?
        if not line:
            self.isClosing = True

        # We don't process the return, just echo the response.
        self.push(line)
        self.onResponse()

    def collect_incoming_data(self, data):
        """Asynchat override."""
        self.request = self.request + data

    def found_terminator(self):
        """Asynchat override."""
        verb = self.request.strip().upper()
        if verb == 'KILL':
            self.socket.shutdown(2)
            self.close()
            raise SystemExit

        if self.request.strip() == '':
            # Someone just hit the Enter key.
            self.command = self.args = ''
        else:
            # A proper command.
            if self.request[:10].upper() == "MAIL FROM:":
                splitCommand = self.request.split(":", 1)
            elif self.request[:8].upper() == "RCPT TO:":
                splitCommand = self.request.split(":", 1)
            else:
                splitCommand = self.request.strip().split(None, 1)
            self.command = splitCommand[0]
            self.args = splitCommand[1:]

        if self.inData == True:
            self.data += self.request + '\r\n'
            if self.request == ".":
                self.inData = False
                cooked = self.onProcessData(self.data)
                self.data = ""
                if self.blockData == False:
                    self.serverSocket.push(cooked)
                else:
                    self.push("250 OK\r\n")
        else:
            cooked = self.onTransaction(self.command, self.args)
            if cooked is not None:
                self.serverSocket.push(cooked + '\r\n')
        self.command = self.args = self.request = ''

    def onResponse(self):
        # If onServerLine() decided that the server has closed its
        # socket, close this one when the response has been sent.
        if self.isClosing:
            self.close_when_done()

        # Reset.
        self.command = ''
        self.args = ''
        self.isClosing = False


class BayesSMTPProxyListener(Dibbler.Listener):
    """Listens for incoming email client connections and spins off
    BayesSMTPProxy objects to serve them."""

    def __init__(self, serverName, serverPort, proxyPort, trainer):
        proxyArgs = (serverName, serverPort, trainer)
        Dibbler.Listener.__init__(self, proxyPort, BayesSMTPProxy,
                                  proxyArgs)
        print 'SMTP Listener on port %s is proxying %s:%d' % \
               (_addressPortStr(proxyPort), serverName, serverPort)


class BayesSMTPProxy(SMTPProxyBase):
    """Proxies between an email client and a SMTP server, intercepting
    mail addressed to the ham/spam training addresses.  It acts on the
    following SMTP commands:

    o RCPT TO:
        o Checks if the recipient address matches the key ham or spam
          addresses, and if so notes this and does not forward a command to
          the proxied server.  In all other cases simply passes on the
          verbatim command.

    o DATA:
        o Notes that we are in the data section.  If (from the RCPT TO
          information) we are receiving a ham/spam message to train on,
          then do not forward the command on.  Otherwise forward verbatim.

    Any other commands are merely passed on verbatim to the server.
    """

    def __init__(self, clientSocket, serverName, serverPort, trainer):
        SMTPProxyBase.__init__(self, clientSocket, serverName, serverPort)
        self.handlers = {'RCPT TO': self.onRcptTo, 'DATA': self.onData,
                         'MAIL FROM': self.onMailFrom}
        self.trainer = trainer
        self.isClosed = False
        self.train_as_ham = False
        self.train_as_spam = False

    def send(self, data):
        try:
            return SMTPProxyBase.send(self, data)
        except socket.error:
            # The email client has closed the connection - 40tude Dialog
            # does this immediately after issuing a QUIT command,
            # without waiting for the response.
            self.close()

    def close(self):
        # This can be called multiple times by async.
        if not self.isClosed:
            self.isClosed = True
            SMTPProxyBase.close(self)

    def stripAddress(self, address):
        """
        Strip the leading & trailing <> from an address.  Handy for
        getting FROM: addresses.
        """
        if '<' in address:
            start = string.index(address, '<') + 1
            end = string.index(address, '>')
            return address[start:end]
        else:
            return address

    def splitTo(self, address):
        """Return 'address' as undressed (host, fulladdress) tuple.
        Handy for use with TO: addresses."""
        start = string.index(address, '<') + 1
        sep = string.index(address, '@') + 1
        end = string.index(address, '>')
        return (address[sep:end], address[start:end],)

    def onTransaction(self, command, args):
        handler = self.handlers.get(command.upper(), self.onUnknown)
        return handler(command, args)

    def onProcessData(self, data):
        if self.train_as_spam:
            self.trainer.train(data, True)
            self.train_as_spam = False
            return ""
        elif self.train_as_ham:
            self.trainer.train(data, False)
            self.train_as_ham = False
            return ""
        return data

    def onRcptTo(self, command, args):
        toHost, toFull = self.splitTo(args[0])
        if toFull == options["smtpproxy", "spam_address"]:
            self.train_as_spam = True
            self.train_as_ham = False
            self.blockData = True
            self.push("250 OK\r\n")
            return None
        elif toFull == options["smtpproxy", "ham_address"]:
            self.train_as_ham = True
            self.train_as_spam = False
            self.blockData = True
            self.push("250 OK\r\n")
            return None
        else:
            self.blockData = False
        return "%s:%s" % (command, ' '.join(args))
        
    def onData(self, command, args):
        self.inData = True
        if self.train_as_ham == True or self.train_as_spam == True:
            self.push("250 OK\r\n")
            return None
        rv = command
        for arg in args:
            rv += ' ' + arg
        return rv

    def onMailFrom(self, command, args):
        """Just like the default handler, but has the necessary colon."""
        rv = "%s:%s" % (command, ' '.join(args))
        return rv

    def onUnknown(self, command, args):
        """Default handler."""
        return self.request


class SMTPTrainer(object):
    def __init__(self, classifier, state=None, imap=None):
        self.classifier = classifier
        self.state = state
        self.imap = imap
    
    def extractSpambayesID(self, data):
        msg = message_from_string(data)

        # The nicest MUA is one that forwards the header intact.
        id = msg.get(options["Headers", "mailid_header_name"])
        if id is not None:
            return id

        # Some MUAs will put it in the body somewhere, while others will
        # put it in an attached MIME message.
        id = self._find_id_in_text(msg.as_string())
        if id is not None:
            return id

        # the message might be encoded
        for part in textparts(msg):
            # Decode, or take it as-is if decoding fails.
            try:
                text = part.get_payload(decode=True)
            except:
                text = part.get_payload(decode=False)
                if text is not None:
                    text = try_to_repair_damaged_base64(text)
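            if text is not None:
                id = self._find_id_in_text(text)
                # Keep looking through the remaining parts if this one
                # does not contain the id.
                if id is not None:
                    return id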
        return None

    header_pattern = re.escape(options["Headers", "mailid_header_name"])
    # A MUA might enclose the id in a table, thus the convoluted re pattern
    # (Mozilla Mail does this with inline html)
    header_pattern += r":\s*(\</th\>\s*\<td\>\s*)?([\d\-]+)"
    header_re = re.compile(header_pattern)

    def _find_id_in_text(self, text):
        mo = self.header_re.search(text)
        if mo is None:
            return None
        return mo.group(2)

    def train(self, msg, isSpam):
        try:
            use_cached = options["smtpproxy", "use_cached_message"]
        except KeyError:
            use_cached = True
        if use_cached:
            id = self.extractSpambayesID(msg)
            if id is None:
                print "Could not extract id"
                return
            self.train_cached_message(id, isSpam)
            return
        # Otherwise, train on the forwarded/bounced message.
        msg = sbheadermessage_from_string(msg)
        id = msg.setIdFromPayload()
        msg.delSBHeaders()
        if id is None:
            # No id, so we don't have any reliable method of remembering
            # information about this message, so we just assume that it
            # hasn't been trained before.  We could generate some sort of
            # checksum for the message and use that as an id (this would
            # mean that we didn't need to store the id with the message)
            # but that might be a little unreliable.
            self.classifier.learn(msg.asTokens(), isSpam)
        else:
            if msg.GetTrained() == (not isSpam):
                self.classifier.unlearn(msg.asTokens(), not isSpam)
                msg.RememberTrained(None)
            if msg.GetTrained() is None:
                self.classifier.learn(msg.asTokens(), isSpam)
                msg.RememberTrained(isSpam)

    def train_cached_message(self, id, isSpam):
        if not self.train_message_in_pop3proxy_cache(id, isSpam) and \
           not self.train_message_on_imap_server(id, isSpam):
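            # Adjacent literals are joined before % is applied; with "+"
            # only the last fragment would be formatted and raise TypeError.
            print ("Could not find message (%s); perhaps it was "
                   "deleted from the POP3Proxy cache or the IMAP "
                   "server.  This means that no training was done." % (id, ))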

    def train_message_in_pop3proxy_cache(self, id, isSpam):
        if self.state is None:
            return False
        sourceCorpus = None
        for corpus in [self.state.unknownCorpus, self.state.hamCorpus,
                       self.state.spamCorpus]:
            if corpus.get(id) is not None:
                sourceCorpus = corpus
                break
        if sourceCorpus is None:
            return False
        if isSpam:
            targetCorpus = self.state.spamCorpus
        else:
            targetCorpus = self.state.hamCorpus
        targetCorpus.takeMessage(id, sourceCorpus)
        self.classifier.store()
        # Report success so the caller does not fall back to IMAP.
        return True

    def train_message_on_imap_server(self, id, isSpam):
        if self.imap is None:
            return False
        msg = self.imap.FindMessage(id)
        if msg is None:
            return False
        if msg.GetTrained() == (not isSpam):
            msg.get_substance()
            msg.delSBHeaders()
            self.classifier.unlearn(msg.asTokens(), not isSpam)
            msg.RememberTrained(None)
        if msg.GetTrained() is None:
            msg.get_substance()
            msg.delSBHeaders()
            self.classifier.learn(msg.asTokens(), isSpam)
            msg.RememberTrained(isSpam)
        # The message was found, so tell the caller not to report a miss.
        return True

def LoadServerInfo():
    # Load the proxy settings
    servers = []
    proxyPorts = []
    if options["smtpproxy", "remote_servers"]:
        for server in options["smtpproxy", "remote_servers"]:
            server = server.strip()
            if server.find(':') > -1:
                server, port = server.split(':', 1)
            else:
                port = '25'
            servers.append((server, int(port)))
    if options["smtpproxy", "listen_ports"]:
        splitPorts = options["smtpproxy", "listen_ports"]
        proxyPorts = map(_addressAndPort, splitPorts)
    if len(servers) != len(proxyPorts):
        print >>sys.stderr, ("smtpproxy:remote_servers & "
                             "smtpproxy:listen_ports are different lengths!")
        sys.exit(1)
    return servers, proxyPorts

def CreateProxies(servers, proxyPorts, trainer):
    """Create BayesSMTPProxyListeners for all the given servers."""
    # allow for old versions of pop3proxy
    if not isinstance(trainer, SMTPTrainer):
        trainer = SMTPTrainer(trainer.bayes, trainer)
    proxyListeners = []
    for (server, serverPort), proxyPort in zip(servers, proxyPorts):
        listener = BayesSMTPProxyListener(server, serverPort, proxyPort,
                                          trainer)
        proxyListeners.append(listener)
    return proxyListeners

def main():
    """Runs the proxy until a 'KILL' command is received or someone hits
    Ctrl+Break."""
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hvd:D:')
    except getopt.error, msg:
        print >>sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit()

    bdbname = options["Storage", "persistent_storage_file"]
    useDBM = options["Storage", "persistent_use_database"]

    for opt, arg in opts:
        if opt == '-h':
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == '-d':
            useDBM = False
            bdbname = arg
        elif opt == '-D':
            useDBM = True
            bdbname = arg
        elif opt == '-v':
            options["globals", "verbose"] = True

    bdbname = os.path.expanduser(bdbname)
    
    if options["globals", "verbose"]:
        print "Loading database %s..." % (bdbname),
    
    if useDBM:
        classifier = storage.DBDictClassifier(bdbname)
    else:
        classifier = storage.PickledClassifier(bdbname)

    if options["globals", "verbose"]:
        print "Done."            

    servers, proxyPorts = LoadServerInfo()
    trainer = SMTPTrainer(classifier)
    proxyListeners = CreateProxies(servers, proxyPorts, trainer)
    Dibbler.run()


if __name__ == '__main__':
    main()
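
A note on the mail-id pattern used by _find_id_in_text above: the sketch below (written for Python 3, not the Python 2 the script targets) shows the same regular expression matching both a plain header line and an id wrapped in an HTML table. The header name "X-Spambayes-MailId" and the sample id are assumptions for illustration; the script reads the real name from options["Headers", "mailid_header_name"].

```python
import re

# Assumed header name, for illustration only.
HEADER_NAME = "X-Spambayes-MailId"

# Same pattern as in the script: an optional </th><td> pair may sit
# between the colon and the id when a MUA renders headers as a table.
header_re = re.compile(
    re.escape(HEADER_NAME) + r":\s*(\</th\>\s*\<td\>\s*)?([\d\-]+)"
)


def find_id_in_text(text):
    """Return the id that follows the header name, or None."""
    mo = header_re.search(text)
    if mo is None:
        return None
    return mo.group(2)


# Made-up sample inputs.
plain = "X-Spambayes-MailId: 1062717408-3\n"
html = "<tr><th>X-Spambayes-MailId:</th> <td>1062717408-3</td></tr>"

plain_id = find_id_in_text(plain)
html_id = find_id_in_text(html)
```

Group 1 is the optional table markup and group 2 is the id itself, which is why the method returns mo.group(2).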

--- NEW FILE: sb_unheader.py ---
#!/usr/bin/env python
"""
    unheader.py: cleans headers from email messages.  By default, this
    removes SpamAssassin headers; specify a pattern with -p to name
    additional headers to remove.

    This is often needed because existing spamassassin headers can
    provide killer spam clues, for all the wrong reasons.
"""

import re
import sys
import os
import glob
import mailbox
import email.Parser
import email.Message
import email.Generator
import getopt

def unheader(msg, pat):
    pat = re.compile(pat)
    for hdr in msg.keys():
        if pat.match(hdr):
            del msg[hdr]

# remain compatible with 2.2.1 - steal replace_header from 2.3 source
class Message(email.Message.Message):
    def replace_header(self, _name, _value):
        """Replace a header.

        Replace the first matching header found in the message, retaining
        header order and case.  If no matching header was found, a
        KeyError is raised.
        """
        _name = _name.lower()
        for i, (k, v) in zip(range(len(self._headers)), self._headers):
            if k.lower() == _name:
                self._headers[i] = (k, _value)
                break
        else:
            raise KeyError, _name

class Parser(email.Parser.HeaderParser):
    def __init__(self):
        email.Parser.Parser.__init__(self, Message)

def deSA(msg):
    if msg['X-Spam-Status']:
        if msg['X-Spam-Status'].startswith('Yes'):
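            pct = msg['X-Spam-Prev-Content-Type']
            if pct:
                # Assignment appends a header, so drop the old one first.
                del msg['Content-Type']
                msg['Content-Type'] = pct

            pcte = msg['X-Spam-Prev-Content-Transfer-Encoding']
            if pcte:
                del msg['Content-Transfer-Encoding']
                msg['Content-Transfer-Encoding'] = pcte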

            subj = re.sub(r'\*\*\*\*\*SPAM\*\*\*\*\* ', '',
                          msg['Subject'] or "")
            if subj != msg["Subject"]:
                msg.replace_header("Subject", subj)

            body = msg.get_payload()
            newbody = []
            at_start = 1
            for line in body.splitlines():
                if at_start and line.startswith('SPAM: '):
                    continue
                elif at_start:
                    at_start = 0
                newbody.append(line)
            msg.set_payload("\n".join(newbody))
    unheader(msg, "X-Spam-")

def process_message(msg, dosa, pats):
    if pats is not None:
        unheader(msg, pats)
    if dosa:
        deSA(msg)

def process_mailbox(f, dosa=1, pats=None):
    gen = email.Generator.Generator(sys.stdout, maxheaderlen=0)
    for msg in mailbox.PortableUnixMailbox(f, Parser().parse):
        process_message(msg, dosa, pats)
        gen(msg, unixfrom=1)

def process_maildir(d, dosa=1, pats=None):
    parser = Parser()
    for fn in glob.glob(os.path.join(d, "cur", "*")):
        print ("reading from %s..." % fn),
        infile = open(fn)
        msg = parser.parse(infile)
        infile.close()
        process_message(msg, dosa, pats)

        tmpfn = os.path.join(d, "tmp", os.path.basename(fn))
        tmpfile = open(tmpfn, "w")
        print "writing to %s" % tmpfn
        email.Generator.Generator(tmpfile, maxheaderlen=0)(msg, unixfrom=0)
        # Flush and close before renaming over the original message.
        tmpfile.close()

        os.rename(tmpfn, fn)

def usage():
    print >> sys.stderr, "usage: unheader.py [ -h ] [ -p pat ... ] [ -s ] [ -d ] folder"
    print >> sys.stderr, "-h prints this message and exits"
    print >> sys.stderr, "-p pat gives a regex pattern used to eliminate unwanted headers"
    print >> sys.stderr, "'-p pat' may be given multiple times"
    print >> sys.stderr, "-s means do not remove SpamAssassin headers"
    print >> sys.stderr, "-d means treat folder as a Maildir"

def main(args):
    headerpats = []
    dosa = 1
    ismbox = 1
    try:
        opts, args = getopt.getopt(args, "p:shd")
    except getopt.GetoptError:
        usage()
        sys.exit(1)
    else:
        for opt, arg in opts:
            if opt == "-h":
                usage()
                sys.exit(0)
            elif opt == "-p":
                headerpats.append(arg)
            elif opt == "-s":
                dosa = 0
            elif opt == "-d":
                ismbox = 0
        pats = headerpats and "|".join(headerpats) or None

        if len(args) != 1:
            usage()
            sys.exit(1)

        if ismbox:
            f = file(args[0])
            process_mailbox(f, dosa, pats)
        else:
            process_maildir(args[0], dosa, pats)

if __name__ == "__main__":
    main(sys.argv[1:])
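
For readers on a current interpreter, here is a rough Python 3 sketch of the header stripping that unheader() performs, followed by the replace_header() call that deSA() relies on. The sample message is made up, and email.message_from_string stands in for the script's own Parser subclass.

```python
import email
import re


def unheader(msg, pat):
    """Delete every header whose name matches the regex pat."""
    pat = re.compile(pat)
    # Iterate over a copy of the names, since headers are deleted in the loop.
    for hdr in list(msg.keys()):
        if pat.match(hdr):
            del msg[hdr]


# Made-up sample message carrying SpamAssassin markup.
raw = (
    "From: a@example.com\n"
    "X-Spam-Status: Yes, hits=7.2\n"
    "X-Spam-Level: *******\n"
    "Subject: *****SPAM***** hello\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(raw)
unheader(msg, "X-Spam-")
remaining = msg.keys()

# replace_header keeps the header's position; plain assignment would
# append a second Subject header instead.
msg.replace_header("Subject", "hello")
subject = msg["Subject"]
```

Note that pat.match() anchors at the start of the header name and is case-sensitive, so "X-Spam-" does not remove a header spelled "x-spam-status".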

--- NEW FILE: sb_upload.py ---
#!/usr/bin/env python

"""
Read a message or a mailbox file on standard input, upload it to a
web server and write it to standard output.

usage:  %(progname)s [-h] [-n] [-s server] [-p port] [-r N]

Options:
    -h, --help    - print help and exit
    -n, --null    - suppress writing to standard output (default %(null)s)
    -s, --server= - provide alternate web server (default %(server)s)
    -p, --port=   - provide alternate server port (default %(port)s)
    -r, --prob=   - feed the message to the trainer w/ prob N [0.0...1.0]
"""

import sys
import httplib
import mimetypes
import getopt
import random
from spambayes.Options import options

progname = sys.argv[0]

__author__ = "Skip Montanaro <skip at pobox.com>"
__credits__ = "Spambayes gang, Wade Leftwich"

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0

# appropriated verbatim from a recipe by Wade Leftwich in the Python
# Cookbook: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/146306

def post_multipart(host, selector, fields, files):
    """
    Post fields and files to an http host as multipart/form-data.  fields is
    a sequence of (name, value) elements for regular form fields.  files is
    a sequence of (name, filename, value) elements for data to be uploaded
    as files.  Return the server's response page.
    """
    content_type, body = encode_multipart_formdata(fields, files)
    h = httplib.HTTP(host)
    h.putrequest('POST', selector)
    h.putheader('content-type', content_type)
    h.putheader('content-length', str(len(body)))
    h.endheaders()
    h.send(body)
    errcode, errmsg, headers = h.getreply()
    return h.file.read()

def encode_multipart_formdata(fields, files):
    """
    fields is a sequence of (name, value) elements for regular form fields.
    files is a sequence of (name, filename, value) elements for data to be
    uploaded as files.  Return (content_type, body) ready for httplib.HTTP
    instance
    """
    BOUNDARY = '----------ThIs_Is_tHe_bouNdaRY_$'
    CRLF = '\r\n'
    L = []
    for (key, value) in fields:
        L.append('--' + BOUNDARY)
        L.append('Content-Disposition: form-data; name="%s"' % key)
        L.append('')
        L.append(value)
    for (key, filename, value) in files:
        L.append('--' + BOUNDARY)
        L.append('Content-Disposition: form-data; name="%s"; filename="%s"' % (key, filename))
        L.append('Content-Type: %s' % get_content_type(filename))
        L.append('')
        L.append(value)
    L.append('--' + BOUNDARY + '--')
    L.append('')
    body = CRLF.join(L)
    content_type = 'multipart/form-data; boundary=%s' % BOUNDARY
    return content_type, body

def get_content_type(filename):
    return mimetypes.guess_type(filename)[0] or 'application/octet-stream'

def usage(*args):
    defaults = {}
    for d in args:
        defaults.update(d)
    print __doc__ % defaults

def main(argv):
    null = False
    server = "localhost"
    port = options["html_ui", "port"]
    prob = 1.0

    try:
        opts, args = getopt.getopt(argv, "hns:p:r:",
                                   ["help", "null", "server=", "port=",
                                    "prob="])
    except getopt.error:
        usage(globals(), locals())
        sys.exit(1)

    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage(globals(), locals())
            sys.exit(0)
        elif opt in ("-n", "--null"):
            null = True
        elif opt in ("-s", "--server"):
            server = arg
        elif opt in ("-p", "--port"):
            port = int(arg)
        elif opt in ("-r", "--prob"):
            n = float(arg)
            if n < 0.0 or n > 1.0:
                usage(globals(), locals())
                sys.exit(1)
            prob = n

    if args:
        usage(globals(), locals())
        sys.exit(1)

    data = sys.stdin.read()
    sys.stdout.write(data)
    if random.random() < prob:
        try:
            post_multipart("%s:%d"%(server,port), "/upload", [],
                           [('file', 'message.dat', data)])
        except:
            # not an error if the server isn't responding
            pass

if __name__ == "__main__":
    main(sys.argv[1:])
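
To make the wire format built by encode_multipart_formdata() easier to picture, the following Python 3 sketch rebuilds the same body layout and leaves out the HTTP request itself. The message text is a made-up sample; the field name 'file' and the filename 'message.dat' are the ones main() passes.

```python
import mimetypes

BOUNDARY = "----------ThIs_Is_tHe_bouNdaRY_$"
CRLF = "\r\n"


def encode_multipart_formdata(fields, files):
    """Return (content_type, body) for a multipart/form-data POST."""
    lines = []
    for key, value in fields:
        lines.append("--" + BOUNDARY)
        lines.append('Content-Disposition: form-data; name="%s"' % key)
        lines.append("")
        lines.append(value)
    for key, filename, value in files:
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        lines.append("--" + BOUNDARY)
        lines.append(
            'Content-Disposition: form-data; name="%s"; filename="%s"'
            % (key, filename)
        )
        lines.append("Content-Type: %s" % ctype)
        lines.append("")
        lines.append(value)
    # The closing delimiter carries a trailing "--".
    lines.append("--" + BOUNDARY + "--")
    lines.append("")
    body = CRLF.join(lines)
    return "multipart/form-data; boundary=%s" % BOUNDARY, body


# Made-up message text, uploaded under the same names main() uses.
content_type, body = encode_multipart_formdata(
    [], [("file", "message.dat", "Subject: hi\n\nbody\n")]
)
```

Each part starts with "--" plus the boundary, and a blank line separates the part headers from the part content.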

--- NEW FILE: sb_xmlrpcserver.py ---
#! /usr/bin/env python

# A server version of hammie.py


"""Usage: %(program)s [options] IP:PORT

Where:
    -h
        show usage and exit
    -p FILE
        use file as the persistent store.  loads data from this file if it
        exists, and saves data to this file at the end.  Default: %(DEFAULTDB)s
    -d
        use the DBM store instead of cPickle.  The file is larger and
        creating it is slower, but checking against it is much faster,
        especially for large word databases.

    IP
        IP address to bind (use 0.0.0.0 to listen on all IPs of this machine)
    PORT
        Port number to listen to.
"""

import SimpleXMLRPCServer
import getopt
import sys
import traceback
import xmlrpclib
from spambayes import hammie

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0


program = sys.argv[0] # For usage(); referenced by docstring above

# Default DB path
DEFAULTDB = hammie.DEFAULTDB

class XMLHammie(hammie.Hammie):
    def score(self, msg, *extra):
        try:
            msg = msg.data
        except AttributeError:
            pass
        return xmlrpclib.Binary(hammie.Hammie.score(self, msg, *extra))

    def filter(self, msg, *extra):
        try:
            msg = msg.data
        except AttributeError:
            pass
        return xmlrpclib.Binary(hammie.Hammie.filter(self, msg, *extra))


class HammieHandler(SimpleXMLRPCServer.SimpleXMLRPCRequestHandler):
    def do_POST(self):
        """Handles the HTTP POST request.

        Attempts to interpret all HTTP POST requests as XML-RPC calls,
        which are forwarded to the _dispatch method for handling.

        This one also prints out tracebacks, to help me debug :)
        """

        try:
            # get arguments
            data = self.rfile.read(int(self.headers["content-length"]))
            params, method = xmlrpclib.loads(data)

            # generate response
            try:
                response = self._dispatch(method, params)
                # wrap response in a singleton tuple
                response = (response,)
            except:
                traceback.print_exc()
                # report exception back to client
                response = xmlrpclib.dumps(
                    xmlrpclib.Fault(1, "%s:%s" % (sys.exc_type, sys.exc_value))
                    )
            else:
                response = xmlrpclib.dumps(response, methodresponse=1)
        except:
            # internal error, report as HTTP server error
            traceback.print_exc()
            print `data`
            self.send_response(500)
            self.end_headers()
        else:
            # got a valid XML RPC response
            self.send_response(200)
            self.send_header("Content-type", "text/xml")
            self.send_header("Content-length", str(len(response)))
            self.end_headers()
            self.wfile.write(response)

            # shut down the connection
            self.wfile.flush()
            self.connection.shutdown(1)


def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)


def main():
    """Main program; parse options and go."""
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hdp:')
    except getopt.error, msg:
        usage(2, msg)

    pck = DEFAULTDB
    usedb = False
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-p':
            pck = arg
        elif opt == "-d":
            usedb = True

    if len(args) != 1:
        usage(2, "IP:PORT not specified")

    ip, port = args[0].split(":")
    port = int(port)

    bayes = hammie.createbayes(pck, usedb)
    h = XMLHammie(bayes)

    server = SimpleXMLRPCServer.SimpleXMLRPCServer((ip, port), HammieHandler)
    server.register_instance(h)
    server.serve_forever()

if __name__ == "__main__":
    main()
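
As an illustration of what do_POST() and XMLHammie.score() see on the wire, this Python 3 sketch encodes a request the way a client such as sb_client.py would, decodes it as the handler does, and unwraps the Binary payload. No server is started. The message bytes and the 0.5 reply are made-up values, and xmlrpc.client is the Python 3 name for the xmlrpclib module used above.

```python
import xmlrpc.client

# Made-up message; a client wraps it in Binary so that arbitrary bytes
# survive the XML encoding.
message = b"Subject: test\r\n\r\nhello\r\n"
request = xmlrpc.client.dumps(
    (xmlrpc.client.Binary(message),), methodname="score"
)

# Server side: loads() returns the parameter tuple and the method name.
params, method = xmlrpc.client.loads(request)

# Same unwrapping idiom as XMLHammie: accept Binary or a plain value.
payload = params[0]
try:
    payload = payload.data
except AttributeError:
    pass

# A successful reply wraps the result in a singleton tuple.
response = xmlrpc.client.dumps((0.5,), methodresponse=True)
result, _ = xmlrpc.client.loads(response)
```

The try/except around .data is what lets the same method accept either a Binary wrapper or an ordinary string argument.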

--- sb-client.py DELETED ---

--- sb-dbexpimp.py DELETED ---

--- sb-filter.py DELETED ---

--- sb-imapfilter.py DELETED ---

--- sb-mailsort.py DELETED ---

--- sb-mboxtrain.py DELETED ---

--- sb-notesfilter.py DELETED ---

--- sb-pop3dnd.py DELETED ---

--- sb-server.py DELETED ---

--- sb-smtpproxy.py DELETED ---

--- sb-unheader.py DELETED ---

--- sb-upload.py DELETED ---

--- sb-xmlrpcserver.py DELETED ---




