[Spambayes] Email client integration -- what's needed?

Mark Hammond mhammond@skippinet.com.au
Sat Nov 2 00:27:12 2002


>   TP> trained on as what.  Mark invented a bunch of code like that for
>   TP> the Outlook client, but there's really nothing Outlook-specific
>   TP> about it apart from the all Outlook-specific bits <wink>.  Those
>   TP> could be factored out, though.
>
> I should look at integrating Mark's code and my own training system
> based on VM folders.  See what common code falls out.

I tend to filter the Python zen thusly:

% python -c "import this" | grep purity
Although practicality beats purity.

However, I have tried to think a little about what a generic system would
look like.  For example, I tried to create a generic "message" object
family:

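class MsgStore:
    def Close(self):
        raise NotImplementedError
    def GetFolderGenerator(self, folder_ids, include_sub):
        raise NotImplementedError
    def GetFolder(self, folder_id):
        raise NotImplementedError
    def GetMessage(self, message_id):
        raise NotImplementedError

class MsgStoreFolder:
    def GetMessageGenerator(self):
        # Yield each message in this folder.
        raise NotImplementedError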

class MsgStoreMsg:
    def GetEmailPackageObject(self, strip_mime_headers=True):
        # Return a "read-only" Python email package object
        # "read-only" in that changes will never be reflected to the real
        # store.
        raise NotImplementedError
    def SetField(self, name, value):
        # Abstractly set a user field name/id to a field value.
        # User field is for the user to see - status/internal fields
        # should get their own methods
        raise NotImplementedError
    def GetField(self, name):
        # Abstractly get the value of a user field by name/id.
        raise NotImplementedError
    def Save(self):
        # Save changes after field changes.
        raise NotImplementedError
    def MoveTo(self, folder_id):
        # Move the message to a folder.
        raise NotImplementedError
    def CopyTo(self, folder_id):
        # Copy the message to a folder.
        raise NotImplementedError
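
To make the shape of that interface concrete, here is a minimal sketch of an
in-memory implementation. It is only an illustration, not the Outlook code:
the names DictStore, DictFolder, DictMsg and the add() helper are
hypothetical, folders are plain lists of message ids, and
strip_mime_headers is accepted but ignored.

```python
import email


class DictMsg:
    # Hypothetical in-memory message; not part of the Outlook code.
    def __init__(self, store, message_id, folder_id, raw_text):
        self.store = store
        self.id = message_id
        self.folder_id = folder_id
        self.raw_text = raw_text
        self.fields = {}

    def GetEmailPackageObject(self, strip_mime_headers=True):
        # A fresh parse on every call, so changes made to the returned
        # object never reach the store ("read-only" in that sense).
        # strip_mime_headers is ignored in this sketch.
        return email.message_from_string(self.raw_text)

    def SetField(self, name, value):
        self.fields[name] = value

    def GetField(self, name):
        return self.fields.get(name)

    def Save(self):
        # Nothing to flush for an in-memory store.
        pass

    def MoveTo(self, folder_id):
        contents = self.store.folder_contents
        contents[self.folder_id].remove(self.id)
        contents.setdefault(folder_id, []).append(self.id)
        self.folder_id = folder_id


class DictFolder:
    def __init__(self, store, folder_id):
        self.store = store
        self.folder_id = folder_id

    def GetMessageGenerator(self):
        # Iterate over a copy so a MoveTo() during iteration is safe.
        for message_id in list(self.store.folder_contents.get(self.folder_id, [])):
            yield self.store.GetMessage(message_id)


class DictStore:
    def __init__(self):
        self.messages = {}         # message_id -> DictMsg
        self.folder_contents = {}  # folder_id -> list of message ids

    def add(self, folder_id, message_id, raw_text):
        # Hypothetical helper for populating the store.
        msg = DictMsg(self, message_id, folder_id, raw_text)
        self.messages[message_id] = msg
        self.folder_contents.setdefault(folder_id, []).append(message_id)
        return msg

    def Close(self):
        pass

    def GetFolder(self, folder_id):
        return DictFolder(self, folder_id)

    def GetMessage(self, message_id):
        return self.messages[message_id]


store = DictStore()
store.add("inbox", "m1", "Subject: hello\n\nbody text\n")
store.GetMessage("m1").MoveTo("spam")
subjects = [m.GetEmailPackageObject()["Subject"]
            for m in store.GetFolder("spam").GetMessageGenerator()]
```

A VM-folder or mbox backend would presumably fill in the same methods with
its own notion of folder and message ids.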

The essence of our training code is then:

def train_folder(f, isspam, mgr, progress):
    # fancy progress reporting code omitted
    for message in f.GetMessageGenerator():
        train_message(message, isspam, mgr)

def train_message(msg, is_spam, mgr):
    # Train an individual message.
    # Returns True if newly added (message will be correctly
    # untrained if it was in the wrong category), False if already
    # in the correct category.  Catch your own damn exceptions.
    from tokenizer import tokenize
    stream = msg.GetEmailPackageObject()
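    # Materialise the tokens: if tokenize() is a generator it would be
    # exhausted by unlearn() before learn() gets to see it.
    tokens = list(tokenize(stream))
    # Handle the case where the message has already been trained.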
    was_spam = mgr.message_db.get(msg.searchkey)
    if was_spam is None:
        # never previously trained.
        pass
    elif was_spam == is_spam:
        # Already in DB - do nothing (full retrain will wipe msg db)
        # leave now.
        return False
    else:
        mgr.bayes.unlearn(tokens, was_spam, False)
    # OK - setup the new data.
    mgr.bayes.learn(tokens, is_spam, False)
    mgr.message_db[msg.searchkey] = is_spam
    mgr.bayes_dirty = True
    return True
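
The retrain logic can be exercised without Outlook or the real classifier.
The following is a rough sketch under stated assumptions: CountingBayes,
Manager, Msg and the whitespace tokenize() are hypothetical stand-ins
invented for this example, and train_message() is restated in condensed
form so the block runs on its own.

```python
import email


def tokenize(msg):
    # Hypothetical tokenizer: lower-cased, whitespace-split body words.
    for word in msg.get_payload().split():
        yield word.lower()


class CountingBayes:
    # Stand-in classifier that only counts tokens per category.
    def __init__(self):
        self.counts = {}  # (token, is_spam) -> count

    def learn(self, tokens, is_spam, update_probabilities=True):
        for token in tokens:
            key = (token, is_spam)
            self.counts[key] = self.counts.get(key, 0) + 1

    def unlearn(self, tokens, is_spam, update_probabilities=True):
        for token in tokens:
            key = (token, is_spam)
            self.counts[key] -= 1
            if self.counts[key] == 0:
                del self.counts[key]


class Manager:
    def __init__(self):
        self.bayes = CountingBayes()
        self.message_db = {}  # searchkey -> is_spam
        self.bayes_dirty = False


class Msg:
    def __init__(self, searchkey, raw_text):
        self.searchkey = searchkey
        self.raw_text = raw_text

    def GetEmailPackageObject(self, strip_mime_headers=True):
        return email.message_from_string(self.raw_text)


def train_message(msg, is_spam, mgr):
    # list() so the tokens survive being used by both unlearn and learn.
    tokens = list(tokenize(msg.GetEmailPackageObject()))
    was_spam = mgr.message_db.get(msg.searchkey)
    if was_spam == is_spam:
        return False  # already trained in this category
    if was_spam is not None:
        mgr.bayes.unlearn(tokens, was_spam, False)  # undo the old category
    mgr.bayes.learn(tokens, is_spam, False)
    mgr.message_db[msg.searchkey] = is_spam
    mgr.bayes_dirty = True
    return True


mgr = Manager()
msg = Msg("key-1", "Subject: offer\n\ncheap pills\n")
first = train_message(msg, True, mgr)   # new message, trained as spam
again = train_message(msg, True, mgr)   # same category, nothing to do
moved = train_message(msg, False, mgr)  # untrained as spam, trained as ham
```

After the three calls only the ham counts remain, which is the "correctly
untrained" behaviour described in the comment at the top of train_message.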

As Tim says, there is not much that is Outlook-specific here (some, e.g.
"msg.searchkey", but nothing too painful).

Mark.