[Spambayes] Email client integration -- what's needed?
Mark Hammond
mhammond@skippinet.com.au
Sat Nov 2 00:27:12 2002
> TP> trained on as what. Mark invented a bunch of code like that for
> TP> the Outlook client, but there's really nothing Outlook-specific
> TP> about it apart from the all Outlook-specific bits <wink>. Those
> TP> could be factored out, though.
>
> I should look at integrating Mark's code and my own training system
> based on VM folders. See what common code falls out.
I tend to filter the Python zen thusly:
% python -c "import this" | grep purity
Although practicality beats purity.
However, I have tried to think a little about what a generic system would
look like. For example, I tried to create a generic "message" object
family:
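class MsgStore:
    def Close(self):
        raise NotImplementedError
    def GetFolderGenerator(self, folder_ids, include_sub):
        raise NotImplementedError
    def GetFolder(self, folder_id):
        raise NotImplementedError
    def GetMessage(self, message_id):
        raise NotImplementedError

class MsgStoreFolder:
    def GetMessageGenerator(self, folder):
        raise NotImplementedError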
class MsgStoreMsg:
    def GetEmailPackageObject(self, strip_mime_headers=True):
        # Return a "read-only" Python email package object
        # "read-only" in that changes will never be reflected to the
        # real store.
        raise NotImplementedError
    def SetField(self, name, value):
        # Abstractly set a user field name/id to a field value.
        # User field is for the user to see - status/internal fields
        # should get their own methods
        raise NotImplementedError
    def GetField(self, name):
        # Abstractly get the value of a user field by name/id.
        raise NotImplementedError
    def Save(self):
        # Save changes after field changes.
        raise NotImplementedError
    def MoveTo(self, folder_id):
        # Move the message to a folder.
        raise NotImplementedError
    def CopyTo(self, folder_id):
        # Copy the message to a folder.
        raise NotImplementedError
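
To make the interface above a little more concrete, here is a minimal sketch of a non-Outlook backend that keeps everything in memory. All of the names (DictStore, DictFolder, DictMsg, Add) are hypothetical and not part of the Outlook code; it also assumes GetMessageGenerator takes no folder argument, and it ignores strip_mime_headers.

```python
# Hypothetical in-memory backend; these names are illustrative only.
import email


class DictMsg:
    """A message held as raw RFC 2822 text plus a dict of user fields."""

    def __init__(self, message_id, raw_text, folder_id):
        self.searchkey = message_id  # stand-in for a store-specific key
        self.raw_text = raw_text
        self.folder_id = folder_id
        self.fields = {}

    def GetEmailPackageObject(self, strip_mime_headers=True):
        # Parse a fresh object on every call, so changes made to the
        # result never reach the stored text ("read-only").
        # strip_mime_headers is accepted but ignored in this sketch.
        return email.message_from_string(self.raw_text)

    def SetField(self, name, value):
        self.fields[name] = value

    def GetField(self, name):
        return self.fields.get(name)

    def Save(self):
        pass  # nothing to flush for an in-memory store

    def MoveTo(self, folder_id):
        self.folder_id = folder_id


class DictFolder:
    def __init__(self, folder_id, store):
        self.folder_id = folder_id
        self.store = store

    def GetMessageGenerator(self):
        # Yield every message currently filed under this folder.
        for msg in self.store.messages.values():
            if msg.folder_id == self.folder_id:
                yield msg


class DictStore:
    def __init__(self):
        self.messages = {}

    def Add(self, message_id, raw_text, folder_id):
        # Hypothetical helper for populating the store.
        self.messages[message_id] = DictMsg(message_id, raw_text, folder_id)

    def GetFolder(self, folder_id):
        return DictFolder(folder_id, self)

    def GetMessage(self, message_id):
        return self.messages[message_id]

    def Close(self):
        self.messages.clear()
```

Anything that only talks to these methods (folder iteration, field get/set, moving messages) should not care whether the store behind it is MAPI, VM folders, or a plain dict like this one.
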
The essence of our training code is then:
def train_folder(f, isspam, mgr, progress):
    # fancy progress reporting code omitted
    for message in f.GetMessageGenerator():
        train_message(message, isspam, mgr)

def train_message(msg, is_spam, mgr):
    # Train an individual message.
    # Returns True if newly added (message will be correctly
    # untrained if it was in the wrong category), False if already
    # in the correct category. Catch your own damn exceptions.
    from tokenizer import tokenize
    stream = msg.GetEmailPackageObject()
    # Materialize the tokens in case tokenize() returns a generator:
    # they may be consumed twice (unlearn, then learn).
    tokens = list(tokenize(stream))
    # Handle the case where we may have already been trained.
    was_spam = mgr.message_db.get(msg.searchkey)
    if was_spam is None:
        # never previously trained.
        pass
    elif was_spam == is_spam:
        # Already in DB - do nothing (full retrain will wipe msg db)
        # leave now.
        return False
    else:
        mgr.bayes.unlearn(tokens, was_spam, False)
    # OK - set up the new data.
    mgr.bayes.learn(tokens, is_spam, False)
    mgr.message_db[msg.searchkey] = is_spam
    mgr.bayes_dirty = True
    return True
As Tim says, not much here is Outlook-specific (some is, e.g. "msg.searchkey",
but nothing too painful).
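
To show the retrain logic without Outlook or the real classifier, here is a rough sketch that runs the same steps against stand-ins. FakeBayes, FakeManager, FakeMsg and the whitespace tokenize are all hypothetical; the only things taken from the code above are the learn/unlearn call shapes and the message_db, bayes_dirty and searchkey names.

```python
def tokenize(msg):
    # Stand-in for the real tokenizer: split the text on whitespace.
    return msg.text.split()


class FakeBayes:
    """Counts trained messages instead of tracking word statistics."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam, update_probabilities=True):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam, update_probabilities=True):
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


class FakeManager:
    def __init__(self):
        self.bayes = FakeBayes()
        self.message_db = {}  # searchkey -> is_spam at last training
        self.bayes_dirty = False


class FakeMsg:
    def __init__(self, searchkey, text):
        self.searchkey = searchkey
        self.text = text

    def GetEmailPackageObject(self, strip_mime_headers=True):
        return self  # the fake tokenizer only needs .text


def train_message(msg, is_spam, mgr):
    tokens = list(tokenize(msg.GetEmailPackageObject()))
    was_spam = mgr.message_db.get(msg.searchkey)
    if was_spam is None:
        pass  # never previously trained
    elif was_spam == is_spam:
        return False  # already trained in this category
    else:
        # Trained in the other category: undo that first.
        mgr.bayes.unlearn(tokens, was_spam, False)
    mgr.bayes.learn(tokens, is_spam, False)
    mgr.message_db[msg.searchkey] = is_spam
    mgr.bayes_dirty = True
    return True
```

Training the same FakeMsg as spam twice returns True and then False. Training it again as ham returns True, and the counts move from one spam and no ham to no spam and one ham, which is the "correctly untrained" case described in the comment.
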
Mark.