[Spambayes] Corpus module (was: Upgrade problem)

Thu Nov 7 20:53:06 2002

[Tim Peters]
> it's another reason to create a dedicated "training class" module,
> so that various clients can at least share an *interface* for doing such
> stuff

Tim Stone and I have made a start on this (or rather Tim has and I've poked
my nose in) - I mention it because he's away until the weekend and we
wouldn't want anyone to duplicate the work.

It may be too early to talk details (and slightly rude in Tim's absence -
my apologies!) but here's the email I sent to Tim outlining how I thought
it might work.  I was thinking more about generic Message and Corpus
classes than specifically about training.

Laughing and pointing should be directed towards me rather than Tim.

-------------------------------------------------------------------------

[Tim S]
> We would include methods in Corpus to add a message to, remove a message from, 
> move from one to another, with the appropriate untraining/retraining built in.   
> We *could* have a method that, given a message substance (headers and body) 
> would find an existing message in a corpus that matched it (somehow).  We 
> would include metadata with the corpus that tells us whether it's a 
> spam/ham/untrained corpus, so the retraining can be done.  We could even 
> include a fourth type of corpus (cache) with methods to use expiry data in the 
> message metadata to remove old cache messages...

This is excellent stuff.  A Corpus contains Messages.  CacheCorpus is a
subclass of Corpus that adds the concept of expiry, and contains
CachedMessages (CachedMessage being a subclass of Message) that know about
their own expiry details (time of creation, size, time of last use,
whatever it depends on).  That's very neat.
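
To make the expiry idea concrete, here is a minimal sketch. It assumes a purely age-based policy. The names `hasExpired`, `removeExpiredMessages`, `maxAgeSeconds` and the `now` parameter are illustrative inventions, and the tiny `Message` and `Corpus` classes are stand-ins so the example runs on its own.

```python
import time


class Message:
    """Minimal stand-in for the Message base class."""

    def __init__(self, messageName, messageText):
        self.messageName = messageName
        self.messageText = messageText

    def name(self):
        return self.messageName


class CachedMessage(Message):
    """A Message that records its own creation time."""

    def __init__(self, messageName, messageText, now=None):
        Message.__init__(self, messageName, messageText)
        if now is None:
            now = time.time()
        self.createTime = now

    def hasExpired(self, maxAgeSeconds, now=None):
        """True if the message is older than maxAgeSeconds (assumed policy)."""
        if now is None:
            now = time.time()
        return now - self.createTime > maxAgeSeconds


class Corpus:
    """Minimal stand-in: a dictionary of messages keyed by name."""

    def __init__(self):
        self.messages = {}

    def addMessage(self, message):
        self.messages[message.name()] = message


class CacheCorpus(Corpus):
    """A Corpus that can throw away messages that are too old."""

    def __init__(self, maxAgeSeconds):
        Corpus.__init__(self)
        self.maxAgeSeconds = maxAgeSeconds

    def removeExpiredMessages(self, now=None):
        """Removes expired messages and returns their names."""
        expired = [name for name, message in self.messages.items()
                   if message.hasExpired(self.maxAgeSeconds, now)]
        for name in expired:
            del self.messages[name]
        return expired
```

Passing `now` explicitly is only there to make the behaviour easy to check. A policy based on size or time of last use would slot into `hasExpired` in the same way.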

A Corpus wouldn't know how to create Message objects, nor would a Message
object know how to create itself - classes *derived from* them would know
how to do that.  For instance (totally untested code, probably full of
typos) -

class Message:
    def __init__(self, messageText):
        """Pass in the text of the message, headers and body."""
        # etc.

    def name(self):
        """Returns a name for this message which is unique within its
        corpus."""
        raise NotImplementedError

class FileMessage(Message):
    """A Message representing an email stored in a file on disk."""

    def __init__(self, pathname):
        self.pathname = pathname
        messageFile = open(self.pathname)
        messageText = messageFile.read()
        messageFile.close()
        # Unbound base-class call, so self must be passed explicitly.
        Message.__init__(self, messageText)

    def name(self):
        return self.pathname

...so the Message class dictates that every Message must have a name that is
unique within its corpus, but doesn't dictate how that name is determined.
Concrete Message-derived classes fill in that detail.  [I may be putting too
much into the base class by demanding that the text of the message be given
to the constructor - that precludes making FileMessage lazy, so that it only
reads the file when it needs to.]
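
One way round that, sketched under the assumption that the base class asks for the text through a method rather than taking it in the constructor. The names `getText` and `LazyFileMessage` are hypothetical, not part of any existing code.

```python
class Message:
    """Base class: subclasses say how to name and load a message."""

    def name(self):
        """Returns a name which is unique within the message's corpus."""
        raise NotImplementedError

    def getText(self):
        """Returns the full text of the message, headers and body."""
        raise NotImplementedError


class LazyFileMessage(Message):
    """A Message stored in a file, read only when its text is wanted."""

    def __init__(self, pathname):
        self.pathname = pathname
        self._text = None   # Nothing has been read from disk yet.

    def name(self):
        return self.pathname

    def getText(self):
        if self._text is None:
            # First request: read the file and remember the result.
            with open(self.pathname) as messageFile:
                self._text = messageFile.read()
        return self._text
```

Constructing a `LazyFileMessage` never touches the disk, so any error from a missing file surfaces on the first `getText()` call rather than at construction time.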

'Corpus' works the same way; again, the details may be naive, but this is
the general idea:

class Corpus:
    """A collection of Message objects."""

    def __getitem__(self, messageName):
        """Makes Corpus act like a dictionary: a la corpus[messageName]"""
        raise NotImplementedError

import fnmatch
import os

class DirectoryCorpus(Corpus):
    """Represents a corpus of messages stored as individual files in a
    directory.  Example: corpus = DirectoryCorpus('mydir', '*.msg')"""

    def __init__(self, directoryPathname, globPattern):
        self.directoryPathname = directoryPathname
        self.globPattern = globPattern
        self.messageCache = {}  # The messages we've read from disk so far.

    def __getitem__(self, messageName):
        try:
            return self.messageCache[messageName]
        except KeyError:
            if not fnmatch.fnmatch(messageName, self.globPattern):
                raise KeyError("Message name doesn't match naming pattern")
            pathname = os.path.join(self.directoryPathname, messageName)
            message = FileMessage(pathname)  # May raise IOError - let it.
            self.messageCache[messageName] = message
            return message

Here I've implemented the laziness idea by only reading the file when it's
asked for.

Maybe the message cache should go in Corpus - that would be useful for
*all* Corpus implementations.
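
A minimal sketch of what moving the cache into the base class might look like. It assumes subclasses supply a single hook for creating a message that isn't cached yet. `makeMessage` is a hypothetical name, and `DictCorpus` is a toy subclass that exists only to exercise the idea.

```python
class Corpus:
    """A collection of messages, with a cache shared by all subclasses."""

    def __init__(self):
        self.messageCache = {}   # Messages created so far, by name.

    def __getitem__(self, messageName):
        try:
            return self.messageCache[messageName]
        except KeyError:
            # Not cached: ask the subclass to create it, then remember it.
            message = self.makeMessage(messageName)
            self.messageCache[messageName] = message
            return message

    def makeMessage(self, messageName):
        """Subclasses create the named message, or raise KeyError."""
        raise NotImplementedError


class DictCorpus(Corpus):
    """Toy corpus backed by a dictionary of message texts."""

    def __init__(self, texts):
        Corpus.__init__(self)
        self.texts = texts
        self.loadCount = 0   # How many times makeMessage did real work.

    def makeMessage(self, messageName):
        self.loadCount += 1
        return self.texts[messageName]   # KeyError for unknown names.
```

With this split, DirectoryCorpus would only need to implement the hook (the pattern check and the FileMessage construction) and would inherit the caching.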

You can then envisage a MailboxCorpus, an OutlookFolderCorpus, an
IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.
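
As an illustration of one of these, here is a rough sketch of a MailboxCorpus built on the standard library's `mailbox` module as it exists in current Python. It assumes messages are named by the string form of their mbox key. `MailboxMessage` and the `keys` method are illustrative names, and the base classes are cut-down stand-ins.

```python
import mailbox


class Message:
    """Minimal stand-in for the Message base class."""

    def __init__(self, messageText):
        self.messageText = messageText

    def name(self):
        raise NotImplementedError


class MailboxMessage(Message):
    """A Message taken from an mbox file, named by its key in that file."""

    def __init__(self, key, messageText):
        Message.__init__(self, messageText)
        self.key = key

    def name(self):
        return str(self.key)


class Corpus:
    """Minimal stand-in for the Corpus base class."""

    def __getitem__(self, messageName):
        raise NotImplementedError


class MailboxCorpus(Corpus):
    """A corpus of the messages held in a single mbox file."""

    def __init__(self, mboxPathname):
        # create=False: a missing file is an error, not a new empty mailbox.
        self.mbox = mailbox.mbox(mboxPathname, create=False)

    def keys(self):
        """Returns the names of all the messages in the mailbox."""
        return [str(key) for key in self.mbox.keys()]

    def __getitem__(self, messageName):
        try:
            key = int(messageName)   # mbox keys are integers.
        except ValueError:
            raise KeyError(messageName)
        # get_string raises KeyError itself for an unknown key.
        return MailboxMessage(key, self.mbox.get_string(key))
```

Whether mbox keys are stable enough to serve as long-lived message names is an open question. A real implementation might prefer something like the Message-ID header.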

> move [Messages] from one [Corpus] to another, with the appropriate
> untraining/retraining built in.   

Yes - this could work using observer objects registered with Corpus
objects:

class CorpusObserver:
    """Derive your class from this and call corpus.addObserver to be
    informed when things happen to a corpus."""

    def onAddMessage(self, corpus, message):
        """Called when a message is added to a corpus."""
        pass   # Not NotImplementedError, so that people don't have to
               # implement *all* the event functions of CorpusObserver.

class Corpus:
    def __init__(self):
        self.observers = []   # List of CorpusObservers to inform of events
        self.messageCache = {}   # Messages by name; addMessage stores here

    def addObserver(self, observer):
        self.observers.append(observer)

    def addMessage(self, message):
        """External code adds messages by calling this - for example, in an
        OutlookCorpus it would be called as a result of the user dragging
        a message into the folder."""
        self.messageCache[message.name()] = message
        for observer in self.observers:
            observer.onAddMessage(self, message)

class AutoTrainer(CorpusObserver):
    """Trains the given classifier when messages are added to or removed
    from the given Ham/Spam corpuses."""

    def __init__(self, bayes, hamCorpus, spamCorpus):
        self.bayes = bayes
        self.hamCorpus = hamCorpus
        self.spamCorpus = spamCorpus
        hamCorpus.addObserver(self)
        spamCorpus.addObserver(self)

    def onAddMessage(self, corpus, message):
        if corpus == self.spamCorpus:
            self.bayes.learn(tokenize(message), True)
        else:
            assert corpus == self.hamCorpus, "Unknown corpus"
            self.bayes.learn(tokenize(message), False)

...and likewise for removeMessage, onRemoveMessage and unlearn.
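
Spelled out, the removal side might look like the sketch below. It assumes the classifier offers an `unlearn` method mirroring `learn`. `SimpleMessage`, the whitespace `tokenize` and `RecordingClassifier` are stand-ins invented so the example runs on its own; they are not the project's real tokenizer or classifier.

```python
class SimpleMessage:
    """Stand-in message: just a name and some text."""

    def __init__(self, messageName, text):
        self.messageName = messageName
        self.text = text

    def name(self):
        return self.messageName


def tokenize(message):
    """Stand-in tokenizer: whitespace-separated words."""
    return message.text.split()


class RecordingClassifier:
    """Stand-in classifier that only records what it was asked to do."""

    def __init__(self):
        self.calls = []

    def learn(self, tokens, isSpam):
        self.calls.append(("learn", isSpam))

    def unlearn(self, tokens, isSpam):
        self.calls.append(("unlearn", isSpam))


class CorpusObserver:
    def onAddMessage(self, corpus, message):
        pass

    def onRemoveMessage(self, corpus, message):
        pass


class Corpus:
    def __init__(self):
        self.observers = []
        self.messageCache = {}

    def addObserver(self, observer):
        self.observers.append(observer)

    def addMessage(self, message):
        self.messageCache[message.name()] = message
        for observer in self.observers:
            observer.onAddMessage(self, message)

    def removeMessage(self, message):
        del self.messageCache[message.name()]
        for observer in self.observers:
            observer.onRemoveMessage(self, message)


class AutoTrainer(CorpusObserver):
    def __init__(self, bayes, hamCorpus, spamCorpus):
        self.bayes = bayes
        self.hamCorpus = hamCorpus
        self.spamCorpus = spamCorpus
        hamCorpus.addObserver(self)
        spamCorpus.addObserver(self)

    def isSpamCorpus(self, corpus):
        assert corpus in (self.hamCorpus, self.spamCorpus), "Unknown corpus"
        return corpus is self.spamCorpus

    def onAddMessage(self, corpus, message):
        self.bayes.learn(tokenize(message), self.isSpamCorpus(corpus))

    def onRemoveMessage(self, corpus, message):
        self.bayes.unlearn(tokenize(message), self.isSpamCorpus(corpus))
```

Moving a message from ham to spam is then `hamCorpus.removeMessage(message)` followed by `spamCorpus.addMessage(message)`, and the untraining and retraining happen as a side effect.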

> I'm going to be travelling for the rest of the week, and may not be able to 
> connect, so you may not hear from me till Friday sometime...

OK.  Hopefully this will get to you before you leave, and give you plenty
to think about.  You might want to run it past Tim Peters, 'cos he's *far*
better at this kind of thing than I am (though he's also busy).  I think
this is the sort of thing he has in mind.

Most of the *new* code that's needed is defining the abstract concepts and
their interfaces, rather than writing code that actually *does* anything -
it's building a framework.

Once the framework is there, most of the code needed to implement the
functionality should already be in the project - code to hook into Outlook,
to train on a message, to parse mbox files, and so on.  It just needs
hooking into the framework.

The mark of a good framework is when you write a tiny little class (like
AutoTrainer above for instance) that contains hardly any code but adds a
major new feature (in this case, automatic training when moving messages
around in Outlook).

-------------------------------------------------------------------------

-- 
Richie Hindle
richie@entrian.com