[Spambayes] Re: Centralization (was: pedantism)

Thu Feb 6 23:42:10 EST 2003

Neale, here is the text of some messages that deal with various aspects of
centralization of classification and message handling in spambayes.  It's a
bit of reading, but the gist is that training, classification, and message
handling should be done in one place regardless of which front-end is being
integrated, and that the current Corpus and storage modules do not fit the
bill, particularly for Outlook.  This is because the paradigm used in Corpus
has a fairly serious impedance mismatch with Outlook's message storage
mechanism.  When I did the corpus stuff, it was specifically to support the
pop3proxy.  I think now is the time to really rethink the abstractions
involved, make them work correctly for all the clients we now have, and
position them for whatever will come along... I'm already thinking about
Lotus Notes, for instance.  Here ya go, tell me what you think...  - TimS

Neale Pickett <neale at woozle.org> wrote:

Mark Hammond wrote:

I tend to filter the Python zen thusly:

% python -c "import this" | grep purity
Although practicality beats purity.

However, I have tried to think a little about what a generic system would
look like.  For example, I tried to create a generic "message" object
family:

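class MsgStore:
    # Abstract interface: concrete stores override each method.
    def Close(self):
        raise NotImplementedError
    def GetFolderGenerator(self, folder_ids, include_sub):
        raise NotImplementedError
    def GetFolder(self, folder_id):
        raise NotImplementedError
    def GetMessage(self, message_id):
        raise NotImplementedError

class MsgStoreFolder:
    def GetMessageGenerator(self, folder):
        raise NotImplementedError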
class MsgStoreMsg:
    def GetEmailPackageObject(self, strip_mime_headers=True):
        # Return a "read-only" Python email package object
        # "read-only" in that changes will never be reflected to the real
        # store.
        raise NotImplementedError
    def SetField(self, name, value):
        # Abstractly set a user field name/id to a field value.
        # User field is for the user to see - status/internal fields
        # should get their own methods
        raise NotImplementedError
    def GetField(self, name):
        # Abstractly get a user field name/id to a field value.
        raise NotImplementedError
    def Save(self):
        # Save changes after field changes.
        raise NotImplementedError
    def MoveTo(self, folder_id):
        # Move the message to a folder.
        raise NotImplementedError
    def CopyTo(self, folder_id):
        # Copy the message to a folder.
        raise NotImplementedError

The essence of our training code is then:

def train_folder(f, isspam, mgr, progress):
    # fancy progress reporting code omitted
    for message in f.GetMessageGenerator():
        train_message(message, isspam, mgr)

def train_message(msg, is_spam, mgr):
    # Train an individual message.
    # Returns True if newly added (message will be correctly
    # untrained if it was in the wrong category), False if already
    # in the correct category.  Catch your own damn exceptions.
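    from tokenizer import tokenize
    stream = msg.GetEmailPackageObject()
    # Materialize the tokens: they may be consumed twice below (unlearn,
    # then learn), which would go wrong if tokenize() returns a one-shot
    # iterator.
    tokens = list(tokenize(stream))
    # Handle the case where we may have already been trained.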
    was_spam = mgr.message_db.get(msg.searchkey)
    if was_spam is None:
        # never previously trained.
        pass
    elif was_spam == is_spam:
        # Already in DB - do nothing (full retrain will wipe msg db)
        # leave now.
        return False
    else:
        mgr.bayes.unlearn(tokens, was_spam, False)
    # OK - setup the new data.
    mgr.bayes.learn(tokens, is_spam, False)
    mgr.message_db[msg.searchkey] = is_spam
    mgr.bayes_dirty = True
    return True

As Tim says, not much Outlook specific here (some - eg, "msg.searchkey" -
but nothing too painful)
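
To see the three outcomes of that retraining logic (newly trained, already
correct, moved to the other category), here is a minimal Python 3 sketch that
runs the same bookkeeping against in-memory stand-ins.  FakeBayes,
FakeManager and FakeMsg are hypothetical names invented for this
illustration, and a whitespace split stands in for the real tokenizer; none
of this is the project's actual classifier.

```python
class FakeBayes:
    """Hypothetical classifier stand-in: keeps per-class token counts."""

    def __init__(self):
        self.counts = {True: {}, False: {}}

    def learn(self, tokens, is_spam, update_probabilities=True):
        for tok in tokens:
            self.counts[is_spam][tok] = self.counts[is_spam].get(tok, 0) + 1

    def unlearn(self, tokens, is_spam, update_probabilities=True):
        for tok in tokens:
            self.counts[is_spam][tok] -= 1


class FakeManager:
    """Hypothetical manager holding the classifier and the message db."""

    def __init__(self):
        self.bayes = FakeBayes()
        self.message_db = {}      # searchkey -> True (spam) / False (ham)
        self.bayes_dirty = False


class FakeMsg:
    """Hypothetical message: an ID plus plain text."""

    def __init__(self, searchkey, text):
        self.searchkey = searchkey
        self.text = text


def train_message(msg, is_spam, mgr):
    # Same control flow as the posted function, with a trivial tokenizer.
    tokens = msg.text.split()
    was_spam = mgr.message_db.get(msg.searchkey)
    if was_spam == is_spam:
        return False                      # already in the right category
    if was_spam is not None:
        mgr.bayes.unlearn(tokens, was_spam, False)   # undo old training
    mgr.bayes.learn(tokens, is_spam, False)
    mgr.message_db[msg.searchkey] = is_spam
    mgr.bayes_dirty = True
    return True


mgr = FakeManager()
msg = FakeMsg("id-1", "cheap pills")
first = train_message(msg, True, mgr)     # never trained before
again = train_message(msg, True, mgr)     # same category, nothing to do
flipped = train_message(msg, False, mgr)  # reclassified as ham
```

After the last call the spam counts for both tokens are back to zero and the
message db records the message as ham, which is the "correctly untrained"
behaviour the comment in train_message describes.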

Mark.

Mark Hammond wrote:

I think that the classes I posted a while ago suffer from the exact reverse
problem as your idea.  My idea was to make a "message store" that is largely
independent of training.  I believe the problem with your design is that it
deals with the training at the expense of the message store.

It is obvious, but worth mentioning, that there are competing interests here.
My focus is towards clients, and specifically the Outlook one (if there were
more clients I would be happy to think of them too <wink>).  A lot of the
focus of this group is towards admins rather than individuals (which is just
fine!)  But it seems the current thinking is of a corpus as being a fairly
static, well-controlled set of messages used almost purely for training
purposes.

For client programs, this may not be practical.  The corpus is a more
dynamic set of messages - and worse, actually *is* the user's set of
messages rather than a collection of message copies.

For example, "moving" a message in a corpus may actually mean moving the
message in the user's real inbox.  This may or may not be what is intended -
a corpus "move" operation is more about changing a message's classification
than it is about physically moving pieces of mail around.

> A Corpus wouldn't know how to create Message objects, nor would a Message
> object know how to create itself - classes *derived from* them would know
> how to do that.  For instance (totally untested code, probably full of
> typos) -
>
> class Message:

Jeremy and I both posted real code, so starting with something that takes
that into consideration would be good.

> I may be putting too much into the base class by demanding that the text
> of the message be given to the constructor - that precludes making
> FileMessage lazy, and only read the file when it needs to.]

It also defeats the abstract nature of the class.
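
The laziness point can be made concrete with a small Python 3 sketch.  It
assumes a base class that only promises a way to get the text, so that a
file-backed subclass can defer reading until first use.  The get_text method
and the reads counter are hypothetical names for this illustration, not the
classes from the quoted proposal.

```python
import os
import tempfile


class Message:
    """Abstract message: subclasses decide how and when text is obtained."""

    def get_text(self):
        raise NotImplementedError


class FileMessage(Message):
    """Hypothetical file-backed message that reads the file on first use."""

    def __init__(self, path):
        self.path = path
        self._text = None
        self.reads = 0            # instrumentation for this example only

    def get_text(self):
        if self._text is None:    # nothing read yet: hit the disk once
            with open(self.path, encoding="utf-8") as f:
                self._text = f.read()
            self.reads += 1
        return self._text


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "msg.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Subject: hello\n\nbody\n")

    msg = FileMessage(path)       # constructing does not touch the file
    reads_before = msg.reads
    text = msg.get_text()         # first access reads the file
    text_again = msg.get_text()   # second access uses the cached text
    reads_after = msg.reads
```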

> 'Corpus' works the same way; again, the details may be naive, but this is
> the general idea:

I'm hoping I don't sound grumpy, but again, the few systems that already
exist for this engine are the best ones to use to discover the naivety early
<wink>

> You can then envisage a MailboxCorpus, an OutlookFolderCorpus, an
> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.

I can't quite imagine that at the moment, as per my comments at the top.

Off the top of my head, I believe we need:
* An abstract "message id"
* A message classification database, as discussed before - basically just a
dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch
training.  It has no move etc operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability
and message databases.
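
The pieces in that list could be wired together roughly as follows.  This is
a minimal Python 3 sketch under stated assumptions: every class and method
name here (Message, MessageStore, Corpus, Trainer, train_corpus) is
hypothetical, the message ID is a plain string, and simple token counts stand
in for the probability database.

```python
from collections import namedtuple

# Abstract message ID: here just an opaque string carried with the text.
Message = namedtuple("Message", ["message_id", "text"])


class MessageStore:
    """Returns a message object given its ID (dict-backed in this sketch)."""

    def __init__(self, messages):
        self._by_id = {m.message_id: m for m in messages}

    def get_message(self, message_id):
        return self._by_id[message_id]


class Corpus:
    """Only an enumerator of message IDs; no move or copy operations."""

    def __init__(self, message_ids):
        self._ids = list(message_ids)

    def __iter__(self):
        return iter(self._ids)


class Trainer:
    """Training API: updates the token counts and the classification db."""

    def __init__(self):
        self.classification_db = {}               # ID -> "spam" or "ham"
        self.token_counts = {"spam": {}, "ham": {}}

    def train(self, message, label):
        previous = self.classification_db.get(message.message_id)
        if previous == label:
            return False                          # already trained this way
        tokens = message.text.split()
        if previous is not None:                  # undo the old training
            for tok in tokens:
                self.token_counts[previous][tok] -= 1
        counts = self.token_counts[label]
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
        self.classification_db[message.message_id] = label
        return True


def train_corpus(corpus, store, trainer, label):
    """Bulk training: resolve each ID via the store, then train on it."""
    return sum(trainer.train(store.get_message(mid), label) for mid in corpus)


store = MessageStore([Message("a", "cheap pills"),
                      Message("b", "cheap watches")])
trainer = Trainer()
newly_trained = train_corpus(Corpus(["a", "b"]), store, trainer, "spam")
retrained = train_corpus(Corpus(["a"]), store, trainer, "spam")
```

Note that nothing in the sketch knows about folders or moving mail; the
corpus only hands out IDs and the store only resolves them.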

At that level, we really don't need much else - no folders or any other
grouping of messages.  I'm really not too sure there is much value in adding
higher-level concepts such as folders or message store "move" operations -
certainly not at the outset, where there are too many competing
requirements.

> Yes - this could work using observer objects registered with Corpus
> objects:

This could work, but the task may be too simple for it to be necessary.  If the process of
re-training a message in the Outlook GUI becomes:

def RetrainMessageAsSpam():
    # Outlook specific code to get an ID (omitted here).
    message = message_store.GetMessage(message_id)
    if not classifier.IsSpam(message):
        classifier.train(message, is_spam=True)

And not a whole lot else, it doesn't seem worth it.  Unfortunately, the
decision to perform the retrain is the complex but client-specific part.
Is this a newly delivered message?  Did the user manually move the message
somewhere?  Did the user click one of our buttons?  Is the user deleting old
ham that we want to train on before it dies forever?

Outlook does this via examining what Outlook event we are seeing, and
looking at meta-data we possibly previously attached to the message.  I'm
not sure this can be encapsulated well at the moment without adding all our
meta-data etc baggage to the base classes.
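
For reference, the observer idea from the quoted proposal might look
something like the Python 3 sketch below.  It is only an illustration of the
pattern: the add_observer, add_message and on_add_message names are
hypothetical, and the observer merely records what it would train on rather
than calling a real classifier.

```python
class Corpus:
    """Hypothetical corpus that notifies observers when a message is added."""

    def __init__(self):
        self._messages = {}
        self._observers = []

    def add_observer(self, observer):
        self._observers.append(observer)

    def add_message(self, message_id, text):
        self._messages[message_id] = text
        for observer in self._observers:
            observer.on_add_message(self, message_id, text)


class TrainingObserver:
    """Hypothetical observer: records (ID, is_spam) for each added message."""

    def __init__(self, is_spam):
        self.is_spam = is_spam
        self.trained = []

    def on_add_message(self, corpus, message_id, text):
        # A real observer would hand the message to the trainer here.
        self.trained.append((message_id, self.is_spam))


spam_corpus = Corpus()
watcher = TrainingObserver(is_spam=True)
spam_corpus.add_observer(watcher)
spam_corpus.add_message("id-7", "buy now")
```

The sketch shows only the notification plumbing.  The client-specific
question of whether an event should trigger a retrain at all still has to be
answered before add_message is ever called.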

> Most of the *new* code that's needed is defining the abstract concepts and
> their interfaces, rather than writing code that actually *does* anything -
> it's building a framework.

*cough* ummm...  This is doomed to failure.  Code *must* do something to be
taken seriously.  At the very least, I would expect to see the existing test
driver framework running against these "abstract concepts" <wink>

> Once the framework is there, most of the code needed to implement the
> functionality should already be in the project - code to hook into
> Outlook, to train on a message, to parse mbox files, and so on.  It just
> needs hooking into the framework.

See above <wink>.

Mark.

Tim Stone wrote:

>Tim Stone - Four Stones Expressions <tim at fourstonesExpressions.com> writes:
>
>> I think that while you're at it, we should refactor the Corpus stuff,
>> so that messages and databases and training and classifying are all
>> handled in exactly one place in the system.  Richie has this idea of a
>> 'spambayes server' which is the heart and soul of the systems, and
>> that all the user facing stuff fronts.... what say you?  - TimS
>
>I do think the hammie stuff could stand to be a little more tightly
>integrated with the rest of the methods--or at least with the
>pop3proxy.  I was trying to do this with the Hammie class, and I think
>maybe my edits to mboxtrain achieve this pretty well.  But you guys are
>doing things I never dreamed of (like training via a web interface) and
>I haven't even begun to look at integrating that stuff.
>
>Hammiefilter is the simple case.  mboxtrain/hammiebulk are the difficult
>ones, as is proxytee.  So I'd be all for centralizing mailbox access and
>message stores.  What do you propose?
>
>

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org