[Spambayes] Corpus modules

Wed Nov 13 02:13:14 2002

I've been working with Richie Hindle to create modules for managing corpora 
for his pop3proxy.  There is a Corpus class, a Message class, and a 
MessageFactory class, with subclasses that add file-system persistence, 
storing messages as text or gzip files in subdirectories.  (A Corpus is 
simply a collection of messages.)  There's a Trainer class that observes 
Corpus instances and untrains/trains a bayes database as messages are moved 
between them.  I've also got a BayesHelper class that adds persistence to a 
Bayes object imported from either classifier (Bayes) or hammie 
(PersistentBayes).
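
A minimal sketch of the Corpus/Trainer arrangement described above.  It is 
not the actual module code: the method names (add_observer, take_message, 
on_add_message, on_remove_message) and the CountingBayes stand-in are 
assumptions made for illustration, and persistence is left out.

```python
class Message:
    """Wraps the full text of one message under a unique key."""
    def __init__(self, key, text):
        self.key = key
        self.text = text


class Corpus:
    """A collection of messages that tells observers about changes."""
    def __init__(self):
        self.messages = {}
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def add_message(self, message):
        self.messages[message.key] = message
        for observer in self.observers:
            observer.on_add_message(message)

    def remove_message(self, message):
        del self.messages[message.key]
        for observer in self.observers:
            observer.on_remove_message(message)

    def take_message(self, key, from_corpus):
        """Move a message into this corpus from another one."""
        message = from_corpus.messages[key]
        from_corpus.remove_message(message)
        self.add_message(message)


class CountingBayes:
    """Hypothetical stand-in for a bayes database: per-class token counts."""
    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}

    def learn(self, words, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for word in words:
            counts[word] = counts.get(word, 0) + 1

    def unlearn(self, words, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for word in words:
            counts[word] -= 1
            if counts[word] == 0:
                del counts[word]


class Trainer:
    """Observes one corpus and trains/untrains the database it was given."""
    def __init__(self, bayes, is_spam):
        self.bayes = bayes
        self.is_spam = is_spam

    def on_add_message(self, message):
        # Whitespace splitting is a placeholder for real tokenizing.
        self.bayes.learn(message.text.split(), self.is_spam)

    def on_remove_message(self, message):
        self.bayes.unlearn(message.text.split(), self.is_spam)


bayes = CountingBayes()
spam, ham = Corpus(), Corpus()
spam.add_observer(Trainer(bayes, is_spam=True))
ham.add_observer(Trainer(bayes, is_spam=False))

ham.add_message(Message("1", "cheap pills"))   # trained as ham
spam.take_message("1", ham)                    # untrained as ham, trained as spam
```

Moving a message between corpora is the only thing the caller does; the 
untrain/train pair happens as a side effect of the two notifications.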

Assuming that I can get these things checked in sometime soon, they may be 
useful outside of the pop3proxy.  I see some overlap between my Message class 
and the msgs.py module.  Also, the BayesHelper thing really doesn't belong in 
the Corpus.py module.

So there's the context of my question(s) ;)  Now for the questions.

Hammie has interesting PersistentBayes and DB_Dict classes, with some helper 
functions for bayes object creation.  It seems to me that a more cogent class 
hierarchy is called for, with Bayes being the abstract class, PersistentBayes 
being an abstract subclass, and subclasses of that for particular persistence 
mechanisms, such as PickleBayes, ZODBBayes, DBDictBayes, and so on.
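
A rough sketch of that hierarchy, assuming a load()/store() interface and a 
wordinfo dict that are invented here for illustration.  Bayes is kept as a 
plain base class holding the counting logic so the example stays short; only 
PersistentBayes is strictly abstract.  PickleBayes is the one concrete 
mechanism shown, and ZODBBayes or DBDictBayes would slot in the same way.

```python
import abc
import os
import pickle
import tempfile


class Bayes:
    """Base class: token counts held in memory (stand-in for the classifier)."""
    def __init__(self):
        self.wordinfo = {}          # word -> [spam_count, ham_count]

    def learn(self, words, is_spam):
        for word in words:
            counts = self.wordinfo.setdefault(word, [0, 0])
            counts[0 if is_spam else 1] += 1


class PersistentBayes(Bayes, abc.ABC):
    """Abstract subclass: names the persistence operations, implements none."""
    @abc.abstractmethod
    def store(self):
        """Write the database to its backing store."""

    @abc.abstractmethod
    def load(self):
        """Read the database from its backing store."""


class PickleBayes(PersistentBayes):
    """Concrete mechanism: the whole database pickled to a single file."""
    def __init__(self, path):
        super().__init__()
        self.path = path
        if os.path.exists(path):
            self.load()

    def store(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.wordinfo, f)

    def load(self):
        with open(self.path, "rb") as f:
            self.wordinfo = pickle.load(f)


path = os.path.join(tempfile.mkdtemp(), "bayes.pck")
db = PickleBayes(path)
db.learn(["cheap", "pills"], is_spam=True)
db.store()
reloaded = PickleBayes(path)        # picks up the stored counts
```

Callers that only train or classify can depend on Bayes alone, and code that 
needs to save state can depend on PersistentBayes without caring which 
mechanism sits underneath.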

It doesn't make a lot of sense to me to have the Bayes class in classifier and 
the PersistentBayes class in hammie...  It would seem much more consistent to 
me to have a Bayes.py module containing all the bayes database classes.  There 
might be a lot of momentum behind the hammie.py module, perhaps too much to 
change direction now, but the name hammie doesn't tell me much about what the 
module is really for, and when I look in it, I don't see much coherence either.

The current scheme that I have in Corpus is to have a trainer object that 
knows about its Bayes object, and trains it in response to observed message 
movement events.  This is mainly a hack.  It would be better for the bayes 
objects themselves to be the Corpus observers, doing away with this 
artificial Trainer object.
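
One way this could look, as a sketch only.  It assumes each corpus carries an 
is_spam flag and passes itself along with every notification, so that a 
single bayes object can observe several corpora and still know which way to 
train.  The class name ObservingBayes, the hook names and that flag are all 
hypothetical.

```python
class Corpus:
    """Minimal corpus: maps keys to message text and notifies observers."""
    def __init__(self, is_spam):
        self.is_spam = is_spam
        self.messages = {}
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def add_message(self, key, text):
        self.messages[key] = text
        for observer in self.observers:
            observer.on_add_message(self, text)

    def remove_message(self, key):
        text = self.messages.pop(key)
        for observer in self.observers:
            observer.on_remove_message(self, text)
        return text


class ObservingBayes:
    """Bayes stand-in that is itself the observer; no Trainer in between."""
    def __init__(self):
        self.wordinfo = {}          # word -> [spam_count, ham_count]

    def _adjust(self, text, is_spam, delta):
        # Whitespace splitting is a placeholder for real tokenizing.
        for word in text.split():
            counts = self.wordinfo.setdefault(word, [0, 0])
            counts[0 if is_spam else 1] += delta
            if counts == [0, 0]:
                del self.wordinfo[word]

    def on_add_message(self, corpus, text):
        self._adjust(text, corpus.is_spam, +1)

    def on_remove_message(self, corpus, text):
        self._adjust(text, corpus.is_spam, -1)


bayes = ObservingBayes()
spam, ham = Corpus(is_spam=True), Corpus(is_spam=False)
spam.add_observer(bayes)
ham.add_observer(bayes)

ham.add_message("1", "cheap pills")
spam.add_message("1", ham.remove_message("1"))   # reclassify as spam
```

The cost of dropping Trainer is that the bayes class now has to know the 
observer protocol, which ties the classifier a little more closely to Corpus.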

Right now, my Message objects are fairly dumb.  They simply wrap entire 
messages, which are used for training.  It seems as if the training methods on 
Bayes create objects from msgs.py which have a lot more smarts in them, like 
'gimme the headers', 'gimme the body', 'gimme a wordstream', etc.  However, my 
Message objects have some attributes that are specifically useful for the 
pop3proxy handling of incoming pop3 mail, most notably persistence.  Should 
these two classes be merged?  Could the msgs.py objects become more useful for 
the pop3proxy, or could my Message class become more broadly useful?  It seems 
that the current msgs classes are useful for test training, and deep within 
the bowels of the training algorithms, but would not be too useful for the 
pop3proxy...
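
To make the question concrete, here is a sketch of what a merged Message 
class might offer: the 'gimme' accessors on one side and simple file 
persistence on the other.  The method names and the one-file-per-key layout 
are assumptions, the regex is only a placeholder for the real tokenizer, and 
multipart messages are ignored.  Header parsing uses the standard email 
package.

```python
import email
import os
import re
import tempfile


class Message:
    """One message: raw text plus the accessors training code wants."""
    def __init__(self, key, text):
        self.key = key
        self.text = text

    def headers(self):
        """Return the headers as a list of (name, value) pairs."""
        return email.message_from_string(self.text).items()

    def body(self):
        """Return the body text; multipart handling is omitted here."""
        payload = email.message_from_string(self.text).get_payload()
        return payload if isinstance(payload, str) else ""

    def wordstream(self):
        """Return lowercase tokens from the body (placeholder tokenizer)."""
        return re.findall(r"[a-z0-9']+", self.body().lower())

    def store(self, directory):
        """Persist the raw text as a file named after the key."""
        with open(os.path.join(directory, self.key), "w") as f:
            f.write(self.text)

    @classmethod
    def load(cls, directory, key):
        """Recreate a message from its file in the given directory."""
        with open(os.path.join(directory, key)) as f:
            return cls(key, f.read())


raw = "Subject: Hello\nFrom: a@example.com\n\nCheap pills, cheap!\n"
directory = tempfile.mkdtemp()
Message("1", raw).store(directory)

msg = Message.load(directory, "1")
words = msg.wordstream()            # tokens ready to hand to training
```

With something like this, the pop3proxy could store and move messages by key 
while the training code asks the very same object for its wordstream.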

So I guess in summary, I propose that we create a Bayes.py module with guts 
from the current classifier and hammie modules, and make a Message class 
that's broadly useful, both for corpus management and for training...  It's my 
itch, so I'm willing to scratch it, but what do the rest of you think?

Musings of a latecomer to the party...

- TimS
www.fourstonesExpressions.com