[Spambayes] Corpus modules
Tim Stone - Four Stones Expressions
tim@fourstonesExpressions.com
Wed Nov 13 02:13:14 2002
I've been working with Richie Hindle to create modules that are useful for
managing corpora for his pop3proxy. There is a Corpus class, a Message class,
and a MessageFactory class, with subclasses that add persistence into a file
system as text or gzip files in subdirectories. There's a Trainer class that
observes Corpus instances and untrains/trains a bayes database as messages are
moved between them. (A corpus is defined simply as a collection of messages.)
I've also got a BayesHelper class that adds persistence to a Bayes object
imported from classifier (Bayes) or hammie (PersistentBayes).
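To make the arrangement concrete, here is a rough sketch of a Trainer observing two corpora. Every name and signature in it is made up for illustration (CountingBayes, add_observer, take_message, on_add_message, and so on); it is not the actual Corpus code, and the stand-in classifier only counts calls instead of doing any real training.

```python
class CountingBayes:
    """Hypothetical stand-in for a Bayes classifier; it only counts calls."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, text, is_spam):
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


class Message:
    """Dumb wrapper around the full text of one message."""

    def __init__(self, key, text):
        self.key = key
        self.text = text


class Corpus:
    """A collection of messages that tells its observers about changes."""

    def __init__(self):
        self.msgs = {}
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def add_message(self, msg):
        self.msgs[msg.key] = msg
        for obs in self.observers:
            obs.on_add_message(self, msg)

    def remove_message(self, msg):
        del self.msgs[msg.key]
        for obs in self.observers:
            obs.on_remove_message(self, msg)

    def take_message(self, key, from_corpus):
        """Move a message out of another corpus and into this one."""
        msg = from_corpus.msgs[key]
        from_corpus.remove_message(msg)
        self.add_message(msg)


class Trainer:
    """Observes a corpus and trains/untrains a classifier to match it."""

    def __init__(self, bayes, is_spam):
        self.bayes = bayes
        self.is_spam = is_spam

    def on_add_message(self, corpus, msg):
        self.bayes.learn(msg.text, self.is_spam)

    def on_remove_message(self, corpus, msg):
        self.bayes.unlearn(msg.text, self.is_spam)


bayes = CountingBayes()
spam, ham = Corpus(), Corpus()
spam.add_observer(Trainer(bayes, is_spam=True))
ham.add_observer(Trainer(bayes, is_spam=False))

ham.add_message(Message("m1", "Subject: cheap pills"))
spam.take_message("m1", ham)  # untrained as ham, then trained as spam
```

Moving a message is just a remove on one corpus followed by an add on the other, so the untrain/train pair falls out of the two notifications.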
Assuming that I can get these things checked in sometime soon, they may be
useful outside of the pop3proxy. I see some overlap with Messages and the
msgs.py module. Also, the BayesHelper thing really doesn't belong in the
Corpus.py module.
So there's the context of my question(s) ;) Now for the questions.
Hammie has interesting PersistentBayes and DB_Dict classes, with some helper
functions for bayes object creation. It seems to me that a more cogent class
hierarchy is called for, with Bayes being the abstract class, PersistentBayes
being an abstract subclass, and subclasses of that for particular persistence
mechanisms, like PickleBayes, ZODBBayes, DBDictBayes, etc. etc.
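Roughly what I have in mind, as a sketch only: the Bayes class here is a toy placeholder and not the real classifier, and the load/store method names and the wordinfo table layout are my assumptions. PickleBayes stands for one of several possible concrete mechanisms.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod


class Bayes:
    """Toy placeholder for the classifier; keeps per-word (spam, ham) counts."""

    def __init__(self):
        self.wordinfo = {}

    def learn(self, words, is_spam):
        for w in words:
            nspam, nham = self.wordinfo.get(w, (0, 0))
            if is_spam:
                self.wordinfo[w] = (nspam + 1, nham)
            else:
                self.wordinfo[w] = (nspam, nham + 1)


class PersistentBayes(Bayes, ABC):
    """Abstract layer: declares load/store but leaves the mechanism open."""

    @abstractmethod
    def load(self):
        ...

    @abstractmethod
    def store(self):
        ...


class PickleBayes(PersistentBayes):
    """One concrete mechanism: the whole table pickled into a single file."""

    def __init__(self, path):
        super().__init__()
        self.path = path
        if os.path.exists(path):
            self.load()

    def load(self):
        with open(self.path, "rb") as f:
            self.wordinfo = pickle.load(f)

    def store(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.wordinfo, f)


path = os.path.join(tempfile.mkdtemp(), "bayes.pck")
db = PickleBayes(path)
db.learn(["cheap", "pills"], is_spam=True)
db.store()
reloaded = PickleBayes(path)  # picks up the stored table on construction
```

A ZODBBayes or DBDictBayes would slot in beside PickleBayes and override the same two methods; callers would only ever depend on the PersistentBayes interface.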
It doesn't make a lot of sense to me to have the Bayes class in classifier and
the PersistentBayes class in hammie... It would seem much more consistent to
me to have a Bayes.py module, with all the bayes database classes. There
might be a lot of momentum behind the hammie.py module, perhaps too much to
change directions now, but hammie doesn't tell me much about what this module
is really for, and when I look in it, I don't see much coherence either.
The current scheme that I have in Corpus is to have a trainer object that
knows about its Bayes object, and trains it in response to observed message
movement events. This is mainly a hack. It would be better for these bayes
objects to be able to be the Corpus observers, and forget about this
artificial Trainer object.
Right now, my Message objects are fairly dumb. They simply wrap entire
messages, which are used for training. It seems as if the training methods on
Bayes create objects from msgs.py which have a lot more smarts in them, like
'gimme the headers', 'gimme the body', 'gimme a wordstream', etc. However, my
Message objects have some attributes that are specifically useful for the
pop3proxy handling of incoming pop3 mail, specifically persistence. Should
these two classes be merged? Could the msgs.py objects become more useful for
the pop3proxy, or could my Message class become more broadly useful? It seems
that the current msgs classes are useful for test training, and for deep
within the bowels of the training algorithms, but would not be too useful for
the pop3proxy...
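As a sketch of what a merged class might look like, assuming gzip files for persistence and the standard email package for parsing: the method names (store, load, headers, body, wordstream) are hypothetical, and the whitespace split is only a stand-in for the real tokenizing.

```python
import email
import gzip
import os
import tempfile


class Message:
    """Hypothetical merged Message: persistence plus header/body access."""

    def __init__(self, key, directory):
        self.key = key
        self.directory = directory
        self.text = ""

    def path(self):
        return os.path.join(self.directory, self.key + ".gz")

    def store(self):
        # Persist the raw message text as a gzip file.
        with gzip.open(self.path(), "wt", encoding="utf-8") as f:
            f.write(self.text)

    def load(self):
        with gzip.open(self.path(), "rt", encoding="utf-8") as f:
            self.text = f.read()

    def headers(self):
        # 'gimme the headers'
        return dict(email.message_from_string(self.text).items())

    def body(self):
        # 'gimme the body' (single-part messages only in this sketch)
        return email.message_from_string(self.text).get_payload()

    def wordstream(self):
        # 'gimme a wordstream'; a naive split, not the real tokenizer
        return self.body().split()


directory = tempfile.mkdtemp()
msg = Message("m1", directory)
msg.text = "Subject: hi\nFrom: a@example.com\n\ncheap pills now\n"
msg.store()

loaded = Message("m1", directory)
loaded.load()
```

The pop3proxy would only touch store/load, while the training code would only touch headers/body/wordstream, so one class could serve both.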
So I guess in summary, I propose that we create a Bayes.py module with guts
from the current classifier and hammie modules, and make a Message class
that's broadly useful, both for corpus management and for training... It's my
itch, so I'm willing to scratch it, but what do the rest of you think?
Musings of a latecomer to the party...
- TimS
www.fourstonesExpressions.com