FW: RE: RE: [Spambayes] Email client integration -- what's needed?

Tim Peters tim.one@comcast.net
Tue Nov 5 04:59:26 2002


[TimS]
> Certainly a training class would have helped me get my head around the
> training side of things.  It's not really a trivial abstraction...

It will be if done right <wink> -- it's making it concrete that will be
non-trivial.

> Ok, I'll take a crack at the training class, then... got some ideas,
> but could use a few suggestions on some remembering stuff... All we
> can really ever count on having is a few basic headers and the message
> body.  Somehow from whatever we have, we need to create a key that
> will be used to find a saved message.  I could hash the entire
> message, or use a checksum... ideas?

I don't think the training class should know anything concrete about msgs.
Instead it should work with opaque message objects.  Off the top of my head,
msgs should support:

+ An arbitrary but consistent total ordering (so that they're usable
  as keys in B-Tree based persistent databases), and hashability (so that
  they're usable as keys in a dict).

+ A method to return a human-comprehensible name (perhaps an access
  path relative to the client's folder hierarchy -- but the training
  class shouldn't care).  Note that if these names are required to
  be unique strings, that can be exploited to give a consistent total
  ordering, and hashability (just compare or hash the string names).

+ A method to deliver a token stream, suitable for passing to the
  classifier.  I expect it would be most convenient to make msgs
  iterable, so they can be passed directly as-is to tokenize().

The existing msgs.Msg class does part of this stuff, but is a concrete
class, and geared toward testing.

A training class needs to specify a Msg interface (protocol, abstract base
class, however you like to think of these things), and clients need to
supply classes or factory functions that implement that interface (protocol,
whatever).
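
To make the three requirements above concrete, here is a minimal sketch of
such an interface in present-day Python.  Every name in it (AbstractMsg,
name(), FolderMsg) is hypothetical, not something taken from the spambayes
code.  It assumes names are unique strings, so ordering and hashing can be
derived from them, and it splits on whitespace as a stand-in for a real
tokenizer.

```python
from abc import ABC, abstractmethod
from functools import total_ordering


@total_ordering
class AbstractMsg(ABC):
    """Hypothetical message protocol that a trainer could require."""

    @abstractmethod
    def name(self):
        """Return a unique, human-comprehensible string for this msg."""

    @abstractmethod
    def __iter__(self):
        """Yield the msg's token stream."""

    # Assumption: names are unique, so comparing and hashing names gives
    # a consistent total ordering and hashability for the msgs themselves.
    def __eq__(self, other):
        if not isinstance(other, AbstractMsg):
            return NotImplemented
        return self.name() == other.name()

    def __lt__(self, other):
        if not isinstance(other, AbstractMsg):
            return NotImplemented
        return self.name() < other.name()

    def __hash__(self):
        return hash(self.name())


class FolderMsg(AbstractMsg):
    """Toy client-side implementation: a folder path plus raw text."""

    def __init__(self, path, text):
        self._path = path
        self._text = text

    def name(self):
        return self._path

    def __iter__(self):
        # Stand-in for a real tokenizer: split the text on whitespace.
        return iter(self._text.split())
```

A client would supply something like FolderMsg; the trainer would only ever
rely on what AbstractMsg promises.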

Right?  This is just OO design:  identify the objects and actors in the
domain, and model them with classes.  The client will have to supply
concrete versions that implement the interfaces the trainer requires.  The
trick is to define the trainer in such a way that it requires exactly enough
to get its job done, and clients have to implement at least that much (but
may implement more).
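
As a rough illustration of "exactly enough", the sketch below shows a trainer
that needs only two things from a msg: that it can be used as a dict key, and
that iterating it yields tokens.  Trainer, ToyClassifier, NamedMsg and their
methods are all made up for this example; ToyClassifier just counts tokens
and is not the real classifier.

```python
class ToyClassifier:
    """Stand-in classifier (an assumption): counts tokens per class."""

    def __init__(self):
        self.counts = {True: {}, False: {}}

    def learn(self, tokens, is_spam):
        bucket = self.counts[is_spam]
        for tok in tokens:
            bucket[tok] = bucket.get(tok, 0) + 1

    def unlearn(self, tokens, is_spam):
        bucket = self.counts[is_spam]
        for tok in tokens:
            bucket[tok] -= 1
            if bucket[tok] == 0:
                del bucket[tok]


class Trainer:
    """Hypothetical trainer that treats msgs as opaque objects."""

    def __init__(self, classifier):
        self.classifier = classifier
        self.trained = {}  # msg -> is_spam; relies only on hashability

    def train(self, msg, is_spam):
        previous = self.trained.get(msg)
        if previous == is_spam:
            return  # already trained this way; don't count it twice
        if previous is not None:
            # Trained the other way before: undo that first.
            self.classifier.unlearn(msg, previous)
        self.classifier.learn(msg, is_spam)  # msg is iterated for tokens
        self.trained[msg] = is_spam


class NamedMsg:
    """Smallest msg the trainer accepts: hashable and iterable."""

    def __init__(self, name, text):
        self.name = name
        self.text = text

    def __eq__(self, other):
        return isinstance(other, NamedMsg) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

    def __iter__(self):
        return iter(self.text.split())


trainer = Trainer(ToyClassifier())
msg = NamedMsg("Inbox/1", "cheap pills")
trainer.train(msg, True)    # learned as spam
trainer.train(msg, False)   # corrected: unlearned as spam, learned as ham
```

Note the trainer never hashes or checksums message contents itself; whatever
key scheme the client picks lives inside its msg class.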



