[Spambayes] progress on POP+VM+ZODB deployment

Sat Oct 26 00:06:28 2002

[Jeremy Hylton]
> I don't know if anyone else on Earth wants to manage their mail the
> same way I do.  I've made some progress on hooking my mail up to
> spambayes, however, and wanted to report on the deployment issues.

Thanks for the report!

> I read my mail with VM, an emacs mail reader.  My mail collects on a
> couple of POP servers, and I fetch the mail directly from the POP
> servers using VM.
>
> I addressed the following issues:
>
> - Incremental training from VM folders
> - Scoring via a POP proxy
> - Management of training data using ZODB
>
> (I don't know if the last part was necessary or not, but I wanted to
> use ZODB.  I think it's simplified some things.)
>
> The runtime environment is fairly complicated.  It's got more moving
> parts than I would like, but I don't know how to eliminate any of
> them.

Check out the Outlook2000 directory -- there's already more code there than
in the tokenizer and classifier combined.  It's a remarkable and very
capable GUI, but still, email clients seem universally poorly designed for
programmability.

> It's also slower than I would like, but I haven't done enough
> profiling to really understand why.

MarkH made great progress in speeding the Outlook client via finding a way
to tell Outlook to deliver "batches" of msgs.  It's still at best twice as
slow (when bulk training or bulk classifying) as when running in "one msg
per plain text file" tests, but it's at least 30 msgs/second, and I don't
notice the speed drag at all when it's doing auto-filtering of incoming
email.  I do notice the increase in Outlook startup time, as it drags in
several pickles and lots of Python code (including mounds of the Python
win32 extensions).

Just for fun, I'd suggest training in a different way:  start with an empty
database and forget batch training!  Feed it examples from your live email.
The system does better than chance after training on one ham and one spam,
and it's fun & gratifying to see it get better in response to your training
efforts.  I've done that a few times now, and one day's worth of ham and
spam (of which I admittedly get a lot in a day -- about 100 spam) has always
been enough that it did better on its own then than my previous collection
of by-hand Outlook rules (which I eventually reduced to one, because they
made so many mistakes -- I don't use any now, except for spambayes-based
"spam" and "unsure" rules).

> There are a few open issues:
>
> - It was hard to use the classifier module with ZODB because of the
>   __slots__.

My understanding here is that this is a problem with inheritance from ZODB's
Persistent class.

>   I ended up using the WordInfo objects unchanged, and __slots__ there
>   helped minimize storage.  But I wanted to make the Bayes class
>   persistent and I couldn't do that because of the slots.   Since
>   there's only a single Bayes instance, I can't see why it needs to
>   use __slots__.

There may be more than one Bayes instance (for example, I believe Sean True
routinely uses several, faking N-way classification via chaining differently
trained binary classifiers), but the real reason I used slots here was for
their better error-detecting capabilities.  This *was* very rapidly changing
research code, and __slots__ caught errors early.  Fine by me if we nuke the Bayes
__slots__ now.
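
To make the error-detecting point concrete, here is a minimal sketch.  The
class and attribute names are illustrative stand-ins, not the actual
classifier code, and the sketch says nothing about how __slots__ interacts
with ZODB's Persistent class.

```python
class SlottedBayes:
    """Stand-in for a classifier that declares __slots__ (hypothetical)."""
    __slots__ = ("wordinfo", "nspam", "nham")

    def __init__(self):
        self.wordinfo = {}
        self.nspam = 0
        self.nham = 0


class PlainBayes:
    """Same attributes, but no __slots__, so instances get a __dict__."""

    def __init__(self):
        self.wordinfo = {}
        self.nspam = 0
        self.nham = 0


def misspell(obj):
    """Try a typo'd assignment; return True if Python rejected it."""
    try:
        obj.nspma = 1  # typo for "nspam"
    except AttributeError:
        return True
    return False


slotted_rejects = misspell(SlottedBayes())
plain_rejects = misspell(PlainBayes())
print(slotted_rejects, plain_rejects)
```

With __slots__ the misspelled name raises AttributeError at once; without
it the typo silently creates a new attribute and the bug surfaces later, if
at all.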

> - I thought it would be nice if spambayes was a package, so I could
>   separate it from my code.  It can't work as a package, though,
>   because it contains a copy of the email package.  When I turned
>   spambayes into a package, it ended up treating email as a
>   subpackage.  My apps ended up getting two copies of the email
>   package loaded -- one from the std library and one as a subpackage
>   of spambayes.  The duplication broke a bunch of isinstance() tests.

As Barry pointed out yesterday, Python 2.2.2 users don't need the duplicated
email pkg at all.  Neither do people using CVS Python.  We should nuke it.
People who want to run under 2.2.1 should then work out what they need to do
to fiddle their PYTHONPATH to get a backported copy loaded.
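
The isinstance() breakage is easy to reproduce in miniature.  The sketch
below loads one source file twice under two module names, standing in for
"email from the std library" and "email as a spambayes subpackage".  The
file name, module names, and Message class are all made up for the demo.

```python
import importlib.util
import os
import tempfile

SOURCE = "class Message:\n    pass\n"


def load_copy(path, name):
    """Execute the file at `path` as a fresh module object called `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fakeemail.py")
    with open(path, "w") as f:
        f.write(SOURCE)
    # Two loads of the same source give two distinct module objects.
    std_copy = load_copy(path, "fakeemail")
    sub_copy = load_copy(path, "spambayes_demo.fakeemail")

msg = sub_copy.Message()
same_class = isinstance(msg, std_copy.Message)
print(same_class)
```

The two Message classes have identical source but are different objects, so
an instance made by one copy fails isinstance() against the other.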

> - Configuration.  It would be nice to use the existing options
>   framework and extend it with application-specific options (like the
>   POP ports, the ZEO server location, etc.).  It isn't clear what the
>   best way to extend Options is.

Name one way, and it will automatically become "the best" <wink>.  Something
that's been a minor problem in the Outlook client:  as soon as you load any
module in the spambayes core, it imports Options.py, and sometimes makes
module compile-time decisions based on the option values then in effect.
Setting the BAYESCUSTOMIZE envar after that point has no effect, since
Options has already been loaded.
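
A toy model of that ordering problem, assuming nothing about the real
Options.py beyond "it reads BAYESCUSTOMIZE once, when first loaded".  The
get_options() function, the dict layout, and the default file name are
hypothetical; a module-level cache plays the role of the one-time import.

```python
_options = None  # filled in on first use, like a module that is imported once


def get_options(environ):
    """Read BAYESCUSTOMIZE the first time only; later calls reuse the result."""
    global _options
    if _options is None:
        _options = {
            # Hypothetical default, used when the envar is not set.
            "ini_file": environ.get("BAYESCUSTOMIZE", "bayescustomize.ini"),
        }
    return _options


env = {}  # stand-in for os.environ
first = get_options(env)["ini_file"]      # "core module" loads options
env["BAYESCUSTOMIZE"] = "outlook.ini"     # application sets the envar too late
second = get_options(env)["ini_file"]     # still the value seen at first load
print(first, second)
```

Whatever extension mechanism is chosen, the application's settings have to
be in place before the first core module is imported.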

> The different components involved in the setup are:
>
> - A ZEO server managing a ZODB database.

And you marvel at how many moving parts you've got <wink>?

>   I have a long-running ZEO server process.  By using ZEO, multiple
>   clients can access the database at the same time.  Clients connect
>   to the server using a Unix domain socket.

YAGNI for *you*, right?

> - A persistent mail profile based on VM folders.
>
>   The profile is stored in the database.  A VM folder is just a Unix
>   mailbox.  A config file contains a list of folders that contain ham
>   and a list of folders that contain spam.  The profile manages these
>   folders and a spambayes classifier.

Mark added another database to the Outlook client:  a mapping from (Outlook)
message id to whether it's been trained on as ham or spam (and a message id
is absent if neither).  So far this has at least two good effects:  (1) if a
mistake is moved from one flavor of training folder to another, the system
automatically knows to untrain it from the wrong flavor; (2) folder-based
training is much faster now, as it doesn't even bother to fetch msgs it
already trained on.
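
A rough sketch of such a map, not the Outlook client's actual code.  The
TrainingLog and CountingClassifier names are invented here, and the
classifier is a stand-in that only counts calls.

```python
class CountingClassifier:
    """Stand-in classifier: tracks how many hams and spams it has seen."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0

    def learn(self, tokens, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


class TrainingLog:
    """Maps message id -> True (trained as spam) or False (trained as ham)."""

    def __init__(self, classifier):
        self.classifier = classifier
        self.trained_as = {}  # an absent id means "never trained"

    def train(self, msg_id, tokens, is_spam):
        """Train once per message; return False if nothing needed doing."""
        previous = self.trained_as.get(msg_id)
        if previous is not None and previous == is_spam:
            return False  # already trained this way: skip the work
        if previous is not None:
            # Message moved between training folders: undo the old flavor.
            self.classifier.unlearn(tokens, previous)
        self.classifier.learn(tokens, is_spam)
        self.trained_as[msg_id] = is_spam
        return True


log = TrainingLog(CountingClassifier())
did_first = log.train("id-1", ["cheap", "pills"], True)
did_repeat = log.train("id-1", ["cheap", "pills"], True)
did_move = log.train("id-1", ["cheap", "pills"], False)
print(did_first, did_repeat, did_move, log.classifier.nham, log.classifier.nspam)
```

The repeat call is skipped without touching the classifier, and the move
from spam to ham untrains the wrong flavor before training the right one.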

> - A training program, update.py.
>
>   The training program scans the folders listed in the profile.  When
>   it finds new messages, it learns from them.  When it finds that a
>   message was deleted, it unlearns it.

I don't think you'll want that over time:  If a msg has been deleted, fine,
it's gone but still trained.  Right now I'm carrying around many megabytes
of useless spam in my Outlook store, and that has lots of bad effects:
longer backup times, much longer scanpst times (the Outlook "inbox repair
tool"), and very much longer times to transfer my msg store between laptop
and desktop.  Only a researcher wants to carry dead spam around forever.

>   This process is incremental,  but it depends on the mailbox module
>   to parse the folders.  The parsing is definitely slow -- especially
>   for large folders.

Perhaps your moral equivalent to the Outlook client's msgid -> training
status map would be a jeremy_msg_id -> seek offset map, along with a
highwater mark offset to distinguish old from new msgs.
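
One way the highwater idea might look, as a sketch only.  It assumes the
folder is a plain Unix mailbox that is only ever appended to, that the
saved offset falls on a message boundary, and that a line starting with
"From " begins a message.  The scan_new() name is made up.

```python
import os
import tempfile


def scan_new(path, highwater):
    """Return ([(offset, raw_message), ...], new_highwater) past `highwater`."""
    found = []
    start = None   # byte offset where the current message begins
    lines = []
    pos = highwater
    with open(path, "rb") as f:
        f.seek(highwater)  # skip everything already seen
        for line in f:
            if line.startswith(b"From "):
                if start is not None:
                    found.append((start, b"".join(lines)))
                start, lines = pos, []
            if start is not None:
                lines.append(line)
            pos += len(line)
    if start is not None:
        found.append((start, b"".join(lines)))
    return found, pos


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "folder.mbox")
    with open(path, "wb") as f:
        f.write(b"From a@example.com\nSubject: one\n\nbody\n\n")
        f.write(b"From b@example.com\nSubject: two\n\nbody\n\n")
    first, mark = scan_new(path, 0)        # initial full scan
    with open(path, "ab") as f:
        f.write(b"From c@example.com\nSubject: three\n\nbody\n\n")
    second, mark2 = scan_new(path, mark)   # reads only the appended message

print(len(first), len(second))
```

The offsets returned with each message could feed the msg id -> seek offset
map; detecting deletions or rewrites of the folder is not handled here.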

> - A POP3 proxy
>
>   I wrote my own proxy based on SocketServer.ThreadingTCPServer.  I
>   don't like the asynchat style of programming, and I was having
>   trouble integrating pop3proxy with ZEO.  They both use asyncore, but
>   the way they use it seemed to be causing deadlocks :-(.

That's unheard of in ZEO <wink>.

>   The proxy uses the same strategy as pop3proxy, intercepting messages and
>   adding a spam score header.  I add a header like this:
>
>      From: Martijn Pieters <mj@zope.com>
>      To: <geeks@zope.com> (Zope.Com Geeks)
>      Cc: sa@zope.com
>      Subject: [Zope.Com Geeks] Zope.org storage server was down..
>      Date: Fri, 25 Oct 2002 17:10:42 -0400
>      X-Spambayes: 0.001
>
>   The proxy doesn't do anything other than add the header.

Is your ultimate email reader programmable enough to "do something" with
this?  One prediction I made for myself turned out to be just right:  moving
things automagically into Probable-Spam and Unsure folders is exactly what I
wanted and turns out to be exactly what I still want.  Works great.
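
The header-adding step itself is small.  A hedged sketch follows: this is
not Jeremy's proxy or pop3proxy code, and add_score_header() is a made-up
name.  It appends the score as the last header of a raw message, whatever
line ending the message uses.

```python
def add_score_header(raw, score):
    """Return `raw` (bytes) with an X-Spambayes header added last."""
    header = ("X-Spambayes: %.3f" % score).encode("ascii")
    for eol in (b"\r\n", b"\n"):
        # The first blank line separates the headers from the body.
        head, sep, body = raw.partition(eol + eol)
        if sep:
            return head + eol + header + sep + body
    # No blank line found: treat the whole thing as headers.
    return raw.rstrip(b"\r\n") + b"\r\n" + header + b"\r\n"


raw = b"Subject: hi\r\nDate: Fri, 25 Oct 2002 17:10:42 -0400\r\n\r\nbody\r\n"
scored = add_score_header(raw, 0.001)
print(scored.decode("ascii"))
```

A real proxy would also have to deal with POP3 framing (byte-stuffed lines
and the terminating "." line), which this sketch leaves out.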

> - A set of VM filters and tools for handling spam and training.
>
>   I wrote some little elisp functions.  One saves a message to the
>   spam training folder and deletes it.  Another saves a message to
>   the ham training folder, but does not delete it.  A third pipes it
>   to a small Python script that prints out the evidence for a message.
>
>   The next step is to add autofoldering rules that file spam above a
>   certain threshold to the spam folder and messages in the middle to
>   an unsure folder.  That's a standard VM thing, but I haven't done it
>   yet.

Thank you for answering my questions so quickly.
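
For the record, the threshold rule described above amounts to something
like the following, written in Python rather than elisp.  The cutoff values
and folder names are assumptions chosen for illustration, and
choose_folder() is a hypothetical helper.

```python
from email import message_from_string


def choose_folder(raw_text, ham_cutoff=0.20, spam_cutoff=0.90):
    """Pick a folder name from the X-Spambayes header of a message."""
    msg = message_from_string(raw_text)
    value = msg.get("X-Spambayes")
    if value is None:
        return "inbox"       # never scored: leave it alone
    try:
        score = float(value)
    except ValueError:
        return "unsure"      # unreadable score: let a human look
    if score >= spam_cutoff:
        return "spam"
    if score > ham_cutoff:
        return "unsure"      # the middle ground
    return "inbox"


spam_folder = choose_folder("Subject: x\nX-Spambayes: 0.95\n\nbody\n")
unsure_folder = choose_folder("Subject: x\nX-Spambayes: 0.5\n\nbody\n")
ham_folder = choose_folder("Subject: x\nX-Spambayes: 0.001\n\nbody\n")
print(spam_folder, unsure_folder, ham_folder)
```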

> The total code base is about 2000 lines of code, half of it in the POP
> proxy.  I'd be happy to check it in to the spambayes project if anyone
> else wants to try to use parts of it.

I'm bothered that you had no luck with the POP3 proxy already checked in.
Who's using that, and why didn't it work for Jeremy?