[Spambayes] Training a procmail filter for a Cyrus IMAP server

Wed Feb 4 19:59:17 EST 2004

> I've actually taken the first step -- I've configured the
> imapfilter and started training it.  It ran for 5.5 hours
> last night before it hit a problem with bogus date headers.

The next release should handle these sorts of things much more gracefully.
If there are then still any issues like this they should get reported and
fixed reasonably quickly.

> One thing for people to note.  This is probably obvious to
> the aficionados, but it wasn't obvious to me:  the trainer
> adds a line to the mail message headers (even though training
> seems like a read-only operation).  The effect of this is that
> my mail clients discovered that their caches of the message
> headers were now stale.  This wasn't a big deal here at work
> with the multi-megabit network connection, but, at home, with
> my soda-straw dial-up, this was a bit painful. 

The reason for the header is to keep track of which messages have been
processed.  Otherwise, when you restarted the training, it would train
again on all the messages it had already done.

There are better ways to do this (and even ones that would make the code
simpler), but this was the decision (mine, I admit) at the time of writing.
Unfortunately, no-one's really interested in progressing the imapfilter code
(various people are willing to bugfix it, but that's about it), so it's
likely to stay that way for a while at least.

If anyone is interested in patching it up, IIRC, the suggested solution is
to base the id on the message's Message-ID header instead of creating a new
one (IMAP ids aren't reliable enough).  If the message doesn't have a
Message-ID, then it can have one added to it (so at least some messages
won't be changed, even if some are).  This involves a fair bit of coding,
though, as
well as lots of testing.  I'm happy to do some testing, but I don't have
time for the coding at the moment, sorry.
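
To make the idea concrete, here is a rough sketch of the Message-ID approach.
This is not the imapfilter code: the function name stable_id and the
placeholder domain are made up for illustration, and it only uses the
standard library's email package.

```python
import email
from email.utils import make_msgid


def stable_id(raw_message):
    """Return (message_id, message, changed) for a raw message string.

    Hypothetical helper: reuse the existing Message-ID header where
    there is one, and only modify the message when it has none.
    """
    msg = email.message_from_string(raw_message)
    existing = (msg.get("Message-ID") or "").strip()
    if existing:
        # The message is left untouched, so client caches stay valid.
        return existing, msg, False
    # No usable Message-ID: drop any empty header and add a fresh one.
    # The domain is a placeholder assumption, not a SpamBayes setting.
    del msg["Message-ID"]
    new_id = make_msgid(domain="example.invalid")
    msg["Message-ID"] = new_id
    return new_id, msg, True
```

With something like this, only messages that lack a Message-ID would need to
be rewritten on the server; the rest could be tracked by the id they already
carry.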

> One other question, while I'm here.  What's the deal with using
> a database vs. a pickle?  I understand that the former is
> supposed to be faster for a single message lookup, and the latter
> is better for bulk training.  But, I presume that what I want
> (once I'm done training) is a database.  How do I convert the
> pickle into a database?

There's a script called sb_dbexpimp.py in the scripts directory.  This will
convert a database, whatever form it's in, into any other form, including a
flat-text format.
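
For anyone curious about the pickle vs. database difference itself, here is
a generic illustration using only the standard library.  It is not how
SpamBayes stores its data and not what sb_dbexpimp.py does: the
{token: (spam_count, ham_count)} layout and the function name
pickle_to_shelf are assumptions made for the example.

```python
import pickle
import shelve


def pickle_to_shelf(pickle_path, shelf_path):
    """Copy a pickled {token: (spam_count, ham_count)} dict into a shelf.

    Hypothetical layout for illustration only; SpamBayes's real storage
    classes differ.
    """
    with open(pickle_path, "rb") as f:
        counts = pickle.load(f)  # the whole dictionary is read into memory
    # flag="n" always creates a new, empty database at shelf_path.
    with shelve.open(shelf_path, flag="n") as db:
        for token, pair in counts.items():
            db[token] = pair  # each token becomes its own on-disk record
    return len(counts)
```

The pickle has to be loaded in full before anything can be looked up, which
suits bulk training; the shelf lets a filter fetch just the tokens in one
message.  For real SpamBayes data, use sb_dbexpimp.py.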

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.