[Spambayes] I am the author of my own undoing.

Webb Scales scales at zko.dec.com
Wed Mar 24 17:17:06 EST 2004


Tony Meyer wrote:

> Is there a reason that you don't want to use sb_imapfilter for the training
> as well as the classification?  I saw something in another message that said
> something about downloading the message twice, but I wasn't sure what you
> meant exactly.  Do you already have the message downloaded in some non-mbox
> form?  If so, maybe sb_mboxtrain can be manipulated into working with it.
> If not, then maybe we can adapt sb_imapfilter to do the job?

The situation is that I work from home in the mornings, using a Macintosh
running Netscape accessing my mail server inside my company's firewall through
a VPN, and I do this over a dial-up.  So, my network bandwidth is limited in
various ways and is inconsistent as well.

I prefer having my mail on an IMAP server, so that the stuff that I read in the
morning is still available to me in the afternoon when I'm in the office.

Now, in general, the systems at work are centrally administered, so I don't
really have root access.  (Or...if I reconfigured something and anybody
noticed, it would be a Bad Thing(tm).)  So, the mail server that is available
to me is a Cyrus IMAP server, which means that I have no access to the message
folders, except via the IMAP protocol (which is fine, even good, in several
senses).  But, it turns out that the mail server is set up so that it uses
procmail to deliver incoming mail to the IMAP server, so I do have access to
the messages for a brief shining moment before they are placed into the vault.
So, this is where I put SpamBayes.

This configuration has several advantages.  First, SB is invoked only when
needed (i.e., when a message comes in); it doesn't have to be started, always
runs when needed, and the message is handled on the machine where it is kept.
Now, I could use sb_imapfilter, but the configuration issues make it seem
inelegant in comparison.  I would have to figure out which machine to run it on
(there are many options with the mail server being the obvious choice, but
there is no way for me to auto-start it there, other than with a polling cron
job).  And, the message would have to be copied across from the mail server to
the server running sb_imapfilter and back (which would be OK if I picked an
office machine, but it would be basically awful if I tried to run it on my home
machine).

The best part about using the procmail filter is that my home mail client is
completely oblivious to arriving spam.  The spam is placed in a spam folder which
the client doesn't touch unless I specifically ask it to.  If I were running
sb_imapfilter, then there would be a competition between SB and my client as to
who sees the spam first, and my client would frequently end up downloading it
(at least the header) when it noticed it as "new mail".

So, I'm pretty much in love with the procmail filter.  (And, I'm getting ready
to have it just delete spam on contact, so that I *never* see it!  I'm really
liking SB.)

The obvious problem is, how do I train?  There are at least three answers.

  1. Manually export messages from my mail client to an mbox.  This is awkward,
     because I don't have direct access to the file server (where the mbox
     would live) from my home machine.  And, in any case, my mail client would
     download the message which I would then have to upload (somehow) to the
     mbox, which would be two trips for the message across the slow dial-up.  I
     could use the web-based training, but that's still two trips across the
     wire.
  2. Have the procmail filter set up an mbox with candidates.  The problem with
     this is, I then have to figure out how to access this mbox (which is
     somewhat more complicated than it sounds) so that I can vet its contents.
     And, there's also the problem of how to add false negatives to it (e.g., I
     have no easy access to it from my home mail client).
  3. Use my mail client to put the training messages into a training folder on
     the IMAP server, and then use sb_imapfilter to train them.  (I still have
     to download the message in order to read it, but this saves me from having
     to upload it again, and the operation is trivial to perform -- e.g., it
     doesn't require another mail program to manipulate an mbox or a
     cut-and-paste into a web browser or anything.)

But, this leaves me using two independent SB tools, which end up operating on
the same database without synchronization...hence my problem.


> > If not, is it reasonable for me to request that SpamBayes
> > synchronize access to the database files, to prevent the
> > sort of corruption that I caused?
>
> The policy (if you can call it that <wink>) so far has been that each of the
> scripts needs to ensure that they are well behaved in this manner, but it's
> up to the user to manage any multiple script difficulties.  Usually it's
> only Linux people using procmail or something like that that are in this
> situation anyway, and they're happy putting in whatever lock stuff goes
> there (I know nothing).  If you can come up with a scenario where SpamBayes
> really needs to be the one doing it, and no-one here can find a way around
> the problem, then yes, it is reasonable.  It may be reasonable even without
> that if you have a patch <0.5 wink>.

*heh*  Repeat after me, "I am unique...." :-)  Remi gave me the suggestion
that, since the procmail setup is already using a lockfile to prevent conflicts
with other invocations of procmail, I should wrap my sb_imapfilter invocation
in some code which also uses the same lockfile.  That's a fine suggestion, as
far as it goes, but I don't really know how procmail lockfiles work, and I
don't really know how to test whatever I might write (and, it would require me
to do my training on/from the mail server).
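
For illustration, Remi's suggestion could be sketched roughly as below.  This
is only a minimal sketch under stated assumptions, not a tested recipe: it
assumes the lockfile named in the procmail recipe can be honored by exclusive
file creation, that the wrapper runs on a machine that sees the same
filesystem as procmail, and that the filesystem makes O_EXCL creation atomic
(which may not hold on some NFS setups; the `lockfile` utility shipped with
procmail is likely the more robust choice there).  The function name, the lock
path, and the command passed in are all hypothetical.

```python
import os
import subprocess
import sys
import time


def run_with_lockfile(lock_path, command, retries=30, delay=8.0):
    """Run `command` while holding an exclusive lockfile at `lock_path`.

    Hypothetical helper: the lock is taken by creating the file with
    O_CREAT | O_EXCL, which fails if the file already exists (for example
    because a procmail delivery currently holds it).
    """
    for _ in range(retries):
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            # Someone else holds the lock; wait and try again.
            time.sleep(delay)
            continue
        os.close(fd)
        try:
            # Run the training command only while we own the lockfile.
            return subprocess.call(command)
        finally:
            # Release the lock even if the command fails.
            os.remove(lock_path)
    raise TimeoutError("could not acquire %s after %d attempts"
                       % (lock_path, retries))
```

It would be called with the same lockfile path that the procmail recipe
names, plus whatever command line does the training, e.g.
run_with_lockfile("/path/to/spambayes.lock", [sys.executable, "sb_imapfilter.py", ...])
where both the path and the arguments are placeholders.  Note that a lockfile
left behind by a crashed process is never cleaned up by this sketch.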

As far as coming up with a scenario goes, I've outlined my situation above.  If
you think that that is sufficient to warrant a feature-request, I'm happy to
put one in.  The obvious workaround is to wrap the SB stuff in something that
provides the synchronization, but I would want/expect/hope-for that to be
built-in (but, yeah, I'm biased ;-).

As for a patch, I assume I could do that.  I'm supposed to be some kind of
high-powered software engineer, but I don't seem to have time to do anything
more than send and receive e-mail.... ;-)   [Um, how steep is the learning
curve on Python?  (I sure hope someone will vet any code I supply!! 8-) ]


            Thanks,

                Webb


--
------------------------------------------------------------------------
Webb Scales                                Hewlett-Packard Company
scales at zko.dec.com                      110 Spit Brook Rd, ZKO2-3/N30
Voice: 603.884.2196, FAX: 603.884.0120     Nashua, NH 03062-2711
           Rule #12: Be joyful - seek the joy of being alive.
------------------------------------------------------------------------

