[Spambayes] IMAP filtering

David Lang david.lang at digitalinsight.com
Fri Apr 16 19:18:22 EDT 2004


On Sat, 17 Apr 2004, Tony Meyer wrote:

> Date: Sat, 17 Apr 2004 10:47:23 +1200
> From: Tony Meyer <tameyer at ihug.co.nz>
> To: 'David Lang' <david.lang at digitalinsight.com>, spambayes at python.org
> Subject: RE: [Spambayes] IMAP filtering
>
> > 1. due to some unknown configuration bug in the Exchange
> > server, attachments from other Exchange users cannot be read
> > via IMAP or POP3 (attachments sent via SMTP can be read), so
> > deleting and re-posting these messages
> > would have the effect of stripping the attachments from them
>
> Note that if a message is classified as unsure/spam, then it also gets
> recreated (IMAP doesn't provide any means to move a message), so you'd lose
> this information then.  I doubt this would matter for true spam, but for
> false positives or ham unsures, this could be problematic (although note
> that the original isn't removed, just flagged for deletion).

another problem I am currently having is that once in a while a store to
IMAP locks up and never completes.

IMAP doesn't have a move, but it does have a copy command
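
A minimal sketch of the copy-based alternative, assuming Python's standard imaplib and an already-connected object with a folder selected. `move_by_copy` is a hypothetical helper, not part of the SpamBayes filter, and whether a server-side COPY keeps the Exchange attachments intact is an untested assumption:

```python
def move_by_copy(conn, uid, dest_folder):
    """Emulate a move: UID COPY to dest_folder, then flag the original.

    conn is assumed to be an imaplib.IMAP4 (or compatible) object with
    the source folder already selected. The original is only flagged
    \\Deleted; it is not expunged here.
    """
    typ, _ = conn.uid("COPY", uid, dest_folder)
    if typ != "OK":
        # Leave the original untouched if the copy failed.
        return False
    typ, _ = conn.uid("STORE", uid, "+FLAGS", r"(\Deleted)")
    return typ == "OK"
```

Because the server does the copy, the message is never downloaded and re-uploaded by the client. Folder names containing spaces would need quoting before being passed in.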

> > 2. I am extremely nervous about deleting, modifying, and re-posting
> > messages that Exchange uses for special purposes (calendar scheduling
> > messages are a prime example). While they show up as mail
> > messages, they really are slightly different
>
> With your setup, do these appear in the same folder as mail messages?  Here,
> for example, all my Exchange folders have either mail *or* non-mail, and so
> I'd just filter those containing mail.  If they're scattered through the
> same folders, though, then this could be a problem.

they all appear in my inbox

> Actually, even without modifying the messages, this would seem to pose a
> problem, because the filter will try and classify these messages.  I have no
> idea how a scheduling message would be classified, but it's possible that it
> would be non-ham and end up moving to unsure/spam, which is probably not a
> good thing.  You might have to add some sort of code that identifies these
> messages and skips them.

I was figuring that they would end up getting trained into being ham
(probably by noticing some of the oddball stuff that would cause us
problems with moving them)

> > the fix that I am thinking of to resolve this would be to
> > change how the IMAP filter tracks the messages it has processed.
>
> This is certainly something that can/should be done at some point.
> Unfortunately, while the IMAP filter is used by a number of people, there
> isn't currently anyone who is taking a proactive role in developing it.  In
> fact, there never has been - Tim Stone & I initially wrote it to alleviate
> the frequent requests for such a filter.  I'm happy to maintain it (i.e. fix
> bugs and do simple improvements), but since I don't actually use it for
> day-to-day mail, I just can't find the time to take on a more active role
> with it.  I'd certainly be happy to pass the torch on to someone else, but
> no-one has stepped forward so far.
>
> The result is that non-simple changes are unlikely to occur unless we get
> patches (as I'm hoping you'll offer), and, especially important with IMAP,
> people testing the changes.  I'd also want to hold off checking any patches
> in until after 1.0 is out, since the current system is working reasonably
> well (but 1.0 shouldn't be too far off now that we finally have a beta out).
>
> > Instead of modifying the message itself, if the filter tracked the
> > highest message number that it has processed, it could process only
> > messages newer than that (the IMAP message ID is supposed to grow
> > larger with time).
>
> This isn't the ideal system, though.  The IMAP spec doesn't guarantee that
> the UID will continue to grow larger with time.  At pretty much any point,
> the server can decide to change the UID to anything it likes, as long as it
> changes the folder's id at the same time.  This could be solved by using
> some sort of combination of tracking the folder id and UID, but the folder
> id isn't guaranteed to behave in any reliable fashion, either.  AFAICT (and
> I and other people have gone through the RFC many times) there really isn't
> any way to get IMAP to produce a unique, constant, id for each message.
>
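A sketch of the "folder id plus UID" combination discussed above, assuming the folder id is the IMAP UIDVALIDITY value. `UidTracker` and its methods are hypothetical names. The sketch also shows the weakness: when the server changes UIDVALIDITY, the tracker has to forget everything and treat the whole folder as new.

```python
class UidTracker:
    """Remember the highest processed UID per folder, keyed by UIDVALIDITY."""

    def __init__(self):
        # folder name -> (uidvalidity, highest processed uid)
        self._state = {}

    def _highest(self, folder, uidvalidity):
        known_validity, highest = self._state.get(folder, (None, 0))
        # A changed UIDVALIDITY means the old UIDs are meaningless.
        return highest if known_validity == uidvalidity else 0

    def unseen(self, folder, uidvalidity, uids):
        """Return the UIDs in uids that have not been processed yet."""
        highest = self._highest(folder, uidvalidity)
        return sorted(u for u in uids if u > highest)

    def mark_processed(self, folder, uidvalidity, uid):
        """Record that uid (and everything below it) has been handled."""
        highest = self._highest(folder, uidvalidity)
        self._state[folder] = (uidvalidity, max(highest, uid))


tracker = UidTracker()
tracker.mark_processed("INBOX", 1, 3)
print(tracker.unseen("INBOX", 1, [2, 3, 4, 5]))  # same validity
print(tracker.unseen("INBOX", 2, [1, 2]))        # validity changed
```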
> Of course, any given IMAP server may actually do this, and many do.  But
> some don't, and the idea with the filter is to support as many flavours of
> IMAP as possible, which means that this isn't the way to go.  A similar
> method is to store a custom flag with the appropriate information (this is
> really the ideal way to go), except that not all servers support custom
> flags.  For you, I suspect that Exchange does support custom flags
> (instinct, not knowledge), so this might be a way for you to go.
>
> From past discussion, the best scheme that I've seen so far is:
>
>   1.  If the message has a Message-Id header, then use that as the id for
> the message.  This should be unique, will certainly be constant, and simple
> checks indicate that it's present in most messages.
>
>   1(a).  However, from other work with Exchange, messages from other
> Exchange users may very well *not* have a Message-ID header if they're still
> sitting on the server; I'm not sure - all the Exchange work I've done has
> involved an Outlook client.  If they don't, then they might have some sort
> of Exchange id that would work just as well; it'd be easy enough to check.
>
>   2.  Otherwise, get a checksum for the message (using one of the routines
> in the standard library) and use that as the id for the message.  This is
> most likely to be unique (especially if you include the headers, although
> you could have duplicates), and should be constant (because IMAP doesn't
> allow message text/headers to be changed).
>
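The scheme above (Message-Id first, checksum as the fallback) can be sketched as follows. `message_key` is a hypothetical name, and SHA-256 from hashlib is just one choice of "routine in the standard library"; the post does not name one. The Exchange-specific id from step 1(a) is left out because its header name is not given.

```python
import hashlib
from email import message_from_bytes


def message_key(raw_message):
    """Return a stable identifier for a raw RFC 822 message (bytes).

    Prefer the Message-Id header; otherwise hash the whole message,
    headers included, so near-duplicates are less likely to collide.
    """
    msg = message_from_bytes(raw_message)
    # str() guards against header objects returned for odd encodings.
    msg_id = str(msg.get("Message-Id", "")).strip()
    if msg_id:
        return "id:" + msg_id
    return "sha256:" + hashlib.sha256(raw_message).hexdigest()


with_id = b"Message-Id: <abc@example.com>\r\nSubject: hi\r\n\r\nbody\r\n"
without_id = b"Subject: hi\r\n\r\nbody\r\n"
print(message_key(with_id))
print(message_key(without_id))
```

The prefixes keep the two kinds of key from ever being confused with each other in whatever store records the processed messages.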
> If you are interested in doing this, it might be worth reading through the
> messages in the spambayes-dev archives that discuss this.  Googling for
> "site:mail.python.org spambayes-dev imap" will get them (there aren't a lot
> of spambayes-dev messages about the imap filter, so there shouldn't be much
> else).  We'd certainly be interested in a patch.

thanks for the info, I haven't done much with python yet so I don't know
how soon I could do anything, but my need has reached critical status
(~500 spams/day in addition to ~500 legit mails/day) so I need to take the
time to do _something_ and there aren't many people supporting IMAP at
all.

> > As an additional optimization, instead of running every x min as it
> > currently does, the filter could register itself with the server for
> > specific mailboxes and have the server notify it when new
> > mail has arrived and process it immediately (this also can produce
> > less server load and network traffic than frequent polling for new
> > messages, a win for both load and responsiveness)
>
> I presume this is something that Exchange lets you do?  AFAIK this isn't
> something that regular IMAP4 can do, otherwise this would indeed be a better
> way to do it.  If a patch to allow this didn't require too much refactoring
> of the code, I wouldn't have a problem with including this as an option, for
> those people in your situation.  Unless this is something that strict IMAP4
> can handle I wouldn't want it in the main distribution under other
> conditions, though.  In any case, there's certainly no reason why you
> couldn't run a version patched like this yourself.

hmm, I learned of this while working with Cyrus and I thought it was a
standard part of the spec. OK, looking through the RFC, it looks like not
all servers send this information without prompting.
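
The server-push behaviour described here most likely corresponds to the IDLE extension (RFC 2177), which is optional and advertised in the server's CAPABILITY response. A minimal sketch of the check a filter could make before relying on it, assuming the capability names arrive as strings (as in the `capabilities` attribute of Python 3's imaplib); `supports_idle` is a hypothetical helper:

```python
def supports_idle(capabilities):
    """Return True if the advertised capabilities include IDLE.

    capabilities is assumed to be an iterable of strings, e.g. the
    capabilities attribute of a connected imaplib.IMAP4 object.
    """
    return any(cap.upper() == "IDLE" for cap in capabilities)


print(supports_idle(("IMAP4REV1", "IDLE", "NAMESPACE")))
print(supports_idle(("IMAP4REV1",)))
```

A filter could fall back to the existing timed polling whenever this returns False, so servers without the extension keep working.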

David Lang


-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


