[spambayes-dev] fetchmail replacement

Richie Hindle richie at entrian.com
Mon Jul 28 19:47:41 EDT 2003


[Paul]
> I was after some tips on how I might go about creating a bayesian fetchmail
> replacement.  If it's too hard I would settle for the pop proxy but I want a
> replacement for fetchmail anyway.  My experience with Python is not
> extensive (enough to be dangerous as they say).

Beware that although the pop3proxy will work with fetchmail, there's a
known bug that means you can't use the web interface to train on messages
received via fetchmail.  It'll be fixed in the next day or two (or maybe
tonight if the baby stays asleep 8-)

> I did something similar in Java and just spawned a thread for each pop box I
> wanted to retrieve from (my modest Celeron firewall nearly chokes on this
> long running java process though).  I could presumably use poplib in a
> similar mode but I noticed the pop3proxy appears to use a different
> technique for parallel sessions.
> 
> Any pointers would be appreciated.

Here's an explanation of why the pop3proxy uses async (the "different
technique" you mention) rather than threads.  Forgive any non-sequiturs,
because I've copied it verbatim from an old email:

------------------------------------------------------------------------

 o When we were using pickles, things would certainly explode if two
   threads tried to access the disk pickle at the same time.  (In the
   early days, the POP3 proxy would save the pickle after every training
   operation, so the chances of a conflict were high.)  Now that we're
   focussing on bsddb, that's less of an issue (allegedly - I'd like to
   see definitive proof that bsddb is threadsafe).  [Update: we've
   recently proved that bsddb is *not* threadsafe, at least on Windows]

 o "But you can use thread synchronisation to prevent that" I hear you
   say.  Yes, that's true, but there's an important difference between
   solving the problems of resource contention using async and using
   thread synchronisation: an async program either works or it doesn't -
   it can *never* have contention bugs because it's single-threaded.
   Whereas, a threaded program with a bug in its synchronisation code can
   seem to work but occasionally blow up (or worse, quietly corrupt data)
   with no obvious clue as to why.  You know where you are with async
   (scratching your head, mostly! 8-)

 o async scales better than threads - people are talking about using
   Spambayes as the basis for enterprise-wide filtering, and that could
   mean a lot of simultaneous users.

 o It's easier to debug async code than threaded code, once you understand
   async.  Python debuggers don't tend to handle multithreaded programs
   very well (you effectively end up running one independent debugger per
   thread).
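
A minimal sketch of the second point, using the standard asyncio module as a
stand-in (this is not the pop3proxy's own code, and `word_counts` and `train`
are hypothetical names).  Fifty simulated sessions update a shared dictionary
with no locks.  A task can only be interrupted where it explicitly yields, so
the read-modify-write step can never be interleaved with another session:

```python
import asyncio

# Hypothetical stand-in for the shared training database (not Spambayes code).
word_counts = {}


async def train(words):
    """Simulate one session training on a message, one word at a time."""
    for word in words:
        # Read-modify-write with no await in between: no other task can run
        # during this statement, so no lock is needed.
        word_counts[word] = word_counts.get(word, 0) + 1
        # Yield to the event loop, as a real handler would while waiting on I/O.
        await asyncio.sleep(0)


async def main():
    # Fifty "simultaneous" sessions, all sharing word_counts.
    sessions = [train(["cheap", "offer", "meeting"]) for _ in range(50)]
    await asyncio.gather(*sessions)


asyncio.run(main())
print(word_counts)
```

The equivalent threaded version would need a lock around the update, and
would appear to work most of the time even if the lock were forgotten.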

The downside is that understanding async requires an "aha!" moment that
takes a long time to come.  I don't have the time to explain all the
details, but there's a sentence from the documentation for Twisted (a
Python framework that does the same job as async and more) that sums it up
nicely.  It says something like this: "Don't think of your code as the
program, calling Twisted as a library.  Think of Twisted as the program,
and your code as the library."  In other words, the framework is in
charge, and calls into your code when something happens.  Your code
doesn't determine the sequence of events, it just *reacts to* events.
When you start coding with async (or Twisted, or to a certain extent
Win32, GTK, Qt and so on) there's a scary loss of control because the flow
of the program is out of your hands - your reaction is that you don't
understand how the program as a whole works.  Then you realise that you
*did* understand it all along, and it's just the nasty implementation
details that are hidden, which is fine.  You have the same amount of
knowledge before and after that "aha!" moment, but your state of mind
changes.  It's probably Zen.

------------------------------------------------------------------------
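
To make "the framework is in charge" concrete, here is a toy sketch built on
the standard selectors module.  The `Reactor` class and the `on_readable`
handler are hypothetical names invented for illustration; they are not taken
from the pop3proxy or from Twisted.  The loop owns the flow of control, and
the handler only reacts when data arrives or the peer disconnects:

```python
import selectors
import socket


class Reactor:
    """Toy event loop: it owns the flow of control and calls your handlers."""

    def __init__(self):
        self.selector = selectors.DefaultSelector()

    def register(self, sock, handler):
        # The handler is stored as the key's data and called on readability.
        self.selector.register(sock, selectors.EVENT_READ, handler)

    def unregister(self, sock):
        self.selector.unregister(sock)

    def run(self):
        # Loop until no sockets are left to watch.
        while len(self.selector.get_map()) > 0:
            for key, _events in self.selector.select(timeout=1.0):
                key.data(key.fileobj)  # the framework calls into your code
        self.selector.close()


reactor = Reactor()
received = []


def on_readable(sock):
    """Your code: react to one event, then hand control back to the loop."""
    data = sock.recv(1024)
    if data:
        received.append(data)
    else:
        # An empty read means the peer closed the connection.
        reactor.unregister(sock)
        sock.close()


# A connected socket pair stands in for a mail client and the proxy.
client, server = socket.socketpair()
client.sendall(b"+OK hello\r\n")
client.close()

reactor.register(server, on_readable)
reactor.run()
print(b"".join(received))
```

Nothing in `on_readable` decides when it runs or in what order; watching
more mailboxes is a matter of registering more sockets, not adding threads.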

Not sure whether that helps, but more information is better than less.
8-)

It's certainly true to say that the core spambayes classifier is very easy
to integrate into your program, and that Python's email handling
capabilities are very good, so Python and Spambayes should be a good fit
for what you need to do.
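
As a rough sketch of how the pieces might fit together, the following parses
a raw message with the standard email package and tags it with a verdict.
Everything classifier-related here is an assumption made for illustration:
`toy_score` is a hypothetical stand-in for a real classifier, and the
X-Example-Classification header and the 0.3 threshold are invented.

```python
from email import message_from_bytes
from email.policy import default


def toy_score(text):
    """Hypothetical stand-in for a real classifier: returns a value in 0..1."""
    spammy = {"free", "winner", "prize"}
    words = text.lower().split()
    if not words:
        return 0.5
    hits = sum(1 for w in words if w.strip(".,!?") in spammy)
    return hits / len(words)


def tag_message(raw_bytes, score=toy_score, threshold=0.3):
    """Parse a raw message, score its subject and body, and add a header."""
    msg = message_from_bytes(raw_bytes, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    subject = msg["Subject"] or ""
    verdict = "spam" if score(subject + " " + text) >= threshold else "ham"
    # Hypothetical header name, used only for this example.
    msg["X-Example-Classification"] = verdict
    return msg


raw = b"Subject: You are a winner\r\n\r\nClaim your free prize!\r\n"
tagged = tag_message(raw)
print(tagged["X-Example-Classification"])
```

In a fetchmail replacement, the raw bytes would presumably come from joining
the lines returned by poplib's retr() with b"\r\n", and the scoring function
would be replaced by a call into the real classifier.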

-- 
Richie Hindle
richie at entrian.com




