[Mailman-Developers] BounceRunner optimization and problems with VERY LARGE lists

Thu Jan 30 22:26:15 EST 2003

Hi,
I'm new to the list and have been using Mailman 2.1 for about a month on a
single list of 138,000 subscribers.  (It's a legit opt-in, announce-only list
for my wife's website that has taken 3 years to grow to this size.)

A typical mailing to this list generates over 5,000 bounces, giving
BounceRunner A LOT of work.  At best, BounceRunner can process these at
20/minute on my 2 GHz P4 Red Hat mail server, taking over 4 hours best case
to finish (running at 90% CPU utilization).  However, while it is doing all
this work I have the following problems:

(1) The Web GUI Membership Management page times out and is unusable.
(2) If CommandRunner starts processing commands, I get "couldn't get list
lock" errors in BounceRunner's log (and, of course, the bounce updates to
the list are lost).

After studying the code in the Runner class (and BounceRunner in particular),
I believe I have a solution to these problems, but I wanted to get a sanity
check from everyone on the list BEFORE I begin rewriting the code.

Here is a greatly simplified overview of how I understand BounceRunner
currently processes bounces in the Mailman 2.1 code.  I've highlighted the
trouble spots in CAPS:

While Forever
    (Process all the emails we find in the bounce queue)
    For Every email in queue
        REREAD list from disk
        Dequeue the message
        Extract addresses to bounce
        LOCK the LIST
        For Every address in message
            Register Bounce
        SAVE the list to disk
        UNLOCK the list
    If we didn't PROCESS ANY EMAILS on last pass
        Then SLEEP for SLEEPTIME
CLEANUP ON EXIT

I believe that these are the trouble spots that have been causing me
problems:

(1) The list SAVE is executed once for every bounced email.  For my big
list, that's 13 megabytes of data written to and read back from the disk for
EVERY bounce email, which is why it takes 2-3 seconds to process an email.

(2) BounceRunner is VERY greedy about the list lock.  The time "window" for
other processes to acquire the list lock is VERY short when the bounce queue
is filling or full.  In this case, the lock is only free for the time it
takes to extract the addresses from the next email!
In addition, because we ONLY sleep when the queue is empty, this behavior can
persist for HOURS on a large list.
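
To put rough numbers on point (1), here is a back-of-the-envelope estimate in
Python.  The bounce count and list size are the figures quoted above; the
assumption that every bounce costs one full write plus one full read of the
list, with no help from the OS disk cache, is a simplification, and the
function name is only illustrative.

```python
# Back-of-the-envelope disk traffic for one mailing.
# Assumption: each bounce email triggers one full save and one full reload
# of the list data, and nothing is served from the disk cache.
BOUNCES_PER_MAILING = 5000   # "over 5,000 bounces" per mailing
LIST_SIZE_MB = 13            # size of the list data on disk


def total_io_mb(bounces, list_size_mb):
    """Return megabytes written plus megabytes read back."""
    written = bounces * list_size_mb
    read_back = bounces * list_size_mb
    return written + read_back


mb = total_io_mb(BOUNCES_PER_MAILING, LIST_SIZE_MB)
print("%d MB of disk traffic per mailing" % mb)
```

Under those assumptions a single mailing moves on the order of 130,000 MB
through the disk just to record bounces.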

Here's my version of the new, improved BounceRunner:

Initialize x to number of bounces to process on each pass
While Forever
    Initialize Python list structure to hold bounces
    (Process x emails in the bounce queue)
    For x emails in queue
        Dequeue the message
        Extract addresses to bounce
        SAVE address and Listname in Python list structure
    If Python list structure contains emails
        For all mailing lists in Python structure
            REREAD list from disk
            LOCK the LIST
            For all addresses that bounced for this list
                Register Bounce
            SAVE the list to disk
            UNLOCK the list
    SLEEP for SLEEPTIME
CLEANUP on exit
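
As a sanity check on the control flow, the outline above can be sketched in
self-contained Python.  This is NOT Mailman code: FakeList, process_pass, and
the (listname, addresses) queue entries are hypothetical stand-ins, and the
"extract addresses" step is assumed to have already happened.  The sketch
only shows that each list is reread, locked, and saved once per pass rather
than once per bounce email.

```python
from collections import defaultdict, deque


class FakeList:
    """Hypothetical stand-in for a mailing list; counts expensive steps."""

    def __init__(self, name):
        self.name = name
        self.bounced = []
        self.loads = 0
        self.saves = 0
        self.locked = False

    def load(self):                      # REREAD list from disk
        self.loads += 1

    def lock(self):                      # LOCK the list
        self.locked = True

    def unlock(self):                    # UNLOCK the list
        self.locked = False

    def register_bounce(self, address):
        assert self.locked, "must hold the list lock"
        self.bounced.append(address)

    def save(self):                      # SAVE the list to disk
        assert self.locked, "must hold the list lock"
        self.saves += 1


def process_pass(queue, lists, x):
    """Handle up to x queued bounce emails, saving each list only once."""
    pending = defaultdict(list)          # listname -> bounced addresses
    for _ in range(min(x, len(queue))):
        listname, addresses = queue.popleft()   # dequeue the message
        pending[listname].extend(addresses)
    for listname, addresses in pending.items():
        mlist = lists[listname]
        mlist.load()
        mlist.lock()
        try:
            for address in addresses:
                mlist.register_bounce(address)
            mlist.save()
        finally:
            mlist.unlock()               # release even if save fails
    return sum(len(a) for a in pending.values())
```

The caller would invoke process_pass in a loop and sleep for SLEEPTIME after
every pass.  With 250 queued single-address bounces for one list and x = 100,
the queue drains in three passes with three saves instead of 250.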

Advantages to this method:

(1) We process a number of bounces before writing out the list, reducing I/O
(the real bottleneck) by a factor of x.  When x is one, the algorithm almost
degenerates to the current method.

(2) Since we always sleep on each pass it gives other processes (like the
Web gui) a chance to read the list.

(3) By increasing x we control the number of bounces that get processed on
each pass.  The time it takes to extract the addresses gives other processes
time to acquire the list lock and avoid "lockout".

(4) Since "in memory" bounce registration is very fast, we can do a lot of
them while the list is locked without adding significantly to the already
long lock time on a big list (I believe the I/O is the limiting factor).
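
To get a feel for how x changes things, here is a rough estimate in Python.
It assumes a single list, that each save/reread cycle costs about 2.5 seconds
(the midpoint of the 2-3 seconds quoted earlier), and it ignores address
extraction and the SLEEPTIME added on every pass, so treat the output as an
order-of-magnitude guide only.  The function names are illustrative.

```python
# Rough effect of the batch size x on the number of list saves.
SECONDS_PER_SAVE = 2.5   # assumption: midpoint of "2-3 seconds" per email


def passes_needed(bounces, x):
    """Number of passes (and list saves) to clear the queue, rounded up."""
    return (bounces + x - 1) // x


def estimated_io_seconds(bounces, x, seconds_per_save=SECONDS_PER_SAVE):
    """Time spent on save/reread cycles alone, under the assumptions above."""
    return passes_needed(bounces, x) * seconds_per_save


for x in (1, 10, 100):
    print(x, passes_needed(5000, x), estimated_io_seconds(5000, x))
```

Note that the sleep on every pass adds passes_needed(...) * SLEEPTIME on top
of these figures, so x and SLEEPTIME would need to be tuned together.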

Disadvantages:

(1) A larger number of bounces could be lost if we can't acquire the list
lock to update the list.  If desired, we could write the extracted addresses
to a file to allow easier recovery in this situation.  However, since they
are just bounces it's not a huge loss anyway.

(2) The processing time for the larger number of bounces WILL be greater
than the single bounce processed now.  How much more I don't know.  This
will mean that the list will be locked for a longer period on each pass.
However, it will be locked LESS frequently since the bounces can be cleared
from the queue faster.
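
For disadvantage (1), writing the extracted addresses to a file could look
something like the sketch below.  Everything here is hypothetical: the
one-JSON-record-per-line format, the file path, and the function names are
illustrative choices, not anything Mailman provides.  The idea is simply to
append each batch before trying for the list lock, and to delete the file
only after the list has been saved.

```python
import json
import os


def journal_bounces(path, listname, addresses):
    """Append one batch of extracted bounces to a recovery file."""
    record = {"list": listname, "addresses": addresses}
    with open(path, "a") as fp:
        fp.write(json.dumps(record) + "\n")
        fp.flush()
        os.fsync(fp.fileno())    # make sure the batch is on disk


def replay_journal(path):
    """Return {listname: [addresses]} for every batch still in the file."""
    pending = {}
    if not os.path.exists(path):
        return pending
    with open(path) as fp:
        for line in fp:
            if line.strip():
                record = json.loads(line)
                pending.setdefault(record["list"], []).extend(
                    record["addresses"])
    return pending


def clear_journal(path):
    """Remove the recovery file once the list has been saved."""
    if os.path.exists(path):
        os.remove(path)
```

If the lock can't be acquired, the file is left in place and the next pass
can call replay_journal to pick up the unregistered bounces.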

I'm thinking of a similar strategy for CommandRunner, since that is my other
resource hog, taking 2-5 seconds per subscribe or unsubscribe.

Thoughts? Comments? Suggestions?  I'm interested in any and all responses.