[Mailman-Users] Problems with multi-machine slicing

Sat May 24 16:52:00 CEST 2014

After doing some upgrades, I noticed yesterday that my multi-machine 
setup is no longer properly slicing the queue between machines.  I 
probably missed something, but after going through all my notes on the 
setup I cannot figure out what the problem is.  Hopefully someone else 
can spot the issue?

I have four mail servers.  Three of them are supposed to slice the queue 
between them, and the fourth machine is set as a backup to process any 
remaining messages after 2 minutes.  On the three slice machines, I have 
patched mailmanctl as:
----------
def start_all_runners():
     kids = {}
 >>>
     for qrname, count, machine, nummachines in mm_cfg.QRUNNERS:
         for slice in range(machine, count, nummachines):
<<<
             # queue runner name, slice, numslices, restart count
             info = (qrname, slice, count, 0)
             pid = start_runner(qrname, slice, count)
             kids[pid] = info
     return kids
----------

Each of these machines has a QRUNNERS section added to mm_cfg.py which 
defines that machine's slice as count,machine,nummachines (the order the 
patched loop unpacks them) --  3,0,3  /  3,1,3  /  3,2,3
and contains the line: QRUNNER_MESSAGE_IS_OLD_DELAY = None
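
As a sanity check of the patched loop, here is a minimal standalone sketch 
of which slice numbers each machine would start for the three settings 
above.  The helper name slices_for is hypothetical and not part of 
Mailman; it only mirrors the range(machine, count, nummachines) 
expression from the patch.

```python
def slices_for(count, machine, nummachines):
    """Return the slice numbers one machine would start.

    Mirrors the range() expression in the patched start_all_runners();
    this helper is illustrative only and does not exist in Mailman.
    """
    return list(range(machine, count, nummachines))


# The three (count, machine, nummachines) settings from the slice machines.
settings = [(3, 0, 3), (3, 1, 3), (3, 2, 3)]
assignments = [slices_for(*s) for s in settings]

for setting, slices in zip(settings, assignments):
    print(setting, "->", slices)
```

If the loop behaves as in this sketch, each machine starts exactly one of 
the three slices (0, 1 and 2 respectively), so the range() expression by 
itself appears to divide the work as intended.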

On the fourth (backup) machine, I have patched Switchboard.py as:
----------
             if ext <> extension:
                 continue
             when, digest = filebase.split('+')
 >>>
             now = time.time()
             age = now - float(when)
             # Only process defined 'old' entries.
             if not (
                 hasattr(mm_cfg, 'QRUNNER_MESSAGE_IS_OLD_DELAY') and
                 mm_cfg.QRUNNER_MESSAGE_IS_OLD_DELAY and
                 age > mm_cfg.QRUNNER_MESSAGE_IS_OLD_DELAY):
                 continue
<<<
             # Throw out any files which don't match our bitrange. BAW: test
             # performance and end-cases of this algorithm.  MAS: both
             # comparisons need to be <= to get complete range.
----------

On this fourth machine I have added to mm_cfg.py: 
QRUNNER_MESSAGE_IS_OLD_DELAY = minutes(2)
This machine has NOT had the slices patch added to mailmanctl, so there 
is no QRUNNERS section in mm_cfg.py.
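
To make the intent of the age check explicit, here is a small standalone 
sketch of the same test outside of Switchboard.py.  The function name 
is_old is hypothetical, and the sketch assumes that minutes(2) evaluates 
to 120 seconds and that the file base has the 'timestamp+digest' form 
that the patch splits on.

```python
import time


def is_old(filebase, delay, now=None):
    """Return True when a queue entry is older than `delay` seconds.

    `filebase` is assumed to look like '<timestamp>+<digest>'.
    A missing or false `delay` (None or 0) means nothing counts as old,
    which matches the patch: every entry would then be skipped.
    """
    if not delay:
        return False
    if now is None:
        now = time.time()
    when, _digest = filebase.split('+')
    return (now - float(when)) > delay


# Example entry queued at t=1400000000, checked with a 120-second delay.
entry = '1400000000.0+0123abcd'
print(is_old(entry, 120, now=1400000060.0))   # 60 seconds old
print(is_old(entry, 120, now=1400000121.0))   # 121 seconds old
```

In this sketch the backup machine would skip the entry on the first check 
and pick it up on the second, which matches the 2-minute behaviour I see 
when only the backup machine is running.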

OK, so if I only have the backup machine running, mailman will deliver 
my test message after 2 minutes.  That part works fine.  However, with the 
three slice machines running, the first machine (3,0,3) sends ALL of the 
messages out immediately.  If I shut down the first machine and leave 
the other two running, no messages are sent out until after the 2-minute 
period, then the backup machine sends them.  In other words, the queue 
is not being sliced, and only the first machine is capable of sending 
out list messages.

I have referenced back to the original article on this subject: 
https://mail.python.org/pipermail/mailman-users/2008-March/060753.html
but it appears I made the correct changes.  Has something changed in 
newer versions of mailman that now prevents this technique from working 
the same way?  Or was there something more to getting slicing to work 
that was not mentioned in that article?