[Mailman-Users] Manipulate mailman in / out queue

Wed Oct 17 06:23:30 CEST 2012

Xueshan Feng wrote:
>
>On Mon, Oct 15, 2012 at 9:35 PM, Mark Sapiro <mark at msapiro.net> wrote:
>
>> This is really more involved than I can explain without a keyboard which I
>> won't have before Tues eve, but there should be only one .bak file or one
>> per slice if the runner is sliced. This is the message currently being
>> processed. All others are ignored by the current runner (they will be
>> "recovered" if the runner is restarted).
>>
>
>This helps a lot already. We do have multiple runners.

Here are the gory details. All the heavy lifting is done by methods of
the Switchboard class defined in Mailman/Queue/Switchboard.py.

Any particular runner is specific to a particular queue or slice of a
queue. The out/ queue is processed by OutgoingRunner. If it isn't
sliced, it processes the whole queue. If it is sliced, there are N
slices.

Note: The filename of a queue entry consists of a time stamp, a '+', a
40 hex digit hash and the extension (.pck or .bak). A slice consists
of (1/N)th of the hash space. E.g., if N = 4, slice 0 is all hashes
with first hex digit = 0, 1, 2 or 3; slice 1 is all hashes with first
hex digit = 4, 5, 6 or 7; slice 2 is all hashes with first hex digit =
8, 9, A or B, and slice 3 is all hashes with first hex digit = C, D, E
or F.
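
As a rough illustration of the slicing rule in the note above, here is
a minimal Python sketch. The names slice_bounds and in_slice are made up
for this example and are not Mailman's own functions. It assumes the
40 hex digit hash is compared as a 160-bit integer, which for N = 4
gives the same split as the first-hex-digit rule.

```python
# Sketch only: slice_bounds and in_slice are illustrative names,
# not part of Mailman's API.
HASH_BITS = 160  # 40 hex digits * 4 bits each


def slice_bounds(slice_no, num_slices):
    """Return the inclusive (lower, upper) hash bounds for one slice."""
    space = 1 << HASH_BITS
    lower = space * slice_no // num_slices
    upper = space * (slice_no + 1) // num_slices - 1
    return lower, upper


def in_slice(filename, slice_no, num_slices):
    """True if '<timestamp>+<40 hex digits>.<ext>' belongs to the slice."""
    base = filename.rsplit('.', 1)[0]   # drop the .pck or .bak extension
    digest = base.split('+', 1)[1]      # the part after the '+'
    lower, upper = slice_bounds(slice_no, num_slices)
    return lower <= int(digest, 16) <= upper
```

With N = 4, an entry whose hash starts with C falls in slice 3 and in
no other slice.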

A particular slice of OutgoingRunner initializes its Switchboard
instance once at startup or restart. This creates the queue directory
(qfiles/out/, or whatever queue this runner processes) if necessary,
sets the upper and lower hash bounds for its slice if sliced and,
normally, recovers all the .bak files in its slice. Recovery consists
of incrementing a recovery count in the entry's metadata and renaming
it from *.bak to *.pck. Thus, immediately after (re)starting a runner,
there will be no *.bak files in its slice. The counter stops loops
where a message keeps crashing the runner. A .bak file will be recovered
at most 3 times and then moved to qfiles/bad/*.psv.
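
The recovery step can be sketched roughly as follows. This is not the
Switchboard code, only a simplified Python 3 illustration under stated
assumptions: the function name, the 'recovery_count' metadata key and
the exact point at which the limit is checked are all hypothetical, and
the entry is assumed to hold a pickled message followed by a pickled
metadata dictionary.

```python
import os
import pickle

MAX_RECOVERIES = 3  # taken from "at most 3 times" above


def recover_backup_files(queue_dir, bad_dir):
    """Rename *.bak to *.pck, or park entries recovered too often."""
    for name in sorted(os.listdir(queue_dir)):
        base, ext = os.path.splitext(name)
        if ext != '.bak':
            continue
        bak = os.path.join(queue_dir, name)
        with open(bak, 'rb') as fp:
            msg = pickle.load(fp)    # assumed layout: message first,
            data = pickle.load(fp)   # then the metadata dictionary
        # 'recovery_count' is a hypothetical key used for this sketch.
        count = data.get('recovery_count', 0)
        if count >= MAX_RECOVERIES:
            # Give up on this entry and preserve it for inspection.
            os.rename(bak, os.path.join(bad_dir, base + '.psv'))
            continue
        data['recovery_count'] = count + 1
        tmp = bak + '.tmp'
        with open(tmp, 'wb') as fp:
            pickle.dump(msg, fp)
            pickle.dump(data, fp)
        os.replace(tmp, bak)  # swap in the copy with the new count
        os.rename(bak, os.path.join(queue_dir, base + '.pck'))
```

After this runs, the queue directory holds no *.bak files, which is the
state described above for a freshly (re)started runner.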

After initialization, a runner first obtains a list of all the .pck
files in its slice, sorted by timestamp so the list is FIFO. It then
processes the list until the list is exhausted, sleeps for a second
and gets a new list and repeats the process. If the new list is empty,
it just sleeps a second and tries again until it gets one or more
entries to process.

Processing consists of renaming the file from *.pck to *.bak,
unpickling it and processing it. When processing completes, the .bak
file is removed. If the runner crashes during processing, it will
recover the .bak file upon restart. Thus, there should never be more
than one .bak file per slice.
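
One pass of the list-and-process cycle described above might look like
this in simplified Python 3. The function names and the handler callback
are invented for the example, slicing and the one-second sleep between
passes are left out, and the entry layout (pickled message followed by
pickled metadata) is an assumption of the sketch.

```python
import os
import pickle


def pending_entries(queue_dir):
    """Return base names of the .pck entries, oldest first (FIFO)."""
    bases = [n[:-4] for n in os.listdir(queue_dir) if n.endswith('.pck')]
    # The part before the '+' is the time stamp.
    return sorted(bases, key=lambda b: float(b.split('+', 1)[0]))


def process_once(queue_dir, handler):
    """Process every entry that was in the queue when the pass began."""
    for base in pending_entries(queue_dir):
        pck = os.path.join(queue_dir, base + '.pck')
        bak = os.path.join(queue_dir, base + '.bak')
        os.rename(pck, bak)          # mark the entry as "in progress"
        with open(bak, 'rb') as fp:
            msg = pickle.load(fp)
            data = pickle.load(fp)
        handler(msg, data)           # a crash here leaves the .bak behind
        os.unlink(bak)               # done: no .bak remains
```

Because each entry is renamed to .bak only while it is being handled,
a single loop like this never leaves more than one .bak file behind.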

>> Note that part of the slowness at this point is due to the size of the out
>> directory.
>
>
>I was able to flush the queue today by moving long lasting *.bak out of the
>way, and at the same time stopped Postfix to allow mailman to process its
>queue. It took about half an hour to process 8000+ messages. If no manual
>intervene, it may take a few hours.
>
>> You can address this by stopping Mailman, moving qfiles/out aside, starting
>> Mailman (which should recreate qfiles/out at the first message if not
>> before) and then moving old entries back a few at a time.
>>
>
>I think I've done that before. So moving back files into the queue in
>batches, doesn't have to stop mailman?

First of all, the actual physical size of the queue directory impacts
processing. Every time an entry is added to the queue, and every time
a .pck file is renamed to .bak, the entire physical directory must be
searched to ensure this isn't a duplicate name. Depending on OS
settings, cache sizes and the physical directory size, this may
actually involve multiple disk reads each time. Thus, if the
qfiles/out/ directory has grown large because 8000+ messages were
added to the queue when the runner couldn't handle them (and there may
have been more in the retry/ queue because of SMTP failures), it would
benefit from shrinking. This is accomplished by moving (mv) or
renaming the queue directory itself aside, not just its contents, and
then letting the runner recreate it when it starts. Then, if
necessary, move messages back a few at a time so the directory doesn't
grow large again.

>The real operational question here is each time if we have to stop / start
>mailman to move files,  than for large volume queues, it would take a lot
>of manual process. The procedure I have used is:
>
>- stop mailman
>- move queue files or .bak file aside

   Move the whole directory, not the contents.

>- start mailman
>- move some files back, or .bak back into the queue
>(note  files are moved back while mailman is running)

Moving (mv or rename) files back from the same file system while
Mailman is running is fine. When the entry appears in the directory in
this case, the file contents are complete. This is essentially what
Mailman does when it makes a queue entry. Copying (cp) is not good
because there can be a directory entry for the file before its
contents are complete, and a runner could read an incomplete file.
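
If you want to script the "move some back" step, something like the
following Python sketch would do it. The function name, the directory
arguments and the batch size are placeholders for illustration. It uses
os.rename, so each entry shows up in the queue directory with its
contents already complete, and on a POSIX system the call fails rather
than silently copying if the two directories are on different file
systems.

```python
import os


def move_back(aside_dir, queue_dir, batch_size=100):
    """Rename up to batch_size .pck entries from aside_dir into queue_dir."""
    # Sorting by name puts older time stamps roughly first.
    names = sorted(n for n in os.listdir(aside_dir) if n.endswith('.pck'))
    moved = []
    for name in names[:batch_size]:
        # Rename, never copy: the entry appears in one step.
        os.rename(os.path.join(aside_dir, name),
                  os.path.join(queue_dir, name))
        moved.append(name)
    return moved
```

Call it repeatedly, waiting for the runner to drain the queue between
batches. It only touches .pck files, since a .bak file moved into a
running queue would be ignored until the runner is restarted.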

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan