[Mailman-Developers] Mailman 2.0 usage notes...
chuq von rospach
chuqui@plaidworks.com
Tue, 10 Apr 2001 13:47:24 -0700
Been whapping at my big mailman machine the last week, because it's been
slowly falling behind and never quite catching up. Unfortunately, it
seems like I've simply hit capacity for now, so I'm looking for ways to
extend that or at least minimize the damage until 2.1's multi-threading
queueing comes online and I can use it (my big problem is the
single-threading....)
One thing that would help would be if the Sendmail module got
productionalized so it could be used instead of SMTPDirect, because then
you could use the -deliverymode=defer option, which you can't on
SMTPDirect because it disables some spamchecking.
But going elbow-deep into the server for a few days under a constant
grinding load has brought forward a few things I thought I'd pass on...
First, there's a problem with the way queues are processed. qrunner
uses:
for file in os.listdir(xxxx)
to read the queue. The order is undefined, but in practice, it's
basically blatting it out in whatever order the directory stores it. If
you're not overly busy, that's fine. But if you start hitting the point
where you're backing up, it means you process "N" slots into the
directory, then qrunner exits and starts again, and it then re-covers
the same directory slots. Things that go in past those slots simply
NEVER get processed, unless the system quiets down enough to let qrunner
catch up. Qrunner *really* needs to process the queue FIFO.
Unfortunately, teaching qrunner to go FIFO is a bit complicated. You'd
have to pull all of the filenames out, stat them all, and then sort
that. There's a much easier solution, though --
In Mailman/Message.py where the filename is created, Mailman uses
time.time() and some other values to create a filename, which is then
converted to hex. The idea is to create a unique filename. But in fact,
time.time() should be unique (to be paranoid, one could grab it and then
check to see if the filename exists and loop), and if you stored the
queue files as "time.time()".msg/.db, then qrunner could sort the queue
trivially, guaranteeing that the oldest messages are always processed in
each queue run.
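A minimal sketch of the timestamp-named queue idea above, written in
modern Python rather than against the Mailman 2.0 source. The names
enqueue() and oldest_first(), the zero-padded filename format, and the
.msg-only handling are assumptions for illustration, not Mailman's
actual code.

```python
import os
import time


def enqueue(qdir, data, now=None):
    """Write data to a new queue file named after its arrival time."""
    now = time.time() if now is None else now
    while True:
        # Zero-padded, fixed-width name so string order equals time order.
        name = "%020.6f.msg" % now
        path = os.path.join(qdir, name)
        try:
            # O_EXCL makes creation fail if the name is already taken.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        except FileExistsError:
            now += 0.000001  # collision: nudge the timestamp and retry
            continue
        with os.fdopen(fd, "wb") as fp:
            fp.write(data)
        return name


def oldest_first(qdir):
    """Return queue filenames sorted oldest to newest."""
    return sorted(f for f in os.listdir(qdir) if f.endswith(".msg"))
```

The "check and loop" paranoia is handled by the exclusive create, and
the fixed-width name means a plain sort gives FIFO order without having
to stat every file.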
The current unordered processing creates some interesting race
conditions, where an item is added to the queue and never comes out --
which causes people to think it's lost
and repost it, adding to the queue clogging, which slows stuff,
which... until Saturday, when they all go home, the system slows down,
and qrunner catches up and posts three day old messages.... And since
the system seems to be working just fine, tracking it down is fun...
This also led to finding a problem in the bounce processing area. The
bouncer works pretty well, but it has one flaw for which I don't have an
easy answer. If I'm subscribed as "chuq@plaidworks.com", but for some
reason the bounce comes back as "chuq@mail.plaidworks.com" (or vice
versa, or if I'm forwarding mail in some other name), the bouncer will
catch the bounce and try to process it, not find me, and log it as a
"user not subscribed". Unless the admin is somehow post-processing the
bounce logs, though, that bounce is never REALLY handled, so it bounces
indefinitely, and the admin never knows. This also over time encourages
queue clogging and wastes bandwidth and CPU and all of that -- and
worse, list admins and site admins probably think everything is fine
because the bouncing system is working and these "not a member" bounces
are never reported anywhere.
On the other, other hand, you probably don't want to just blat all these
at admins, since they'll tune them out. But some kind of nightly report
of some sort is the tradeoff I'd make, I think, so admins could see
continual bounce problems that need to be manually investigated. And I
strongly recommend all site admins watch the bounce logs and look for
these "missing" bounces, so they can be manually tracked. I found on my
busy site this made a HUGE difference in my queue backlogs, too; these
things were silently contributing a significant amount of traffic to the
queue system and exacerbating my capacity issues.
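A rough sketch of what a nightly report of these unmatched bounces
might look like, in modern Python. The log line shape matched by
PATTERN is purely hypothetical (the real bounce log format may well
differ and the regex would need adjusting), and summarize_unmatched()
and format_report() are made-up names, not part of Mailman.

```python
import re
from collections import Counter

# Hypothetical log line shape: "... <list>: <address> ... not a member".
# The real bounce log format may differ; adjust the pattern to match it.
PATTERN = re.compile(r"(?P<list>\S+): (?P<addr>\S+@\S+) .*not a member", re.I)


def summarize_unmatched(lines):
    """Count 'not a member' bounces per (list, address) pair."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[(m.group("list"), m.group("addr").lower())] += 1
    return counts


def format_report(counts):
    """Render the counts as plain text, most frequent first."""
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join("%5d  %s  %s" % (n, lst, addr) for (lst, addr), n in rows)
```

Run from cron over the day's log and mailed to the site admin, something
like this surfaces the repeat offenders once a day instead of one
message per bounce.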
I really think the qrunner issue needs to be dealt with; it only shows
up on fairly busy sites, but it's a definite bug for folks like me. The
bouncer issue is less nasty, but in a "good citizens clean up their
trash" attitude, I think we need to at least make sure list/site admins
are aware of these bouncers, unless someone can figure out a way to
automate fixing them (this is a place where VERP type things could help,
but I'm not going there, honest... giggle)
chuq
--
Chuq Von Rospach, Internet Gnome <http://www.chuqui.com>
[<chuqui@plaidworks.com> = <me@chuqui.com> = <chuq@apple.com>]
Yes, yes, I've finally finished my home page. Lucky you.
The first rule of holes: If you are in one, stop digging.