[Mailman-Developers] First big Mailing

J C Lawrence claw@kanga.nu
Thu, 10 Jan 2002 23:07:10 -0800


On Thu, 10 Jan 2002 22:08:19 -0800 
Marc Perkel <marc@perkel.com> wrote:

> I'm doing my first big mailing with Mailman/Exim to deliver
> effector for the Electronic Frontier Foundation. The list is about
> 20,000 names.

I'm going to do some guessing here, so make allowances.

> Anyhow - started out moving right along, maybe too well -
> saturated the T1 pretty quick and the system slowed down. But the
> email logs were really going fast. This went on for about 45
> minutes.

Translation: A message came in, was approved (automatically or by
moderation) and moved to ~mailman/qfiles along with a copy of the
distribution list.  qrunner ran a little while later (cron job),
and started delivering the broadcast to the MTA (thus all the
activity).

> During this time I saw a number of errors. Messages indicating the
> I had too many files open (running tail on the exim logs). A few
> messages that looked like something couldn't open something.db or
> something like that.

Not good.  Very not good.  Loose guess:  You have configured Exim
for so many queue runners, parallel deliveries, and/or simultaneous
incoming deliveries that Exim exceeded either your kernel's maximum
number of file handles per process, or the total number of file
handles in the system.

You need to track this down and fix it.  Now.  Before you do
anything else.

It may be enough (for the short term) to just reconfigure Exim to
use a smaller number of processes etc, but that's a stopgap, not a
fix.
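
As a starting point for tracking this down, here is a minimal Python
sketch that reports the two limits mentioned above.  It is an
illustration, not part of Mailman or Exim: the function names are made
up, the system-wide figure assumes a Linux-style /proc/sys/fs/file-nr,
and the per-process numbers are those of the process running the
script, which may differ from what the Exim daemon inherits.

```python
import resource


def fd_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def system_fd_usage(path="/proc/sys/fs/file-nr"):
    """Return (allocated, maximum) system-wide file handles.

    Assumes the Linux file-nr layout of three numbers: allocated,
    unused, maximum.  Returns None if the file cannot be read.
    """
    try:
        with open(path) as f:
            fields = f.read().split()
    except OSError:
        return None
    return int(fields[0]), int(fields[2])


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("per-process limit: soft=%s hard=%s" % (soft, hard))
    usage = system_fd_usage()
    if usage is not None:
        print("system-wide: %d allocated of %d" % usage)
```

Running it as the user that starts Exim, while a big delivery is in
progress, should show whether the system-wide count is creeping up on
its maximum.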

> The delivery slowed down as if it were done. System load dropped
> back to low normal levels. 

This sounds like qrunner processes being forked by cron, each trying
to deliver more of your 20K messages to Exim.  Qrunner has an
internal timeout (15mins IIRC) after which it will be reaped and a
new process forked by the next cron pass.

> This lasted a while - then things started back up again really
> delivering messages. These deliveries come in spurts.

Which would explain the above bit.
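
The stop-and-start pattern is what you would expect from any queue
runner that works under a time budget.  The Python sketch below is a
generic illustration of that idea, not Mailman's actual qrunner code;
run_for and its parameters are hypothetical names.

```python
import time


def run_for(queue, deliver, budget_seconds, clock=time.monotonic):
    """Deliver items from the front of queue until it is empty or the
    time budget is spent.  Returns the number of items delivered."""
    deadline = clock() + budget_seconds
    delivered = 0
    while queue and clock() < deadline:
        deliver(queue.pop(0))
        delivered += 1
    return delivered
```

Whatever is still in the queue when the budget runs out sits there
until the next invocation (here, the next cron pass), so deliveries
arrive in bursts separated by idle gaps.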

> Anyhow - even though Exim is delivering other email. Messages sent
> to mailman are getting "stuck". 

Odds are good that a qrunner process was ungracefully reaped,
resulting in a stale lock file in ~mailman/locks.  As a result
subsequent qrunner processes are doing nothing, waiting for the lock
to time out.  

Fix:

  Check that there are no qrunner processes running.  If there are
  none, delete ~mailman/locks/*
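
If you would rather script that than do it by hand, here is a cautious
Python sketch.  clear_locks is a hypothetical helper, not something
Mailman ships.  It leaves the "is qrunner running?" check to the caller
(for example by looking at ps output) and defaults to a dry run that
only reports what it would remove.

```python
import os


def clear_locks(lock_dir, qrunner_running, dry_run=True):
    """Remove every regular file in lock_dir, but only when no qrunner
    is running.  Returns the paths removed (or, in a dry run, the paths
    that would be removed)."""
    if qrunner_running:
        # A live qrunner may legitimately hold these locks: touch nothing.
        return []
    removed = []
    for name in sorted(os.listdir(lock_dir)):
        path = os.path.join(lock_dir, name)
        if os.path.isfile(path):
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Call it first as clear_locks(os.path.expanduser("~mailman/locks"),
qrunner_running=False) to see the list, then again with dry_run=False
once you are sure nothing is running.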

Notes:

  You *REALLY* need to fix your file handle problem.  It's not
  unlikely that it is a fundamental cause of your problems.

  qrunner is responsible for all motion of mail thru the Mailman
  system, in receipt, moderation, and broadcasting.  If qrunner is
  locked, nothing will happen until it is unlocked.

> It's as if nothing in mailman is working. I see messages being
> sent to mailman. But mailman isn't responding. 

Nope.  They're being stashed in ~mailman/data, awaiting qrunner to
Do The Right Thing (deliver to list, process as bounce, broadcast,
etc.).

> I don't know if something is holding these messages and this is
> waiting in some queue - or if Mailman has crashed and is eating
> messages - or something is corrupt or locked or overloaded or what.

See above.

Again:  FIX YOUR FILE HANDLE PROBLEM FIRST!  

Until you do, you can lose mail and have inexplicable,
impossible-to-debug problems.

Notes:

  If you can spare the systems, the first thing you'll need to do is
  separate the MTA that is handling final delivery from your Mailman
  machine.  Bounce processing can and will *really* screw with your
  efficiencies and load with lists of that size.  
  
  Recommended architecture for scaling Mailman in your sort of
  situation is to have your Mailman system deliver all outbound
  mail to a smarthost (either via a smarthost rule on your MTA, or
  directly via SMTP config in Mailman).  Ensure that the MXing for
  bounces will *NOT* go back to your smarthost, but will go directly
  to your Mailman system.

  This allows your smarthost to be tuned for what you need it to do:
  handle outbound deliveries efficiently, and allows your Mailman
  system to remain responsive (and under less load as there's no
  local overburdened MTA queue) for processing inbound bounces etc.
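
For the "directly via SMTP config in Mailman" route, the change would
look roughly like the fragment below.  Treat it as a sketch: it assumes
your version's Defaults.py defines SMTPHOST and SMTPPORT (check before
relying on it), and smarthost.example.com is a placeholder, not a real
host.

```python
# Fragment for ~mailman/Mailman/mm_cfg.py, placed after the settings the
# file already imports from Defaults.
# Assumption: SMTPHOST and SMTPPORT are the names your Defaults.py uses.
# Assumption: smarthost.example.com stands in for your real smarthost.
SMTPHOST = 'smarthost.example.com'
SMTPPORT = 25
```

With this in place Mailman hands outbound mail straight to the
smarthost, and the local MTA only has to deal with inbound mail and
bounces.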

  Set SMTP_MAX_RCPTS in ~mailman/Mailman/mm_cfg.py to something
  reasonably large.  Suggest something in the 50 - 100 range.  Do
  not go any larger.  This may help with temporarily resolving your
  file handle problem.  It will also decrease system load in general
  and help smooth things along a bit.  Later, when everything is
  known working, you can start tuning for performance and look at
  dropping SMTP_MAX_RCPTS down to around 5 (usually the sweet spot).
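
To see why the setting matters, here is a small Python sketch of what
a recipients-per-message cap does to a 20,000 name list.  It is a
simplified illustration, not Mailman's own batching code, and
chunk_recipients and the example.com addresses are made up.

```python
def chunk_recipients(recipients, max_rcpts):
    """Split recipients into consecutive batches of at most max_rcpts."""
    if max_rcpts < 1:
        raise ValueError("max_rcpts must be at least 1")
    return [recipients[i:i + max_rcpts]
            for i in range(0, len(recipients), max_rcpts)]


# Placeholder addresses standing in for the real 20,000 name list.
recipients = ["user%d@example.com" % i for i in range(20000)]
for max_rcpts in (100, 50, 5):
    batches = chunk_recipients(recipients, max_rcpts)
    print("max_rcpts=%d -> %d messages handed to the MTA"
          % (max_rcpts, len(batches)))
```

Each batch becomes one message in the MTA's queue, so the cap directly
controls how many queue entries (and therefore delivery processes and
open files) a single posting creates.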

> Anyhow - I'd like some general feedback on what might be
> happening. The newsletter contains an important story about Norway
> indicting Jon Johansen criminally. He's the guy who wrote the DVD
> code.

Yeah, I read it.  

ObOffer: 

  If you would like some help offloading your mail traffic, I'm
  willing to smarthost a chunk of it for you (will need to verify
  with my upstreams).  Basic idea would be to smarthost route a
  couple TLDs to me for final delivery (I've got a couple T3's so I
  should be able to take a fair percentage of your load).

-- 
J C Lawrence                
---------(*)                Satan, oscillate my metallic sonatas. 
claw@kanga.nu               He lived as a devil, eh?		  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.