[Mailman-Users] Mailman stuck : mailmanctl dead with messages in /qfiles/in

Jérôme jerome at jolimont.fr
Tue May 1 02:19:21 CEST 2012


Hi.

Thanks for answering.

Mon, 30 Apr 2012 16:15:03 -0700
Mark Sapiro wrote:

> > 2/ Cron/mailmanctl
> > 
> > ps auxww| grep mailmanctl |grep -v grep
> > -> Nothing.
> 
> How about
> 
> ps auxww| grep qrunner |grep -v grep

Nothing either.
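
The two checks above can be combined into one pass. This is only a minimal sketch: the helper names are hypothetical, and it assumes `ps auxww` is available and that the script's own file name does not contain "mailmanctl" or "qrunner" (otherwise it would match itself).

```python
import subprocess

# Process names taken from the two ps commands above.
PROCESS_NAMES = ("mailmanctl", "qrunner")


def find_mailman_processes(ps_output):
    """Return the lines of ps output that mention mailmanctl or qrunner."""
    return [
        line
        for line in ps_output.splitlines()
        if any(name in line for name in PROCESS_NAMES)
    ]


def running_mailman_processes():
    """Run `ps auxww` and filter its output in Python instead of with grep."""
    result = subprocess.run(
        ["ps", "auxww"], capture_output=True, text=True, check=True
    )
    return find_mailman_processes(result.stdout)
```

Calling `running_mailman_processes()` on the affected host would return an empty list in the situation described here, since neither command printed anything.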
 
> > 7/ Locks
> > 
> > /var/lib/mailman/locks -> /var/lock/mailman
> > 
> > ll /var/lock/mailman
> > total 0
> 
> It appears that some process or person is stopping Mailman.

OK. Need to figure out which.
 
> > 8/ Logs
> > 
> > /var/log/mailman/error :
> > Apr 30 03:16:21 2012 mailmanctl(11685): No child with pid: 17093 
> > Apr 30 03:16:21 2012 mailmanctl(11685): [Errno 3] No such process 
> > Apr 30 03:16:21 2012 mailmanctl(11685): Stale pid file removed.
> 
> 
> How about /var/log/mailman/qrunner ?

Each day, I get something like this:
Apr 28 03:16:33 2012 (17099) OutgoingRunner qrunner caught SIGHUP.  Reopening logs.
Apr 28 03:16:33 2012 (17094) ArchRunner qrunner caught SIGHUP.  Reopening logs.
Apr 28 03:16:33 2012 (17097) IncomingRunner qrunner caught SIGHUP.  Reopening logs.
Apr 28 03:16:33 2012 (17093) Master watcher caught SIGHUP.  Re-opening log files.
Apr 28 03:16:34 2012 (17095) BounceRunner qrunner caught SIGHUP.  Reopening logs.
Apr 28 03:16:34 2012 (17101) RetryRunner qrunner caught SIGHUP.  Reopening logs.
Apr 28 03:16:34 2012 (17096) CommandRunner qrunner caught SIGHUP.  Reopening logs.
Apr 28 03:16:34 2012 (17098) NewsRunner qrunner caught SIGHUP.  Reopening logs.
Apr 28 03:16:34 2012 (17100) VirginRunner qrunner caught SIGHUP.  Reopening logs.

On the day it stopped, I got this:
Apr 29 03:16:29 2012 (17099) OutgoingRunner qrunner caught SIGHUP.  Reopening logs.
Apr 29 03:16:29 2012 (17094) ArchRunner qrunner caught SIGHUP.  Reopening logs.
Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner caught SIGHUP.  Reopening logs.
Apr 29 03:16:29 2012 (17093) Master watcher caught SIGHUP.  Re-opening log files.
Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner caught SIGTERM.  Stopping.
Apr 29 03:16:29 2012 (17099) OutgoingRunner qrunner caught SIGTERM.  Stopping.
Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner exiting.
Apr 29 03:16:29 2012 (17094) ArchRunner qrunner caught SIGTERM.  Stopping.
Apr 29 03:16:29 2012 (17099) OutgoingRunner qrunner exiting.
Apr 29 03:16:29 2012 (17094) ArchRunner qrunner exiting.
Apr 29 03:16:29 2012 (17096) CommandRunner qrunner caught SIGHUP.  Reopening logs.
Apr 29 03:16:29 2012 (17101) RetryRunner qrunner caught SIGHUP.  Reopening logs.
Apr 29 03:16:29 2012 (17095) BounceRunner qrunner caught SIGHUP.  Reopening logs.
Apr 29 03:16:29 2012 (17098) NewsRunner qrunner caught SIGHUP.  Reopening logs.
Apr 29 03:16:29 2012 (17098) NewsRunner qrunner caught SIGTERM.  Stopping.
Apr 29 03:16:29 2012 (17095) BounceRunner qrunner caught SIGTERM.  Stopping.
Apr 29 03:16:29 2012 (17096) CommandRunner qrunner caught SIGTERM.  Stopping.
Apr 29 03:16:29 2012 (17101) RetryRunner qrunner caught SIGTERM.  Stopping.
Apr 29 03:16:29 2012 (17100) VirginRunner qrunner caught SIGHUP.  Reopening logs.
Apr 29 03:16:29 2012 (17096) CommandRunner qrunner exiting.
Apr 29 03:16:29 2012 (17098) NewsRunner qrunner exiting.
Apr 29 03:16:29 2012 (17095) BounceRunner qrunner exiting.
Apr 29 03:16:29 2012 (17100) VirginRunner qrunner caught SIGTERM.  Stopping.
Apr 29 03:16:29 2012 (17101) RetryRunner qrunner exiting.
Apr 29 03:16:29 2012 (17100) VirginRunner qrunner exiting.

Sorry for the mess, here. But I think you get the idea.

This seems to happen during a cron job: the SIGTERMs arrive at 03:16:29,
the same second as the daily SIGHUP.
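
To see which processes received which signal, and when, the qrunner log can be grouped by PID. The following is a minimal sketch, not a Mailman tool: the function name is hypothetical, and the pattern is inferred only from the entries quoted in this message (one entry per line, as in the file itself), so other Mailman versions may need a different expression.

```python
import re
from collections import defaultdict

# Pattern inferred from entries such as:
#   Apr 29 03:16:29 2012 (17097) IncomingRunner qrunner caught SIGTERM.  Stopping.
#   Apr 29 03:16:29 2012 (17093) Master watcher caught SIGHUP.  Re-opening log files.
ENTRY = re.compile(
    r"^(?P<when>\w{3} +\d+ \d\d:\d\d:\d\d \d{4}) "
    r"\((?P<pid>\d+)\) (?P<name>.+?)(?: qrunner)? caught (?P<signal>SIG[A-Z]+)\."
)


def signals_by_process(lines):
    """Map (pid, process name) to the list of (timestamp, signal) it caught."""
    result = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line)
        if match:  # "exiting" entries and unrelated lines are skipped
            key = (int(match["pid"]), match["name"])
            result[key].append((match["when"], match["signal"]))
    return dict(result)


# Possible usage, with the log path mentioned earlier in this thread:
# with open("/var/log/mailman/qrunner") as log:
#     for (pid, name), events in signals_by_process(log).items():
#         print(pid, name, events)
```

A process whose list ends with SIGTERM was told to stop; comparing that timestamp with the cron schedule should help narrow down which job sends it.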

Bug reports that could be related:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=505638
https://bugs.launchpad.net/mailman/+bug/265855

> > modified 
> > /var/lib/mailman/Mailman/Handlers/SMTPDirect.py
> > to add
> > self.__conn.set_debuglevel(1)
> 
> And yet you are not logging any smtp debugging in Mailman's error log.
> There should be copious log information for every outgoing message.

There was, but it stopped. The last message for which I have detailed output
is from Apr 22, one week before Mailman stopped sending messages.

-rw-rw-r-- 1 list list      198 Apr 30 03:16 /var/log/mailman/error
-rw-rw-r-- 1 list list        0 Apr 22 03:16 /var/log/mailman/error.1
-rw-rw-r-- 1 list list        0 Apr 15 03:16 /var/log/mailman/error.2
-rw-rw-r-- 1 list list 36541617 Apr 22 01:59 /var/log/mailman/error.3

Should there be anything relevant in there?

> > Configuration
> > -------------
> > 
> > Not sure this is useful, but 
> > /etc/mailman/mm_cfg.py contains
> > MTA='LocalPostfix'
> 
> The above line should cause significant problems when attempting to
> create or remove lists. it MUST be one of
> 
> MTA = 'Postfix'
> MTA = 'Manual'
> MTA = None
> 
> 'Postfix' means generate aliases and virtual-mailman files for Postfix.
> 'Manual' means display the necessary aliases
> None means don't do anything with aliases when lists are created/removed.

I configured Mailman three years ago. I don't remember every detail, but this
setting comes from here:
http://isp-control.net/documentation/howto/mail/setup_mailman

Is it such a bad idea?

I suppose it is unrelated, anyway.
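
For reference, the MTA value in a configuration file can be compared against the three values listed in the quoted reply. This is a minimal sketch with hypothetical function names. It assumes the file parses with the Python version running the check, and it only encodes the list from the quoted reply; it says nothing about whether a custom value works in a particular installation.

```python
import ast

# The three values listed in the quoted reply.
ALLOWED_MTA = ("Postfix", "Manual", None)


def mta_setting(config_source):
    """Return the last literal value assigned to MTA at module level.

    Raises KeyError if the source never assigns MTA.
    """
    found = False
    value = None
    for node in ast.parse(config_source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "MTA":
                    value = ast.literal_eval(node.value)
                    found = True
    if not found:
        raise KeyError("MTA")
    return value


def mta_is_allowed(config_source):
    """True if the MTA setting is one of the values from the quoted reply."""
    return mta_setting(config_source) in ALLOWED_MTA


# Possible usage, with the path mentioned above:
# with open("/etc/mailman/mm_cfg.py") as cfg:
#     print(mta_is_allowed(cfg.read()))
```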

The good thing is that there is a relatively recent bug open on Debian that
might be closed if we manage to root-cause and solve this.

I just did a bit of cleanup tonight, after I realized the server was almost
full, or at least the partition that hosts the Mailman queues and logs. Would
we see anything specific if it ran out of space?
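
This does not answer what Mailman logs when a partition fills up. It is only a minimal sketch for checking how much room is left where the queues and logs live. The function name and the 5% threshold are arbitrary assumptions; the paths in the usage comment are the ones mentioned in this thread.

```python
import shutil


def free_space_report(paths, min_free_fraction=0.05):
    """Return {path: (free_bytes, is_low)} for each given path.

    is_low is True when the free fraction of the filesystem holding the
    path is below min_free_fraction (an arbitrary, assumed threshold).
    """
    report = {}
    for path in paths:
        usage = shutil.disk_usage(path)
        # Guard against pseudo-filesystems that report a total size of 0.
        free_fraction = usage.free / usage.total if usage.total else 0.0
        report[path] = (usage.free, free_fraction < min_free_fraction)
    return report


# Possible usage:
# for path, (free, low) in free_space_report(
#         ["/var/lib/mailman", "/var/log/mailman"]).items():
#     print(path, free, "LOW" if low else "ok")
```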

Thank you for your help.

-- 
Jérôme

