[Mailman-Users] mailman gets stuck, stops sending messages

Steven J. Owens mailman-user at darksleep.com
Tue Nov 16 10:46:48 CET 2004


Hi,

Mailman seems to randomly stop sending messages.  

I'm running:

     debian woody
     postfix 2.1.3-1
     mailman 2.1.5
     python 2.3.4-1

My only real suspicions are that either mailmanctl isn't being started,
or that stale lock files are building up in the lock directory.

Restarting mailman with "/etc/init.d/mailman restart" or
"/var/lib/mailman/bin/mailmanctl restart" seems to get things working
again, but I'm kind of bothered by this, since I have no idea why it
stopped, and I tend to not notice it's stopped until a day or two go
by (most of my lists are intermittent traffic, 30-40 users). 

Should I put in a cron job to restart mailman every hour or so?
What else can I do?


Background:

This problem started sometime in the past year.  Before that,
for a couple years, mailman worked reliably and we were quite happy
with it.

It's entirely possible that the problem may have happened after an
upgrade.  It happens intermittently, so it's hard to tell.

This last time around, I spent several hours digging through list
archives and reading FAQs, trying to figure out what's going on.
I followed FAQ 3.14, Troubleshooting: no mail going out to list
members:

http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq03.014.htp

There are various things that are different about a debian apt mailman
installation:

- the "mailman" account is named "list"
- there is no /home/lists or /home/mailman
- mailman's files are located in:
  /var/lib/mailman
  /usr/lib/mailman
  /usr/share/mailman

My first thought was to check the log files, which on debian are in
/var/log/mailman (I think this is soft-linked from /var/lib/mailman).
However, the log files showed nothing.  The last successful message
was on Nov 11, the error log file's last line was from Nov 1.

I have plenty of file space, checked with "df -h", got 1.5 GB
available.

I have plenty of inodes, checked with "df -i", got 1.5M IFree.
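
The two df checks above can also be scripted.  A minimal sketch,
assuming a current Python 3 (not the 2.3.4 listed above); the function
name is made up for illustration:

```python
import os


def free_space_report(path="/"):
    """Return (free_bytes, free_inodes) for the filesystem holding path.

    Both figures are what an unprivileged user may still use, which is
    roughly what "df -h" (Avail) and "df -i" (IFree) show.
    """
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_frsize  # free blocks * fragment size
    free_inodes = st.f_favail               # free inodes for non-root
    return free_bytes, free_inodes
```

Calling free_space_report("/var/lib/mailman") would report on whichever
filesystem holds the Mailman data.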


0) check_perms showed all sorts of weird issues, but that may be
related to the general weirdness of the debian apt installation of
mailman.  

     list@darksleep:/etc$ /var/lib/mailman/bin/check_perms
     /var/lib/mailman/mail bad group (has: root, expected list)
     /var/lib/mailman/Mailman bad group (has: root, expected list)
     /var/lib/mailman/cron bad group (has: root, expected list)
     /var/lib/mailman/bin bad group (has: root, expected list)
     /var/lib/mailman/scripts bad group (has: root, expected list)
     /var/lib/mailman/logs bad group (has: root, expected list)
     /var/lib/mailman/templates bad group (has: root, expected list)
     /var/lib/mailman/cgi-bin bad group (has: root, expected list)
     Problems found: 8
     Re-run as list (or root) with -f flag to fix

However, su-ing to root and re-running check_perms with -f did not fix
the problems (though it reported that it had).
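
One way to see independently of check_perms whether the group really
changed is to look at each path directly.  A minimal sketch in
Python 3; it prints nothing itself and the function names are made up
for illustration.  It reports the group of the path itself and of
whatever it points to, on the guess (not confirmed here) that some of
these entries are soft links:

```python
import grp
import os


def group_name(gid):
    """Translate a numeric gid to a name, falling back to the number."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def group_report(paths):
    """Return (path, group of the entry itself, group of its target)."""
    report = []
    for path in paths:
        link_gid = os.lstat(path).st_gid   # the entry itself, link or not
        target_gid = os.stat(path).st_gid  # what it resolves to
        report.append((path, group_name(link_gid), group_name(target_gid)))
    return report
```

Feeding it the eight paths from the check_perms output before and after
the -f run would show whether anything was actually changed.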

So I skipped this step and checked the others.

1) Cron:

darksleep:/var/lib/mailman# ps -aux |grep cron |grep -v grep
root      8550  0.0  0.0  1756  500 ?        S    Sep20   0:04 /usr/sbin/cron

Looks like cron's running.

2) Aliases

The aliases are all there in /etc/aliases.  Somebody on #postfix told
me I should also run:

postalias /etc/aliases

Which I did, but still no change.

3) Smrsh, skipped this step, since I'm not using redhat or sendmail.

4) Interface, again, I'm not running sendmail, and in any event I'm
pretty sure that the MTA's okay, since it gets used a fair bit every
day and has shown no sign of problems.

5) qrunner

su-ing to "list" and running "crontab -l" showed no jobs at all,
but I found the mailman files in:
     /etc/cron.d/mailman

In any event, both /var/lib/mailman/bin/version and "dpkg -l mailman"
say I'm running 2.1.5 (dpkg says 2.1.5.3), and this section of the
FAQ says:

   If you are running Mailman 2.1.x then the qrunners are daemons that
   are started by $prefix/bin/mailmanctl, which itself may be being 
   run via a 'mailman' startup script. This is described in the 
   INSTALL document for the version of MM you are running.

I can't find any INSTALL document with:
     dpkg -L mailman | fgrep INSTALL

(Warning, don't do "dpkg -L mailman<enter>", there are over 3000 files
in the mailman package :-).

I can't see a mailmanctl daemon with:
     ps -aux| grep mailmanctl |grep -v grep

I'm not sure what's going on.  At first I was excited, because I
figured the absence of a mailmanctl process meant that was the
problem.  When I did "mailmanctl start", messages waiting in the 
queue were delivered, and they appear to still be getting through
now, a couple hours later.

However, on a closer re-reading of the FAQ paragraph quoted above, it
doesn't really say that there's supposed to be a mailmanctl process
running.  It doesn't say much of anything, really.

There's nothing about mailmanctl in /etc/cron.d/mailman.
There's nothing about mailmanctl in /var/log/mailman/*.
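
Instead of grepping ps output, the check for a running mailmanctl could
be done against a PID file.  A minimal sketch in Python 3, assuming
mailmanctl records its PID in a file such as
/var/lib/mailman/data/master-qrunner.pid (that path is an assumption
based on the stock Mailman 2.1 layout and is not verified for this
Debian package); the function names are made up for illustration:

```python
import os


def pid_is_alive(pid):
    """True if a process with this PID exists.

    Signal 0 delivers nothing; it only asks the kernel whether the
    target process exists and whether we may signal it.
    """
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def master_is_running(pidfile):
    """Read a PID file and report whether that process still exists."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable, or garbled PID file
    return pid_is_alive(pid)
```

A cron job could call master_is_running() and restart Mailman only when
it returns False, rather than restarting blindly every hour.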

6) Locks

There are definitely lock files in /var/lib/mailman/locks, and they
definitely have process IDs that don't show up in "ps -aux".  But I'm
not sure that's the _problem_, since things start and messages go
through, even with the lock files there.

Nevertheless, I removed the old lock files, since they all date from
May, September, etc.
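
Finding lock files whose process is gone can be scripted instead of
comparing against "ps -aux" by hand.  A minimal sketch in Python 3,
assuming the per-process lock files are named
<lockname>.<hostname>.<pid>.<counter> (an assumption about Mailman
2.1's naming, so check it against the real directory first).  It only
lists candidates and deletes nothing, and the PID test only means
something for locks created on this host:

```python
import os


def pid_from_lock_name(filename):
    """Extract the PID from '<lockname>.<hostname>.<pid>.<counter>'.

    Returns None for names that do not end in two numeric fields,
    such as the plain '<listname>.lock' file.
    """
    parts = filename.rsplit(".", 2)
    if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
        return int(parts[1])
    return None


def find_stale_locks(lock_dir):
    """Return lock file names whose owning PID no longer exists."""
    stale = []
    for name in sorted(os.listdir(lock_dir)):
        pid = pid_from_lock_name(name)
        if pid is None or pid <= 0:
            continue
        try:
            os.kill(pid, 0)  # signal 0: existence check only
        except ProcessLookupError:
            stale.append(name)
        except PermissionError:
            pass  # process exists under another user, so not stale
    return stale
```

Running find_stale_locks("/var/lib/mailman/locks") as the list user or
root would give the same answer as the manual check above.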

7) Logs

The only thing I can find in the logs that looks suspicious is:

qrunner:
----------------------------------------------------------------------
Nov 11 16:19:27 2004 (1060) OutgoingRunner qrunner started.
Nov 11 16:19:27 2004 (1061) IncomingRunner qrunner started.
Nov 11 18:30:26 2004 (1060) OutgoingRunner qrunner caught SIGTERM.  Stopping.
Nov 11 18:30:26 2004 (1060) OutgoingRunner qrunner exiting.
Nov 11 18:30:26 2004 (1061) IncomingRunner qrunner caught SIGTERM.  Stopping.
Nov 11 18:30:33 2004 (1061) IncomingRunner qrunner exiting.
----------------------------------------------------------------------
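
Since I don't notice a stoppage for a day or two, a small script could
read the qrunner log and flag runners whose most recent entry is not a
start.  A minimal sketch in Python 3, assuming every line follows the
format of the excerpt above; the function names are made up for
illustration:

```python
import re

# Matches e.g. "Nov 11 16:19:27 2004 (1060) OutgoingRunner qrunner started."
LINE_RE = re.compile(
    r"^(?P<when>\w{3} +\d+ \d\d:\d\d:\d\d \d{4}) "
    r"\((?P<pid>\d+)\) (?P<runner>\w+) qrunner (?P<event>.+)$"
)


def last_runner_states(lines):
    """Map each runner name to its most recent (timestamp, event)."""
    states = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            states[m.group("runner")] = (m.group("when"), m.group("event"))
    return states


def stopped_runners(lines):
    """Return runners whose latest log entry is anything but a start."""
    states = last_runner_states(lines)
    return sorted(
        name for name, (_, event) in states.items()
        if not event.startswith("started")
    )
```

Fed the six lines above, stopped_runners() returns both IncomingRunner
and OutgoingRunner, which matches what happened on Nov 11.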

locks:
----------------------------------------------------------------------
Nov 10 16:37:14 2004 (1606) 2004-November-thread.lock lifetime has expired, breaking
Nov 10 16:37:14 2004 (1606)   File "/var/lib/mailman/bin/qrunner", line 270, in ?
Nov 10 16:37:14 2004 (1606)     main()
Nov 10 16:37:14 2004 (1606)   File "/var/lib/mailman/bin/qrunner", line 230, in main
Nov 10 16:37:14 2004 (1606)     qrunner.run()
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 70, in run
Nov 10 16:37:14 2004 (1606)     filecnt = self._oneloop()
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 111, in _oneloop
Nov 10 16:37:14 2004 (1606)     self._onefile(msg, msgdata)
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 167, in _onefile
Nov 10 16:37:14 2004 (1606)     keepqueued = self._dispose(mlist, msg, msgdata)
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Queue/ArchRunner.py", line 73, in _dispose
Nov 10 16:37:14 2004 (1606)     mlist.ArchiveMail(msg)
Nov 10 16:37:14 2004 (1606)   File "/var/lib/mailman/Mailman/Archiver/Archiver.py", line 215, in ArchiveMail
Nov 10 16:37:14 2004 (1606)     h.processUnixMailbox(f)
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 569, in processUnixMailbox
Nov 10 16:37:14 2004 (1606)     self.add_article(a)
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 615, in add_article
Nov 10 16:37:14 2004 (1606)     article.parentID = parentID = self.get_parent_info(arch, article)
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Archiver/pipermail.py", line 649, in get_parent_info
Nov 10 16:37:14 2004 (1606)     if parentID and not self.database.hasArticle(archive, parentID):
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Archiver/HyperDatabase.py", line 273, in hasArticle
Nov 10 16:37:14 2004 (1606)     self.__openIndices(archive)
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Archiver/HyperDatabase.py", line 251, in __openIndices
Nov 10 16:37:14 2004 (1606)     t = DumbBTree(os.path.join(arcdir, archive + '-' + i))
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Archiver/HyperDatabase.py", line 61, in __init__
Nov 10 16:37:14 2004 (1606)     self.lock()
Nov 10 16:37:14 2004 (1606)   File "/usr/lib/mailman/Mailman/Archiver/HyperDatabase.py", line 77, in lock
Nov 10 16:37:14 2004 (1606)     self.lockfile.lock()
Nov 10 16:37:14 2004 (1606)   File "/var/lib/mailman/Mailman/LockFile.py", line 306, in lock
Nov 10 16:37:14 2004 (1606)     important=True)
Nov 10 16:37:14 2004 (1606)   File "/var/lib/mailman/Mailman/LockFile.py", line 416, in __writelog
Nov 10 16:37:14 2004 (1606)     traceback.print_stack(file=logf)
Nov 12 08:00:02 2004 (20710) beehiverefugees.lock lifetime has expired, breaking 
Nov 12 08:00:02 2004 (20710)   File "/usr/lib/mailman/cron/checkdbs", line 178, in ?
Nov 12 08:00:02 2004 (20710)     main()
Nov 12 08:00:02 2004 (20710)   File "/usr/lib/mailman/cron/checkdbs", line 84, in main
Nov 12 08:00:02 2004 (20710)     mlist = MailList.MailList(name)
Nov 12 08:00:02 2004 (20710)   File "/var/lib/mailman/Mailman/MailList.py", line 126, in __init__
Nov 12 08:00:02 2004 (20710)     self.Lock()
Nov 12 08:00:02 2004 (20710)   File "/var/lib/mailman/Mailman/MailList.py", line 159, in Lock
Nov 12 08:00:02 2004 (20710)     self.__lock.lock(timeout)
Nov 12 08:00:02 2004 (20710)   File "/var/lib/mailman/Mailman/LockFile.py", line 306, in lock
Nov 12 08:00:02 2004 (20710)     important=True)
Nov 12 08:00:02 2004 (20710)   File "/var/lib/mailman/Mailman/LockFile.py", line 416, in __writelog
Nov 12 08:00:02 2004 (20710)     traceback.print_stack(file=logf)
----------------------------------------------------------------------


-- 
Steven J. Owens
puff at darksleep.com

"I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt." - http://darksleep.com/notablog



