[Mailman-Users] Messages got stuck in "in" queue due to one "bad"message

Mark Sapiro msapiro at value.net
Wed Feb 21 00:37:12 CET 2007


Xueshan Feng wrote:
>
>We are running mailman 2.1.9. Recently we had a problem that mailman's
>Incoming qrunner died hard, which caused messages accumulated under
>qfiles/in directory.  Restarting mailman didn't help. We finally
>identified a message on top of the queue, moved it aside, restarted
>mailman again, then mail will start flow. 
>
>Putting that message back to the in queue will trigger the problem. The
>sender actually tried to send it again, and it caused the same problem.
>
>Before I got into the details of mailman logs, I'd like to ask:
>
>1. What's the best way to detect "stalled" condition like this? We do
>monitor qrunner processes but in this case it didn't help because the
>qrunners were still running. Now I put in a script to monitor the number
>of files in "in" queue, and if it reaches a threshold, sends alert. 


Actually, If I understand the situation correctly, IncomingRunner will
not be running. It will die and be restarted, but after the 10th
restart/die, it won't restart again (but this seems incorrect - see
below).


>2. Why this message didn't get moved to shunt directory?


I don't know why it didn't shunt. From the traceback, the message
should have shunted and that should have been the end of it. Also, the
problem with decode_rfc2231 in email.Utils that seems to be the
underlying issue here was fixed in email 2.5.8 which shipped with
Mailman 2.1.9 and should be in Mailman's pythonlib directory. Your
traceback says you are getting the email library from your Python 2.3
installation. This is not correct and should not happen if Mailman is
properly installed.

If I understand the actual problem with the message, it is caused by an
error in email 2.5.7 and earlier and is precipitated by a message with
an apostrophe (') in the subject.


>3. Can the mailman recover itself without human intervention? 


I don't understand why the message didn't shunt. I also don't
understand why IncomingRunner was restarted more than 10 times. Have
you changed

MAX_RESTARTS = 10

in bin/mailmanctl?

Was there a second traceback with only one date/time header following
the one you report? If so, what was it?

Have you changed Mailman/Queue/Runner.py in any way?

>
>Details are followed:


More comments below:


>The following error message will repeat a couple of times in error log
>when this message was processed: 
>
>
>Feb 20 07:55:53 2007 qrunner(25722): Traceback (most recent call last):
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/var/lib/mailman/bin/qrunner", line 278, in ?
>Feb 20 07:55:53 2007 qrunner(25722):      main()
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/var/lib/mailman/bin/qrunner", line 238, in main
>Feb 20 07:55:53 2007 qrunner(25722):      qrunner.run()
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run
>Feb 20 07:55:53 2007 qrunner(25722):      filecnt = self._oneloop()
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/mailman/Mailman/Queue/Runner.py", line 100, in _oneloop
>Feb 20 07:55:53 2007 qrunner(25722):      msg, msgdata =
>self._switchboard.dequeue(filebase)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/mailman/Mailman/Queue/Switchboard.py", line 164, in dequeue
>Feb 20 07:55:53 2007 qrunner(25722):      msg =
>email.message_from_string(msg, Message.Message)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/__init__.py", line 52, in message_from_string


This and subsequent email modules should come from
/usr/lib/mailman/pythonlib/email, not from /usr/lib/python2.3/email.


>Feb 20 07:55:53 2007 qrunner(25722):      return Parser(_class,
>strict=strict).parsestr(s)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Parser.py", line 75, in parsestr
>Feb 20 07:55:53 2007 qrunner(25722):      return
>self.parse(StringIO(text), headersonly=headersonly)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Parser.py", line 64, in parse
>Feb 20 07:55:53 2007 qrunner(25722):      self._parsebody(root, fp,
>firstbodyline)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Parser.py", line 240, in _parsebody
>Feb 20 07:55:53 2007 qrunner(25722):      msgobj = self.parsestr(part)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Parser.py", line 75, in parsestr
>Feb 20 07:55:53 2007 qrunner(25722):      return
>self.parse(StringIO(text), headersonly=headersonly)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Parser.py", line 64, in parse
>Feb 20 07:55:53 2007 qrunner(25722):      self._parsebody(root, fp,
>firstbodyline)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Parser.py", line 146, in _parsebody
>Feb 20 07:55:53 2007 qrunner(25722):      boundary =
>container.get_boundary()
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Message.py", line 743, in get_boundary
>Feb 20 07:55:53 2007 qrunner(25722):      boundary =
>self.get_param('boundary', missing)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Message.py", line 608, in get_param
>Feb 20 07:55:53 2007 qrunner(25722):      for k, v in
>self._get_params_preserve(failobj, header):
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Message.py", line 555, in _get_params_preserve
>Feb 20 07:55:53 2007 qrunner(25722):      params =
>Utils.decode_params(params)
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Utils.py", line 337, in decode_params
>Feb 20 07:55:53 2007 qrunner(25722):      charset, language, value =
>decode_rfc2231(EMPTYSTRING.join(value))
>Feb 20 07:55:53 2007 qrunner(25722):   File
>"/usr/lib/python2.3/email/Utils.py", line 284, in decode_rfc2231
>Feb 20 07:55:53 2007 qrunner(25722):      charset, language, s = parts
>Feb 20 07:55:53 2007 qrunner(25722): ValueError :  unpack list of wrong
>size


Is there a second, different traceback immediately following this?



>The qrunner log shows:
>Feb 20 07:55:52 2007 (27933) Master qrunner detected subprocess exit
>(pid: 27941, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:53 2007 (25722) IncomingRunner qrunner started.
>Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25722, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:53 2007 (25723) IncomingRunner qrunner started.
>Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25723, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:53 2007 (25724) IncomingRunner qrunner started.
>Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25724, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:53 2007 (25725) IncomingRunner qrunner started.
>Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25725, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:53 2007 (25726) IncomingRunner qrunner started.
>Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25726, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:54 2007 (25727) IncomingRunner qrunner started.
>Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25727, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:54 2007 (25728) IncomingRunner qrunner started.
>Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25728, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:54 2007 (25729) IncomingRunner qrunner started.
>Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25729, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:54 2007 (25731) IncomingRunner qrunner started.
>Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25731, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]
>Feb 20 07:55:55 2007 (25734) IncomingRunner qrunner started.
>Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit
>(pid: 25734, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
>[restarting]


Why more than 10 restarts?

There is a post at
<http://mail.python.org/pipermail/mailman-developers/2007-February/019426.html>
about a somewhat different problem with a similar result. There is a
patch attached to that post which may help this situation, but the
real issues here are two: Why isn't your Mailman using pythonlib, and
Why did IncomingRunner die and not shunt the message?

-- 
Mark Sapiro <msapiro at value.net>       The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan



More information about the Mailman-Users mailing list