[Mailman-Users] Messages got stuck in "in" queue due to one "bad"message
Xueshan Feng
sfeng at stanford.edu
Wed Feb 21 04:02:02 CET 2007
On Tue, 2007-02-20 at 15:37 -0800, Mark Sapiro wrote:
> Xueshan Feng wrote:
> >
> >We are running mailman 2.1.9. Recently we had a problem that mailman's
> >Incoming qrunner died hard, which caused messages accumulated under
> >qfiles/in directory. Restarting mailman didn't help. We finally
> >identified a message on top of the queue, moved it aside, restarted
> >mailman again, then mail will start flow.
> >
> >Putting that message back to the in queue will trigger the problem. The
> >sender actually tried to send it again, and it caused the same problem.
> >
> >Before I got into the details of mailman logs, I'd like to ask:
> >
> >1. What's the best way to detect "stalled" condition like this? We do
> >monitor qrunner processes but in this case it didn't help because the
> >qrunners were still running. Now I put in a script to monitor the number
> >of files in "in" queue, and if it reaches a threshold, sends alert.
>
>
> Actually, If I understand the situation correctly, IncomingRunner will
> not be running. It will die and be restarted, but after the 10th
> restart/die, it won't restart again (but this seems incorrect - see
> below).
The qrunner log shows they died, but if I do ps -ef |grep
IncomingRunner, they are still there.
>
>
> >2. Why this message didn't get moved to shunt directory?
>
>
> I don't know why it didn't shunt. From the traceback, the message
> should have shunted and that should have been the end of it. Also, the
> problem with decode_rfc2231 in email.Utils that seems to be the
> underlying issue here was fixed in email 2.5.8 which shipped with
> Mailman 2.1.9 and should be in Mailman's pythonlib directory. Your
> traceback says you are getting the email library from your Python 2.3
> installation. This is not correct and should not happen if Mailman is
> properly installed.
You nailed it!
We indeed re-packaged Mailman with a lot of Stanford's own patches. When
we upgrade from 2.1.8 to 2.1.9, I missed Mailman's own pythonlib/email
installation in Debian rule file.
I just re-packaged it and tested the new package. The message that
caused the problem now is accepted without a problem!
>
> >3. Can the mailman recover itself without human intervention?
> I don't understand why the message didn't shunt. I also don't
> understand why IncomingRunner was restarted more than 10 times. Have
> you changed
>
> MAX_RESTARTS = 10
>
> in bin/mailmanctl?
No that's not changed. It is still 10.
>
> Was there a second traceback with only one date/time header following
> the one you report? If so, what was it?
Yes, there were more similar traceback in the logs. There are 10
actually.
>
> Have you changed Mailman/Queue/Runner.py in any way?
Not that program.
>
> >
> >Details are followed:
>
>
> More comments below:
>
>
> >The following error message will repeat a couple of times in error log
> >when this message was processed:
> >
> >
> >Feb 20 07:55:53 2007 qrunner(25722): Traceback (most recent call last):
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/var/lib/mailman/bin/qrunner", line 278, in ?
> >Feb 20 07:55:53 2007 qrunner(25722): main()
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/var/lib/mailman/bin/qrunner", line 238, in main
> >Feb 20 07:55:53 2007 qrunner(25722): qrunner.run()
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run
> >Feb 20 07:55:53 2007 qrunner(25722): filecnt = self._oneloop()
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/mailman/Mailman/Queue/Runner.py", line 100, in _oneloop
> >Feb 20 07:55:53 2007 qrunner(25722): msg, msgdata =
> >self._switchboard.dequeue(filebase)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/mailman/Mailman/Queue/Switchboard.py", line 164, in dequeue
> >Feb 20 07:55:53 2007 qrunner(25722): msg =
> >email.message_from_string(msg, Message.Message)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/__init__.py", line 52, in message_from_string
>
>
> This and subsequent email modules should come from
> /usr/lib/mailman/pythonlib/email, not from /usr/lib/python2.3/email.
>
>
> >Feb 20 07:55:53 2007 qrunner(25722): return Parser(_class,
> >strict=strict).parsestr(s)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Parser.py", line 75, in parsestr
> >Feb 20 07:55:53 2007 qrunner(25722): return
> >self.parse(StringIO(text), headersonly=headersonly)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Parser.py", line 64, in parse
> >Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp,
> >firstbodyline)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Parser.py", line 240, in _parsebody
> >Feb 20 07:55:53 2007 qrunner(25722): msgobj = self.parsestr(part)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Parser.py", line 75, in parsestr
> >Feb 20 07:55:53 2007 qrunner(25722): return
> >self.parse(StringIO(text), headersonly=headersonly)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Parser.py", line 64, in parse
> >Feb 20 07:55:53 2007 qrunner(25722): self._parsebody(root, fp,
> >firstbodyline)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Parser.py", line 146, in _parsebody
> >Feb 20 07:55:53 2007 qrunner(25722): boundary =
> >container.get_boundary()
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Message.py", line 743, in get_boundary
> >Feb 20 07:55:53 2007 qrunner(25722): boundary =
> >self.get_param('boundary', missing)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Message.py", line 608, in get_param
> >Feb 20 07:55:53 2007 qrunner(25722): for k, v in
> >self._get_params_preserve(failobj, header):
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Message.py", line 555, in _get_params_preserve
> >Feb 20 07:55:53 2007 qrunner(25722): params =
> >Utils.decode_params(params)
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Utils.py", line 337, in decode_params
> >Feb 20 07:55:53 2007 qrunner(25722): charset, language, value =
> >decode_rfc2231(EMPTYSTRING.join(value))
> >Feb 20 07:55:53 2007 qrunner(25722): File
> >"/usr/lib/python2.3/email/Utils.py", line 284, in decode_rfc2231
> >Feb 20 07:55:53 2007 qrunner(25722): charset, language, s = parts
> >Feb 20 07:55:53 2007 qrunner(25722): ValueError : unpack list of wrong
> >size
>
>
> Is there a second, different traceback immediately following this?
>
>
>
> >The qrunner log shows:
> >Feb 20 07:55:52 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 27941, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25722) IncomingRunner qrunner started.
> >Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25722, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25723) IncomingRunner qrunner started.
> >Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25723, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25724) IncomingRunner qrunner started.
> >Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25724, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25725) IncomingRunner qrunner started.
> >Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25725, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25726) IncomingRunner qrunner started.
> >Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25726, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:54 2007 (25727) IncomingRunner qrunner started.
> >Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25727, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:54 2007 (25728) IncomingRunner qrunner started.
> >Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25728, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:54 2007 (25729) IncomingRunner qrunner started.
> >Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25729, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:54 2007 (25731) IncomingRunner qrunner started.
> >Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25731, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:55 2007 (25734) IncomingRunner qrunner started.
> >Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25734, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
>
>
> Why more than 10 restarts?
Hmm, that's 11. I don't know. The following line followed the 11th
restarting:
Feb 20 07:55:55 2007 (27933) Qrunner IncomingRunner reached maximum
restart limit of 10, not restarting.
>
> There is a post at
> <http://mail.python.org/pipermail/mailman-developers/2007-February/019426.html>
> about a somewhat different problem with a similar result. There is a
> patch attached to that post which may help this situation, but the
> real issues here are two: Why isn't your Mailman using pythonlib, and
> Why did IncomingRunner die and not shunt the message?
Is it possible because of it was using python2.3's email lib instead of
Mailman's own?
Thanks a lot! The problem is fixed after the Mailman's own
pythonlib/email is installed.
Xueshan
--
>
--
------------------------------
Xueshan Feng (aka. Susan Feng)
Shared Services, ITSS
Stanford University, CA 94305-3090
255 Panama St.
Room 157, Polya Hall
Stanford University
Stanford, CA 94305-4136
More information about the Mailman-Users
mailing list