[Mailman-Users] Messages got stuck in "in" queue due to one "bad"message

Xueshan Feng sfeng at stanford.edu
Wed Feb 21 04:02:02 CET 2007


On Tue, 2007-02-20 at 15:37 -0800, Mark Sapiro wrote:
> Xueshan Feng wrote:
> >
> >We are running mailman 2.1.9. Recently we had a problem that mailman's
> >Incoming qrunner died hard, which caused messages accumulated under
> >qfiles/in directory.  Restarting mailman didn't help. We finally
> >identified a message on top of the queue, moved it aside, restarted
> >mailman again, then mail will start flow. 
> >
> >Putting that message back to the in queue will trigger the problem. The
> >sender actually tried to send it again, and it caused the same problem.
> >
> >Before I got into the details of mailman logs, I'd like to ask:
> >
> >1. What's the best way to detect "stalled" condition like this? We do
> >monitor qrunner processes but in this case it didn't help because the
> >qrunners were still running. Now I put in a script to monitor the number
> >of files in "in" queue, and if it reaches a threshold, sends alert. 
> 
> 
> Actually, If I understand the situation correctly, IncomingRunner will
> not be running. It will die and be restarted, but after the 10th
> restart/die, it won't restart again (but this seems incorrect - see
> below).

The qrunner log shows they died, but if I do ps -ef |grep
IncomingRunner, they are still there. 


> 
> 
> >2. Why this message didn't get moved to shunt directory?
> 
> 
> I don't know why it didn't shunt. From the traceback, the message
> should have shunted and that should have been the end of it. Also, the
> problem with decode_rfc2231 in email.Utils that seems to be the
> underlying issue here was fixed in email 2.5.8 which shipped with
> Mailman 2.1.9 and should be in Mailman's pythonlib directory. Your
> traceback says you are getting the email library from your Python 2.3
> installation. This is not correct and should not happen if Mailman is
> properly installed.

You nailed it! 

We indeed re-packaged Mailman with a lot of Stanford's own patches. When
we upgrade from 2.1.8 to 2.1.9, I missed  Mailman's own pythonlib/email
installation in Debian rule file.

I just re-packaged it and tested the new package. The message that
caused the problem now is accepted without a problem!


> 
> >3. Can the mailman recover itself without human intervention? 

> I don't understand why the message didn't shunt. I also don't
> understand why IncomingRunner was restarted more than 10 times. Have
> you changed
> 
> MAX_RESTARTS = 10
> 
> in bin/mailmanctl?

No that's not changed. It is still 10. 


> 
> Was there a second traceback with only one date/time header following
> the one you report? If so, what was it?

Yes, there were more similar traceback in the logs. There are 10
actually.

> 
> Have you changed Mailman/Queue/Runner.py in any way?

Not that program.


> 
> >
> >Details are followed:
> 
> 
> More comments below:
> 
> 
> >The following error message will repeat a couple of times in error log
> >when this message was processed: 
> >
> >
> >Feb 20 07:55:53 2007 qrunner(25722): Traceback (most recent call last):
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/var/lib/mailman/bin/qrunner", line 278, in ?
> >Feb 20 07:55:53 2007 qrunner(25722):      main()
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/var/lib/mailman/bin/qrunner", line 238, in main
> >Feb 20 07:55:53 2007 qrunner(25722):      qrunner.run()
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run
> >Feb 20 07:55:53 2007 qrunner(25722):      filecnt = self._oneloop()
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/mailman/Mailman/Queue/Runner.py", line 100, in _oneloop
> >Feb 20 07:55:53 2007 qrunner(25722):      msg, msgdata =
> >self._switchboard.dequeue(filebase)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/mailman/Mailman/Queue/Switchboard.py", line 164, in dequeue
> >Feb 20 07:55:53 2007 qrunner(25722):      msg =
> >email.message_from_string(msg, Message.Message)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/__init__.py", line 52, in message_from_string
> 
> 
> This and subsequent email modules should come from
> /usr/lib/mailman/pythonlib/email, not from /usr/lib/python2.3/email.
> 
> 
> >Feb 20 07:55:53 2007 qrunner(25722):      return Parser(_class,
> >strict=strict).parsestr(s)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Parser.py", line 75, in parsestr
> >Feb 20 07:55:53 2007 qrunner(25722):      return
> >self.parse(StringIO(text), headersonly=headersonly)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Parser.py", line 64, in parse
> >Feb 20 07:55:53 2007 qrunner(25722):      self._parsebody(root, fp,
> >firstbodyline)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Parser.py", line 240, in _parsebody
> >Feb 20 07:55:53 2007 qrunner(25722):      msgobj = self.parsestr(part)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Parser.py", line 75, in parsestr
> >Feb 20 07:55:53 2007 qrunner(25722):      return
> >self.parse(StringIO(text), headersonly=headersonly)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Parser.py", line 64, in parse
> >Feb 20 07:55:53 2007 qrunner(25722):      self._parsebody(root, fp,
> >firstbodyline)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Parser.py", line 146, in _parsebody
> >Feb 20 07:55:53 2007 qrunner(25722):      boundary =
> >container.get_boundary()
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Message.py", line 743, in get_boundary
> >Feb 20 07:55:53 2007 qrunner(25722):      boundary =
> >self.get_param('boundary', missing)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Message.py", line 608, in get_param
> >Feb 20 07:55:53 2007 qrunner(25722):      for k, v in
> >self._get_params_preserve(failobj, header):
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Message.py", line 555, in _get_params_preserve
> >Feb 20 07:55:53 2007 qrunner(25722):      params =
> >Utils.decode_params(params)
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Utils.py", line 337, in decode_params
> >Feb 20 07:55:53 2007 qrunner(25722):      charset, language, value =
> >decode_rfc2231(EMPTYSTRING.join(value))
> >Feb 20 07:55:53 2007 qrunner(25722):   File
> >"/usr/lib/python2.3/email/Utils.py", line 284, in decode_rfc2231
> >Feb 20 07:55:53 2007 qrunner(25722):      charset, language, s = parts
> >Feb 20 07:55:53 2007 qrunner(25722): ValueError :  unpack list of wrong
> >size
> 
> 
> Is there a second, different traceback immediately following this?
> 
> 
> 
> >The qrunner log shows:
> >Feb 20 07:55:52 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 27941, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25722) IncomingRunner qrunner started.
> >Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25722, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25723) IncomingRunner qrunner started.
> >Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25723, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25724) IncomingRunner qrunner started.
> >Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25724, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25725) IncomingRunner qrunner started.
> >Feb 20 07:55:53 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25725, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:53 2007 (25726) IncomingRunner qrunner started.
> >Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25726, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:54 2007 (25727) IncomingRunner qrunner started.
> >Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25727, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:54 2007 (25728) IncomingRunner qrunner started.
> >Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25728, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:54 2007 (25729) IncomingRunner qrunner started.
> >Feb 20 07:55:54 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25729, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:54 2007 (25731) IncomingRunner qrunner started.
> >Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25731, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> >Feb 20 07:55:55 2007 (25734) IncomingRunner qrunner started.
> >Feb 20 07:55:55 2007 (27933) Master qrunner detected subprocess exit
> >(pid: 25734, sig: None, sts: 1, class: IncomingRunner, slice: 1/4)
> >[restarting]
> 

> 
> Why more than 10 restarts?

Hmm, that's 11. I don't know. The following line followed the 11th
restarting:

Feb 20 07:55:55 2007 (27933) Qrunner IncomingRunner reached maximum
restart limit of 10, not restarting.


> 
> There is a post at
> <http://mail.python.org/pipermail/mailman-developers/2007-February/019426.html>
> about a somewhat different problem with a similar result. There is a
> patch attached to that post which may help this situation, but the
> real issues here are two: Why isn't your Mailman using pythonlib, and
> Why did IncomingRunner die and not shunt the message?

Is it possible because of it was using python2.3's email lib instead of
Mailman's own?

Thanks a lot! The problem is fixed after the Mailman's own
pythonlib/email is installed.

Xueshan
--


> 
-- 
------------------------------
Xueshan Feng (aka. Susan Feng)            

Shared Services, ITSS
Stanford University, CA 94305-3090

255 Panama St.
Room 157, Polya Hall
Stanford University
Stanford, CA 94305-4136





More information about the Mailman-Users mailing list