[Mailman-Users] Admin email on errors?

Mark Sapiro msapiro at value.net
Mon Mar 20 20:19:02 CET 2006


Kurt Werle wrote:
>
><quote who="Mark Sapiro">
>> Kurt Werle wrote:
>>
>>
>>> I have twice got the following error:
>>> ---
>>> Mar 14 22:41:58 2006 mailmanctl(54): The master qrunner lock could not
>>> be acquired.  It appears as though there is a stale master qrunner lock.
>>> Try re-running mailmanctl with the -s flag.
>>> Mar 14 22:41:58 2006 mailmanctl(54):
>>> Mar 14 22:56:02 2006 mailmanctl(48): The master qrunner lock could not
>>> be acquired, because it appears as if some process on some other host may
>>> have acquired it.  We can't test for stale locks across host boundaries,
>>> so you'll have to do this manually.  Or, if you know the lock is stale,
>>> re-run mailmanctl with the -s flag. ---
>>
>> These errors are the result of a 'mailmanctl start' when mailmanctl was
>> either already running or died in some 'unclean' way. Are there any
>> messages in the 'error' or 'qrunner' logs that might illuminate the
>> problem that caused you to want to do the 'mailmanctl start' in the first
>> place?
>
>Not that I can see - though it looks like it did some thrashing AFTER
>writing that log.


Those messages come from bin/mailmanctl's acquire_lock() function which
is only called at the beginning of processing a 'mailmanctl start'
command. Thus, I stand by my statement above.


>qrunner...
>Mar 17 09:18:22 2006 (9277) VirginRunner qrunner started.
>Mar 17 09:18:22 2006 (9276) OutgoingRunner qrunner started.
>Mar 17 09:18:23 2006 (9274) IncomingRunner qrunner started.
>Mar 17 09:18:23 2006 (9275) NewsRunner qrunner started.
>Mar 17 09:18:23 2006 (9278) RetryRunner qrunner started.
>Mar 17 09:18:23 2006 (9271) ArchRunner qrunner started.
>Mar 17 09:18:23 2006 (9273) CommandRunner qrunner started.
>Mar 17 09:18:23 2006 (9272) BounceRunner qrunner started.
>Mar 17 09:34:34 2006 (9270) Master watcher caught SIGTERM.  Exiting.


This indicates a 'mailmanctl stop' command or some other event resulted
in a SIGTERM being sent to the running mailmanctl. (And are these
messages from two and a half days later supposed to be related to
those above?)


>Mar 17 09:34:34 2006 (9271) ArchRunner qrunner caught SIGTERM.  Stopping.
>Mar 17 09:34:34 2006 (9271) ArchRunner qrunner exiting.
>Mar 17 09:34:34 2006 (9272) BounceRunner qrunner caught SIGTERM.  Stopping.
>Mar 17 09:34:34 2006 (9272) BounceRunner qrunner exiting.
>Mar 17 09:34:34 2006 (9273) CommandRunner qrunner caught SIGTERM.  Stopping.
>Mar 17 09:34:34 2006 (9273) CommandRunner qrunner exiting.
>Mar 17 09:34:34 2006 (9274) IncomingRunner qrunner caught SIGTERM.  Stopping.
>Mar 17 09:34:34 2006 (9274) IncomingRunner qrunner exiting.
>Mar 17 09:34:34 2006 (9275) NewsRunner qrunner caught SIGTERM.  Stopping.
>Mar 17 09:34:34 2006 (9275) NewsRunner qrunner exiting.
>Mar 17 09:34:34 2006 (9276) OutgoingRunner qrunner caught SIGTERM.  Stopping.
>Mar 17 09:34:34 2006 (9276) OutgoingRunner qrunner exiting.


And these are the result of the subsequent normal shutdown.


>> And, if not and mailmanctl died ungracefully, it's unlikely to be
>> able to successfully send you an email about the situation.
>
>You're telling me that it has the presence of mind to write an error log,
>but can't do the equivalent of
>echo mailman died | mail -s 'mailman died' $ADMIN
>
>I'm not buying that.


No. I'm telling you that your original post included nothing about why
mailmanctl or any qrunner stopped in the first place. Thus I had no
evidence that anything was written when it stopped. All I had were the
messages from the subsequent start attempts and the fact that, if it had
in fact died, it did so without removing the lock file.

So were there log messages about the original termination prior to the
start attempts at Mar 14 22:41:58 2006 and Mar 14 22:56:02 2006?


>>> Has anyone
>>> hacked it in?  Do I have to write a cron job that will poll the process
>>> list to see if mailman is still running?  Has anyone written that
>>> already?
>>
>> There are posts in the mailman-users archives about this.
>
>I did some searching, but couldn't find them.


I know I bring this on myself by doing it so much, but I don't like
being used as a search engine for the Mailman FAQ and mailman-users
archives.

Try the entire thread that begins at
<http://mail.python.org/pipermail/mailman-users/2005-May/044888.html>
which I found fairly quickly with
<http://www.google.com/search?q=site:mail.python.org++inurl:mailman-users++cron+restart+mailmanctl>.
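
For the cron-job approach asked about above, the general idea can be
sketched in a few lines of Python. This is only an illustration, not
code from Mailman or from the thread linked above. It assumes the master
process records its pid in a file (the path shown in the usage note is a
guess for a default install), that a local SMTP server is listening on
localhost, and that the names pid_is_running, master_is_running,
check_and_alert, and mail_admin are hypothetical helpers made up for
this example.

```python
import errno
import os
import smtplib
from email.mime.text import MIMEText


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups
    try:
        os.kill(pid, 0)  # signal 0 only checks; nothing is delivered
    except OSError as e:
        # EPERM: the process exists but is owned by another user.
        return e.errno == errno.EPERM
    return True


def master_is_running(pidfile):
    """Return True if pidfile names a live process."""
    try:
        with open(pidfile) as fp:
            pid = int(fp.read().strip())
    except (IOError, OSError, ValueError):
        return False  # missing, unreadable, or garbled pid file
    return pid_is_running(pid)


def check_and_alert(pidfile, send_alert):
    """Call send_alert(text) if the master does not appear to be running."""
    if master_is_running(pidfile):
        return True
    send_alert('mailmanctl does not appear to be running '
               '(checked %s)' % pidfile)
    return False


def mail_admin(text, admin='root@localhost'):
    """Send the alert through the local SMTP server (assumed to exist)."""
    msg = MIMEText(text)
    msg['Subject'] = 'mailman died'
    msg['From'] = admin
    msg['To'] = admin
    server = smtplib.SMTP('localhost')
    try:
        server.sendmail(admin, [admin], msg.as_string())
    finally:
        server.quit()
```

A cron entry could run a small script that calls something like
check_and_alert('/usr/local/mailman/data/master-qrunner.pid', mail_admin),
with the path adjusted to the local install. Because the check runs
outside Mailman, it still works when mailmanctl died too abruptly to
report anything itself.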


>> The real solution is to find the underlying problem and fix it so
>> Mailman doesn't die.
>
>I agree that software shouldn't crash.  I disagree that it won't crash.  I
>insist that when server software crashes, it should send mail to an admin.


It's an open source project. We're all volunteers. Feel free to
implement whatever you need. Insisting that others do the work won't
get you very far.

-- 
Mark Sapiro <msapiro at value.net>       The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan



