[Mailman-Developers] List locks not getting relinquished

Barry Warsaw barry at python.org
Tue Jan 20 19:02:54 EST 2004


On Mon, 2004-01-19 at 12:43, Chris Boulter wrote:

> I'm having some problems with lists not getting unlocked. I don't really
> know how to set about debugging this and I haven't found anything in this
> list's archive or the documentation to help. My Mailman installation is
> quite non-standard, so it may be hard to provide help, but I'd appreciate
> hints about how to debug the locking mechanism (for instance, can I tell
> which process created a lock?).

I was going to suggest LIST_LOCK_DEBUGGING, but that only turns on
logging for the lock that the MailList object creates.  That /ought/ to
be sufficient to get gobs of output, but you can also try hacking
LockFile.py to set the default withlogging argument on the constructor
to True.  That way, you'll see all the output from all lock operations. 
Probably won't help much though. ;)

> I'm using Mailman 2.1.2. We have sync_members running in a loop from a
> daemon, syncing ~500 lists every few hours. Of these, half a dozen lists are
> failing (and my daemon has to forcibly kill the sync_members process when it
> detects a timeout). 

I want to know what "failing" means.  Do you get tracebacks?  Does the
process just hang?  Does it work but none of the changes persist?

Now, if the process is hanging, and you're running them on a live site,
with messages being processed, cron jobs running, people hitting the web
site, well, that's not unexpected!  The locks are there specifically so
that only one process can modify a list's configuration at a time. 
Mailman tries to be careful to only acquire a list lock when it needs
write access to the mailing list.  Say your sync_members script tries to
lock a list at the same time someone is sending a message through the
list.  It's entirely possible the incoming qrunner process will get the
lock, causing sync_members to block until the lock is available.

Lock acquisition order is completely non-deterministic too.  If multiple
processes lay claim to the same list lock, you've no way of determining
in which order or when each process will acquire it.  The default
setting is for no timeouts on the lock acquisition, so a waiting process
just keeps waiting until it gets the lock.
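
To make the blocking behavior concrete, here is a rough sketch of a
hard-link style lock with an optional timeout.  This is not the code in
LockFile.py; the names acquire, release and LockTimeout, the claim file
naming and the polling interval are all made up for illustration.

```python
import os
import socket
import time


class LockTimeout(Exception):
    """Raised when the lock is not acquired before the deadline."""


def acquire(lockfile, timeout=None, interval=0.25):
    """Claim lockfile by hard-linking a per-process claim file to it.

    timeout=None means wait forever.  Returns the claim file path.
    (Hypothetical helper, not Mailman's API.)
    """
    claim = "%s.%s.%d" % (lockfile, socket.gethostname(), os.getpid())
    with open(claim, "w") as fp:
        fp.write(claim)
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            # link() fails if lockfile already exists, so at most one
            # claimant can succeed at a time.
            os.link(claim, lockfile)
            return claim
        except FileExistsError:
            if deadline is not None and time.monotonic() >= deadline:
                os.unlink(claim)
                raise LockTimeout(lockfile)
            # Someone else holds the lock: sleep and retry.  Which
            # waiter wins next depends purely on timing.
            time.sleep(interval)


def release(lockfile, claim):
    """Drop the lock and remove this process's claim file."""
    os.unlink(lockfile)
    os.unlink(claim)
```

With timeout=None the loop never gives up, which is what a ^C in the
middle of time.sleep() would look like from the outside.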

> There seems to be no common feature to the lists which
> fail. When it happens, I can forcibly break the locks and restore correct
> behaviour by manually deleting Mailman/locks/*, but then the problem recurs
> eventually. However, once I've broken the locks, I can successfully run
> sync_members on the lists from the command line hundreds or thousands of
> times without failure, either making changes to the subscribers or not each
> time.

This jibes with the above scenario.

> So maybe it's something about the daemon which doesn't relinquish a lock.

Or maybe it just takes a long time to release the lock, or for the
sync_members process to acquire it.  Turning on debugging should give
you output as to which process is laying claim to the lock, which
process gets the lock, and when the process releases it.  You can always
find the process number by looking at the lock files.  The one with the
lock will have a hard link to the generic lock file.  The pid will also
be in the lock file.
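
If you'd rather not eyeball the directory listing, something like the
following sketch can pick out the holder.  It assumes the per-process
claim files sit next to the generic lock file and share its name as a
prefix; lock_holders is a hypothetical helper, not part of Mailman.

```python
import glob
import os


def lock_holders(lockfile):
    """Return (path, contents) for sibling files hard-linked to lockfile."""
    target = os.stat(lockfile)
    holders = []
    # Assumed naming: claim files look like "<lockfile>.<something>".
    for path in sorted(glob.glob(glob.escape(lockfile) + ".*")):
        try:
            if not os.path.samestat(os.stat(path), target):
                continue  # a waiter's claim file, not the holder's
            with open(path) as fp:
                holders.append((path, fp.read().strip()))
        except FileNotFoundError:
            continue  # file vanished while we were looking
    return holders
```

Files that match the prefix but are not hard links to the lock file
belong to processes that are still waiting.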

> The command being run by the daemon, and the stacktrace when I ran the
> command manually and hit ^C are below (nb I've modified sync_members and
> added an '--ignore-invalid' option, so the line numbers below aren't correct
> for the standard distro).
> 
> 
> /usr/local/mailman/bin/sync_members --ignore-invalid --welcome-msg=no
> --goodbye-msg=no --notifyadmin=no --file /var/tmp/mlmsync124897.tmp
> latin.america
> ^CTraceback (most recent call last):
>   File "/usr/local/mailman/bin/sync_members", line 301, in ?
>     main()
>   File "/usr/local/mailman/bin/sync_members", line 234, in main
>     mlist = MailList.MailList(listname)
>   File "/usr/local/mailman/Mailman/MailList.py", line 122, in __init__
>     self.Lock()
>   File "/usr/local/mailman/Mailman/MailList.py", line 155, in Lock
>     self.__lock.lock(timeout)
>   File "/usr/local/mailman/Mailman/LockFile.py", line 312, in lock
>     self.__sleep()
>   File "/usr/local/mailman/Mailman/LockFile.py", line 496, in __sleep
>     time.sleep(interval)
> KeyboardInterrupt

You've interrupted the process while it's sleeping trying to acquire the
lock.  This has to mean either another live process has the lock
(perhaps it's taking a long time to do something), or there's a stale
lock around.  Mailman is very careful to release the lock when it's no
longer necessary, but if something happens like a process gets kill -9'd
or Python core dumps, or your machine crashes, then a stale lock can
result.  It's easy to find out if the lock is stale by ps'ing the pid
that last acquired it.
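
The same check can be scripted.  A minimal sketch, assuming a Unix
host and that you've already pulled the pid out of the lock file
(pid_is_alive is a made-up name):

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this pid exists on this host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks, nothing is delivered
    except ProcessLookupError:
        return False     # no such process: the lock is stale
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

This only means something when run on the machine that took the lock,
and a recycled pid can make a stale lock look live, so treat a True
result as a hint rather than proof.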

HTH,
-Barry




