[Mailman-Developers] List locks not getting relinquished

Chris Boulter chris at jellybaby.net
Wed Jan 21 12:01:41 EST 2004


On Tue 2004-01-20 19:02:54 -0500, Barry Warsaw wrote:
> On Mon, 2004-01-19 at 12:43, Chris Boulter wrote:
> > I'm using Mailman 2.1.2. We have sync_members running in a loop from a
> > daemon, syncing ~500 lists every few hours. Of these, half a dozen lists are
> > failing (and my daemon has to forcibly kill the sync_members process when it
> > detects a timeout). 
> 
> I want to know what "failing" means.  Do you get tracebacks?  Does the
> process just hang?  Does it work but none of the changes persist?

The sync_members process on the affected lists just hangs. I assume it gets
to the same point as it does when I manually run sync_members, i.e. it's
sleeping waiting for the lock. However, it never actually obtains the lock.
Currently I have my daemon kill the sync_members process five minutes after
starting it if it hasn't finished. Before I introduced this, I saw
sync_members hanging for several hours (at which point I would intervene and
kill it manually). So I don't think it's just down to a slow-running process
(unless it's reeeeally slow).
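
For reference, a minimal sketch of that kind of timeout wrapper, written in
modern Python 3 rather than whatever my daemon actually runs. The function
name run_with_timeout and the grace period are hypothetical. It sends SIGTERM
first and only falls back to SIGKILL if the child ignores it; whether
sync_members actually releases its lock on SIGTERM is an assumption I have not
verified.

```python
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout=300.0, grace=10.0):
    """Run cmd; on timeout send SIGTERM, then SIGKILL after a grace period.

    Returns the child's exit status (on POSIX, the negative signal number
    if the child was terminated by a signal).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM: the child may still run cleanup code
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: nothing in the child runs after this
        return proc.wait()
```

The daemon would call this with the sync_members command line it already
uses, and could log any negative return value as a forced termination.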

When sync_members hangs, I also find that I can't log in to the web
interface for that list (the browser sits spinning after I enter the site
password). I think this also points to an unrelinquished lock: the CGI
program can't get the list lock it needs.

> Now, if the process is hanging, and you're running them on a live site,
> with messages being processed, cron jobs running, people hitting the web
> site, well, that's not unexpected!

Certainly that could explain an occasional hang, but I see it every time
with certain lists, until I manually break the locks by deleting the lock
file. This happens both on our live site and on a development machine (which
rarely has any users except me).

> You've interrupted the process while it's sleeping trying to acquire the
> lock.  This has to mean either another live process has the lock
> (perhaps it's taking a long time to do something), or there's a stale
> lock around.  Mailman is very careful to release the lock when it's no
> longer necessary, but if something happens like a process gets kill -9'd
> or Python core dumps, or your machine crashes, then a stale lock can
> result.  It's easy to find out if the lock is stale by ps'ing the pid
> that last acquired it.

This is interesting. Presumably a lock named
        entrepreneurship.club.lock.pavo.9897.0
would have been acquired by pid 9897. Looking at my locks, they do seem to
have been acquired by processes which no longer exist (possibly sync_members
processes started by my daemon then killed after the timeout).
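
A rough sketch of that check in modern Python 3, assuming the lock name
really is laid out as <base>.lock.<hostname>.<pid>.<counter>. That layout is
inferred from the one example above, not taken from Mailman's source, and both
helper names are hypothetical.

```python
import os


def parse_lock_name(filename):
    """Split '<base>.lock.<host>.<pid>.<counter>' into its parts.

    The layout is a guess based on a single observed lock name.
    """
    base, sep, owner = filename.rpartition(".lock.")
    if not sep:
        raise ValueError("not a per-process lock name: %r" % filename)
    # Split from the right so a dotted hostname stays in one piece.
    host, pid, counter = owner.rsplit(".", 2)
    return base, host, int(pid), int(counter)


def pid_alive(pid):
    """Return True if a process with this pid exists on the local machine."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

The pid is only meaningful when the hostname in the lock name is the machine
the check runs on, and a recycled pid can make a stale lock look live, so a
True result is weaker evidence than a False one.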

I had hoped Mailman might resolve this by ignoring, or deleting, locks older
than a certain age. I guess it doesn't have such an option though (I couldn't
find one in Defaults.py).

Also, I've boldly gone in and deleted Mailman/locks/*. If Mailman expects to
find a master lock file and make links to it, could this approach to
breaking locks be dangerous?

> HTH,

Yes indeed. Thanks for your detailed comments. It sounds like I might have
to do penance with the logger though.

Chris


