[Mailman-Developers] Re: [Mailman-Users] Big problems with stale lockfiles on large list...

Wed May 2 05:52:03 CEST 2001

>>>>> "CVR" == Chuq Von Rospach <chuqui at plaidworks.com> writes:

    >> However, if I hit the stop button before the page is finished
    >> loading, I can see that the CGI process continues to run for a
    >> while and then it may or may not clear the locks.

    CVR> That would match something I've been seeing and sorta trying
    CVR> to debug (except it seems for me it only happens when I'm not
    CVR> watching, of course).

Same here, the first 15 or so times I tried to recreate the bug.  I
had nearly as long a message already composed that contained a
different evaluation and summarized as "works for me".  Had to delete
that and start over when I actually /did/ reproduce it. ;)

    >> So my next approach is to write a very minimal signal handler
    >> that only unlocks the list, and install this on SIGTERM.

    CVR> That'd work, but I suggest an alternative, or a second item:
    CVR> when you try to set a lock, and one is set, see if the
    CVR> process that set the lock still exists (the info is available
    CVR> in the locks/ dir with a bit of poking). If that process is
    CVR> gone, delete the lock and move forward.

    CVR> That way, both sides of the equation can fix the problem if
    CVR> needed.

I've avoided that because of NFS issues, i.e. if you've got multiple
Mailman installations sharing an NFS partition, the pids aren't
relevant.  The program can't know that, but the sysadmin can, so I'm
inclined to instead write a script that will zap old locks if their
processes don't exist.  That way the site admin can run that script
as he sees fit based on his installation.
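
A minimal sketch of such a lock-zapping script might look like the
following.  The lock file naming (<anything>.<hostname>.<pid>) and the
names pid_exists and zap_stale_locks are assumptions made for
illustration, not a description of Mailman's actual lock layout.  It
only touches locks stamped with the local hostname, so locks written
by another host sharing the partition are left alone.

```python
import os
import socket


def pid_exists(pid):
    """Return True if a process with this pid exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def zap_stale_locks(lockdir, hostname=None, is_alive=pid_exists, dry_run=False):
    """Remove lock files whose owning local process is gone.

    ASSUMPTION: lock files are named <anything>.<hostname>.<pid>.
    Returns the list of file names that were (or would be) removed.
    """
    hostname = hostname or socket.gethostname()
    removed = []
    for name in sorted(os.listdir(lockdir)):
        base, _, pid_text = name.rpartition(".")
        if not pid_text.isdigit() or int(pid_text) <= 0:
            continue  # does not look like a lock file
        if not base.endswith("." + hostname):
            continue  # written by another host: its pid means nothing here
        if is_alive(int(pid_text)):
            continue  # owner is still running
        if not dry_run:
            os.unlink(os.path.join(lockdir, name))
        removed.append(name)
    return removed
```

Running it first with dry_run=True lets the admin see what would be
removed before anything is deleted.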

    >> If you've read this far, the implication is that if the user
    >> hits the stop button, Mailman will in essence abort any changes
    >> to list configuration that this invocation may have made.

    CVR> As it should, IMHO. The only caveat, I think, is that you
    CVR> need to look through the code for places where breaking in
    CVR> the middle can leave you with incomplete or corrupted data,
    CVR> and protect those pieces from breakage, and handle the
    CVR> interrupt once you leave them.

    CVR> If you can be sure that won't happen, great. But I'd make
    CVR> double-sure...

I think the only critical section is MailList.Save(), or more
accurately, MailList.__save().  But even here I think you're as safe
as possible because Mailman writes new state using the following
algorithm:

- open a config.db.tmp.hostname.pid file

- write the new state to this temp file

- unlink config.db.last

- create a hard link config.db <-> config.db.last

- atomically rename config.db.tmp.hostname.pid to config.db

If you get the SIGTERM during any of those steps, I think you're still
guaranteed to have a valid config.db or config.db.last file, and in
the presence of config.db being MIA, Mailman automatically falls back
to config.db.last (and if config.db.last is MIA, config.db should
still be valid and in place).  It's possible that the new state in the
tmp file won't become current, but that's what I meant by the abort
implication, and I think that's fine (and actually correct semantics
-- I agree with you Chuq).  If you get the signal in the middle of
writing the config.db.tmp file, then oh well, it's corrupt, but it'll
never be made the current state.
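
The steps above can be sketched roughly as follows.  This is an
illustration of the sequence, not the code in MailList.__save(); the
function names save_state and load_state are hypothetical, and the
state is treated as opaque bytes rather than Mailman's real
serialization.

```python
import os
import socket


def save_state(dbfile, data):
    """Write new state so that dbfile or dbfile.last is always valid."""
    tmp = "%s.tmp.%s.%d" % (dbfile, socket.gethostname(), os.getpid())
    last = dbfile + ".last"
    # Steps 1 and 2: write the new state to a private temp file.
    with open(tmp, "wb") as fp:
        fp.write(data)
    # Step 3: drop the old backup, if there is one.
    try:
        os.unlink(last)
    except FileNotFoundError:
        pass
    # Step 4: hard link the current file to the backup name.
    try:
        os.link(dbfile, last)
    except FileNotFoundError:
        pass  # first save ever: nothing to back up yet
    # Step 5: atomically make the temp file the current state.
    os.rename(tmp, dbfile)


def load_state(dbfile):
    """Read dbfile, falling back to dbfile.last if it is missing."""
    for candidate in (dbfile, dbfile + ".last"):
        try:
            with open(candidate, "rb") as fp:
                return fp.read()
        except FileNotFoundError:
            continue
    raise FileNotFoundError(dbfile)
```

An interruption before the rename leaves the previous config.db in
place; the only thing lost is the temp file.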

You have to be careful but fast when you get that SIGTERM because
three seconds later you're getting SIGKILLed and at that point, you're
screwed.  I think we're safe, at least for the config.db files.  I
need to make sure that other files like request.db are safe from
corruption (I actually think this one might be vulnerable because it
doesn't take the same precautions as with config.db).
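
For reference, a minimal SIGTERM handler of the kind discussed above
might look like this.  DemoLock is a hypothetical stand-in for the
list lock (it is not Mailman's lock class), and
install_unlock_on_sigterm is likewise an illustrative name.  The
handler does nothing but release the lock and exit, so unsaved
changes are simply dropped.

```python
import signal
import sys


class DemoLock:
    """Hypothetical stand-in for a list lock, for illustration only."""

    def __init__(self):
        self.locked = False

    def lock(self):
        self.locked = True

    def unlock(self):
        self.locked = False


def install_unlock_on_sigterm(lock):
    """Install a handler that unlocks and exits; return the old handler."""

    def handler(signum, frame):
        # Keep this short: a SIGKILL may follow within seconds.
        lock.unlock()
        sys.exit(0)  # unsaved changes are discarded

    return signal.signal(signal.SIGTERM, handler)
```

Note that signal.signal() can only be called from the main thread,
which is fine for a CGI script.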

-Barry