[Mailman-Developers] Re: [Mailman-Users] Big problems with stale lockfiles on large list...

Wed May 2 06:28:03 CEST 2001

On 5/1/01 8:52 PM, "Barry A. Warsaw" <barry at digicool.com> wrote:

> I've avoided that because of NFS issues, i.e. if you've got multiple
> Mailman installations sharing an NFS partition, the pids aren't
> relevant. 

If you have that, don't you have chaos anyway? Is the create&link lock style
reliable over NFS in the first place? Isn't putting locks from multiple
machines in the same directory just a plain old bad idea?
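
For reference, the create-and-link style of lock can be sketched roughly as follows. This is an illustration in modern Python, not Mailman's actual locking code; the names acquire, release and owner are made up for the example. The idea is to create a uniquely named file, hard-link it to the shared lock name, and then trust the link count rather than the return value of link(), since the reply can get lost over NFS.

```python
import os
import socket


def acquire(lockfile, owner=None):
    """Try once to take `lockfile`; return our private file name, or None.

    `owner` is a hypothetical unique tag; by default it is hostname.pid.
    """
    if owner is None:
        owner = "%s.%d" % (socket.gethostname(), os.getpid())
    mine = "%s.%s" % (lockfile, owner)
    with open(mine, "w") as f:
        f.write(mine)
    try:
        os.link(mine, lockfile)
    except OSError:
        pass  # do not trust the result; the link count below decides
    if os.stat(mine).st_nlink == 2:
        return mine  # our file has two names, so the lock is ours
    os.unlink(mine)
    return None


def release(lockfile, mine):
    """Drop the lock taken by acquire()."""
    os.unlink(lockfile)
    os.unlink(mine)
```

Note that nothing here notices when the owner dies, which is exactly the stale-lock problem under discussion.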

I ran into a strange little problem today -- I'm using time() to generate a
filename for a temporary directory. Works great; until you start running
multiple processes on a 2 CPU machine. I started having two processes get
the same time() value (which is impossible on a single CPU system) and fight
over the same directory. I'm now doing a random() based sleep to get away
from this. 
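
A more robust alternative to a time()-based name plus a random sleep is to let the directory creation itself settle the race. A minimal sketch, assuming a modern Python where tempfile.mkdtemp() is available (make_work_dir and its parameters are illustrative names only):

```python
import os
import tempfile


def make_work_dir(parent=None, prefix="work-"):
    """Create a fresh private directory and return its path.

    mkdtemp() picks a random name and creates the directory with a call
    that fails if the name already exists, retrying with a new name, so
    two processes cannot end up sharing one directory.
    """
    return tempfile.mkdtemp(prefix=prefix, dir=parent)
```

The same effect can be had by hand: call os.mkdir() on a candidate name and pick another name whenever it fails because the directory already exists.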

It seems to me that sharing a single directory for locks over NFS is asking
for the same kind of weirdie problems I got to track down today... NFS
changes the paradigms enough, especially about atomic operations, that I'm
worried you're asking for issues here. I'd either put locks on a local disk,
or make sure each machine has its own non-shared directory. And if you do,
the proc information will be relevant....

I really think the lock-setting code needs some form of "this is dead, break
it" code in it -- that solves 99% of the problem, really. I know you can't
depend on flock(), so that the kernel manages locks, but perhaps the config
code could test for it and use it if it exists and fall back to the current
system if it doesn't?
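
The "use flock() if it exists" idea might look something like the sketch below. It is written in modern Python; try_lock, unlock and HAVE_FLOCK are hypothetical names, and the fallback branch is only a placeholder for whatever the current lock code does. The attraction is that the kernel drops the lock when the holder dies, so there is nothing stale to break. Whether flock() behaves over NFS depends on the platform, so this is best kept to a local disk.

```python
import os

try:
    import fcntl
    HAVE_FLOCK = hasattr(fcntl, "flock")
except ImportError:  # platform without fcntl
    fcntl = None
    HAVE_FLOCK = False


def try_lock(path):
    """Try to take an exclusive lock without blocking.

    Returns an open file descriptor on success, None if the lock is held.
    """
    if not HAVE_FLOCK:
        # Placeholder: this is where the existing lock scheme would be used.
        raise NotImplementedError("no flock(); use the current lock code")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o660)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd


def unlock(fd):
    """Release a lock returned by try_lock()."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```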

> The program can't know that, but the sysadmin can, so I'm
> inclined to instead write a script that will zap old locks if their
> processes don't exist.

I've thought about that, also, but it seems like a duct tape solution to me.
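
For what it's worth, the check such a script (or the lock code itself) would make is small. A rough sketch in modern Python, assuming lock files are named something.hostname.pid in the same way as the temp file mentioned below; that naming, and the function names, are assumptions made for the example:

```python
import errno
import os
import socket


def pid_alive(pid):
    """Return True if a process with this pid exists on this machine."""
    try:
        os.kill(pid, 0)  # signal 0 only checks; nothing is delivered
    except OSError as e:
        return e.errno == errno.EPERM  # exists, but belongs to someone else
    return True


def stale_locks(lock_dir, hostname=None):
    """List lock files from this host whose owning process is gone."""
    hostname = hostname or socket.gethostname()
    stale = []
    for name in os.listdir(lock_dir):
        prefix, _, pid = name.rpartition(".")
        if not pid.isdigit() or int(pid) <= 0:
            continue
        if not prefix.endswith("." + hostname):
            continue  # another machine's lock; its pid means nothing here
        if not pid_alive(int(pid)):
            stale.append(os.path.join(lock_dir, name))
    return stale
```

The hostname test is what keeps this honest on a shared directory: locks from other machines are left alone because their pids cannot be checked locally.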

> I think the only critical section is MailList.Save(), or more
> accurately, MailList.__save().  But even here I think you're as safe
> as possible because Mailman writes new state using the following
> algorithm:
> 
> - open a config.db.tmp.hostname.pid file

Okay, but... What if we go away after this is created? What is in charge of
cleaning up leftovers?  Realize I'm stretching a point here -- but in an
extreme case, if nothing cleans this stuff up, you have a two-pronged denial
of service attack. One would be when all of the .pid numbers have temp files
created, so future attempts start failing, the other is when you have enough
of the tmp files that the disk fills up... Either long-term neglect or a
motivated dinker could shut a list server down....
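
One cheap way to keep leftovers from piling up is an age-based sweep, run from cron or at startup. A minimal sketch in modern Python; the config.db.tmp.* pattern comes from the file name quoted above, while remove_old_tmp, max_age and the one-hour default are made up for the example:

```python
import glob
import os
import time


def remove_old_tmp(list_dir, max_age=3600, now=None):
    """Delete config.db.tmp.* files untouched for more than max_age seconds.

    Returns the list of paths that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for path in glob.glob(os.path.join(list_dir, "config.db.tmp.*")):
        try:
            if now - os.path.getmtime(path) > max_age:
                os.unlink(path)
                removed.append(path)
        except OSError:
            pass  # the file vanished or is not ours to remove; skip it
    return removed
```

The age limit has to be comfortably longer than any legitimate save, or the sweep will pull a temp file out from under a live writer.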

> You have to be careful but fast when you get that SIGTERM because
> three seconds later you're getting SIGKILLed and at that point, you're
> screwed. 

Have you considered forking and detaching for the write? At that point, you
could daemonize a sub-process to do the actual DB update, and the parent
handles talking to the user, so if it's aborted, it won't be killed. At some
point, you pass a go/no-go point and if it's go, you can safely detach from
the user and isolate yourself so you know you'll finish....
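
Roughly, the fork-and-detach step could look like this. It is a sketch in modern Python using the usual double-fork idiom, not a drop-in patch; run_detached and task are illustrative names, and task stands in for the actual DB update:

```python
import os


def run_detached(task):
    """Run task() in a process cut loose from the caller's session.

    The caller returns right away and can keep talking to the user.
    """
    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)  # reap the short-lived middle process
        return
    # Middle process: start a new session, then fork again and exit so
    # the worker is re-parented and has no controlling terminal.
    os.setsid()
    if os.fork() > 0:
        os._exit(0)
    # Worker: past the go/no-go point. Do the write, then exit without
    # running any of the caller's cleanup handlers.
    try:
        task()
    finally:
        os._exit(0)
```

A signal aimed at the original process, or at its process group, no longer reaches the worker, so an aborted request cannot interrupt the write halfway through.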