[Mailman-Developers] race condition in locking ?

Thomas Wouters thomas@xs4all.net
Mon, 7 Feb 2000 22:44:58 +0100


--oj4kGyHlBMXGt3Le
Content-Type: text/plain; charset=us-ascii

On Mon, Feb 07, 2000 at 10:01:36PM +0100, Harald Meland wrote:

> My current CVS checkout is available ("live" :) under <URL:
> http://www.uio.no/~hmeland/tmp/mailman-userdb/>, with <URL:
> http://www.uio.no/~hmeland/tmp/mailman-userdb/Mailman/LockFile.py>
> being of special interest in this discussion...

Ah, very cool. You'll find my current version of LockFile.py attached, for
reference, as you asked. I think the link() solutions, as opposed to the
rename() solution, is a better one, if you'll premit me. For one, you can
check on lock ownership by statting your own lockfile, which should only be
touched by you, instead of reading a possibly highly volatile lockfile. You
can also doublecheck the lockfile inode, see if it is equal to your own
lockfiles' inode, to confirm that you have the lock.

> > I kept the API the same with one teensy exception. With the new
> > locking scheme, it's not possible to 'steal' a lock, only to break
> > the lock and then enter the lock() cycle to try and grab it.

> Umm, I'm not sure that would go down very well with all of (current)
> Mailman, as it would mean you'd have to reload any data that might
> have changed on disk (marshals etc.) if someone else got the lock
> before you...

> I've been taking the approach that the parts of Mailman that actually
> call steal() have to know what they're doing (and grepping seemed to
> indicate that that was correct -- it's mostly used by forked-off
> children to steal locks held by the parent), and thus steal() is
> awfully blunt and just overwrites the current lock.

Ah, well, I couldn't find any parts of Mailman in the CVS tree that acually
_called_ steal, so I couldn't figure out what it should do in borderline
cases. However, reading your LockFile.py made me realise how to steal the
lock. LockFile.steal() now does its utmost best to steal the lock, only
failing if someone else is busy stealing the lock from us. It returns 1 for
success, 0 for failure, mostly because I wanted it to signify that in some
way.

There still are race conditions in there, however, and I doubt we'll ever
get rid of them. You assume the parent knows what it is doing, as where I
assume the parent _knew_ what it is doing, but can be wrong by now ;) Most
of these cases are, granted, _very_ low probability.

Well, let me know if you have any questions or comments regarding my
implementation.

-- 
Thomas Wouters <thomas@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

--oj4kGyHlBMXGt3Le
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="LockFile.py"

# Copyright (C) 1998 by the Free Software Foundation, Inc.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software 
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.


"""Portable (?) file locking with timeouts.  

This code should work with all versions of NFS.  The algorithm was suggested
by the GNU/Linux open() man page.  Make sure no malicious people have access
to link() to the lock file.
"""

import socket, os, time
import string
import errno
#from stat import ST_NLINK
ST_NLINK = 3                                      # faster
ST_CTIME = 9
ST_INO   = 1


# default intervals are both specified in seconds, and can be floating point
# values.  DEFAULT_HUNG_TIMEOUT specifies the default length of time that a
# lock is expecting to hold the lock -- this can be set in the constructor, or 
# changed via a mutator.  DEFAULT_SLEEP_INTERVAL is the amount of time to
# sleep before checking the lock's status, if we were not the winning claimant 
# the previous time around.
DEFAULT_LOCK_LIFETIME   = 15
DEFAULT_SLEEP_INTERVAL  = 1.0
DEFAULT_INODE_GRACETIME = 2

from Mailman.Logging.StampedLogger import StampedLogger
_logfile = None



# exceptions which can be raised
class LockError(Exception):
    """Base class for all exceptions in this module."""
    pass

class AlreadyLockedError(LockError):
    """Raised when a lock is attempted on an already locked object."""
    pass

class NotLockedError(LockError):
    """Raised when an unlock is attempted on an objec that isn't locked."""
    pass

class TimeOutError(LockError):
    """Raised when a lock was attempted, but did not succeed in the given
    amount of time.
    """
    pass

class StaleLockFileError(LockError):
    """Raised when a stale hardlink lock file was found."""
    pass



class LockFile:
    """A portable way to lock resources by way of the file system."""
    def __init__(self, lockfile,
                 lifetime=DEFAULT_LOCK_LIFETIME,
                 sleep_interval=DEFAULT_SLEEP_INTERVAL,
                 withlogging=0):
        """Creates a lock file using the specified file.

        lifetime is the maximum length of time expected to keep this lock.
        This value is written into the lock file so that other claimants on
        the lock know when it is safe to break the lock, should the lock
        holder be wedged.

        sleep_interval is roughly how often to wake up and check the lock file

        """
        self.__lockfile = lockfile
        self.__lifetime = lifetime
        self.__sleep_interval = sleep_interval + (((os.getpid() % 20) - 11.5)/100.)
        self.__tmpfname = "%s.%s.%d" % (lockfile,
                                        socket.gethostname(),
                                        os.getpid())
        self.__withlogging = withlogging
        self.__writelog("starting")

    def set_lifetime(self, lifetime):
        """Reset the lifetime of the lock.
        Takes affect the next time the file is locked.
        """
        self.__lifetime = lifetime

    def __writelog(self, msg):
        global _logfile
        if self.__withlogging:
            if not _logfile:
                _logfile = StampedLogger('locks', manual_reprime=1)
            head, tail = os.path.split(self.__lockfile)
            _logfile.write('%s %s\n' % (tail, msg))

    def refresh(self, newlifetime=None):
        """Refresh the lock.

        This writes a new release time into the lock file.  Use this if a
        process suddenly realizes it needs more time to do its work.  With
        optional newlifetime, this resets the lock lifetime value too.

        NotLockedError is raised if we don't already own the lock.
        """
        if not self.locked():
            raise NotLockedError
        if newlifetime is not None:
            self.set_lifetime(newlifetime)
        self.__write()
        self.__writelog('lock lifetime refreshed for %d seconds' %
                        self.__lifetime)

    def __del__(self):
        if self.locked():
            self.unlock()

    def __write(self):
        # we expect to release our lock some time in the future.  we want to
        # give other claimants some clue as to when we think we're going to be
        # done with it, so they don't try to steal it out from underneath us
        # unless we've actually been wedged.
        lockrelease = time.time() + self.__lifetime
        # make sure it's group writable
        oldmask = os.umask(002)
        try:
            fp = open(self.__tmpfname, 'w')
            fp.write('%d %s %f\n' % (os.getpid(),
                                     self.__tmpfname,
                                     lockrelease))
            fp.close()
        finally:
            os.umask(oldmask)

    def __read(self):
        # can raise ValueError in three situations:
        #
        # - The lockfile does not exist (it's been released or broken)
        # - We didn't get a sequence of the right size from split
        # - the first 'word' is not an integer
        # In each case we raise ValueError in response
        try:
            fp = open(self.__lockfile, 'r')
        except IOError, err:
            if err.errno == errno.ENOENT:
                raise ValueError # invalid contents
        try:
            pid, winner, lockrelease = string.split(string.strip(fp.read()))
        finally:
            fp.close()
        return int(pid), winner, float(lockrelease)

    def __break(self):
       try:
           if os.stat(self.__lockfile)[ST_CTIME] > time.time() - DEFAULT_INODE_GRACETIME:
               # The lockfile has changed very recently, so it might have
               # changed since our caller evaluated the validity
               return
       except OSError, err:
           if err.errno == errno.ENOENT:
               return # Someone broke it already
           raise
       os.unlink(self.__lockfile)

    # Note that this function is mostly redundant, the try/except around
    # os.link does almost all of what we do. Instead of lockfile-contents
    # we could use lockfile CTIME, which would probably speed things up.
    def lock(self, timeout=0):
        """Blocks until the lock can be obtained.

        Raises a TimeOutError exception if a positive timeout value is given
        and that time elapses before the lock is obtained.

        This can possibly steal the lock from some other claimant, if the lock 
        lifetime that was written to the file has been exceeded.  Note that
        for this to work across machines, the clocks must be sufficiently
        synchronized.

        """
        if self.locked():
            self.__writelog('already locked')
            raise AlreadyLockedError
        if timeout:
            timeout_time = time.time() + timeout
        last_pid = -1
        self.__write() # Make sure our own lockfile exists
        while 1:
            # create the hard link and test for exactly 2 links to the file
            try:
                os.link(self.__tmpfname, self.__lockfile)
            except OSError, err:
                if err.errno == errno.ENOENT:
                    # Our own lockfile has vanished from beneath us ?
                    self.__write()
                    # sleep anyway, so we dont hog the machine
                    self.__sleep()
                    continue
                elif err.errno == errno.EEXIST:
                    # We failed to get the lock. If we worry about this locking
                    # scheme, we set a variable here and ascertain ourselves of 
                    # its value later, when we think we have the lock.
                    pass
                else:
                    raise

            if os.stat(self.__tmpfname)[ST_NLINK] == 2:
                # we have the lock (since noone overwrote our link to the lock
                # file), so we re-piss our hydrant, to refresh the lock timeout
                self.__write()
                now = time.time()
                if not self.locked():
                    self.__writelog('false positive on lockfile (%f)'%now)
                    self.__sleep()
                    continue
                self.__writelog('got the lock (%f)'%now)
                break
            # we didn't get the lock this time.  let's see if we timed out
            if timeout and timeout_time < time.time():
                os.unlink(self.__tmpfname)
                self.__writelog('timed out')
                raise TimeOutError
            # someone else must have gotten the lock.  let's find out who it
            # is.  if there is some bogosity in the lock file's data then we
            # will try to steal the lock.
            try:
                pid, winner, lockrelease = self.__read()
            except ValueError:
                self.__writelog('invalid lock contents; breaking lock')
                self.__break()
                self.__sleep()
                continue

            # check for hanging processes, by remembering pid, and if it stays
            # the same, checking lockrelease.
            if pid <> last_pid:
                last_pid = pid

            elif lockrelease < time.time():
                # We've waited long enough...
                # Attempt to break the lock. We dont care much about the old
                # winners __tmpfname, let it stay.
                self.__writelog('lock timeout; breaking lock')
                self.__break()
                self.__sleep()
                continue
            # okay, someone else has the lock, we weren't able to steal it,
            # and our claim hasn't timed out yet.  So let's wait a while for
            # the owner of the lock to give it up.
            self.__writelog('waiting for claim')
            self.__sleep()
    
    def __sleep(self):
        time.sleep(self.__sleep_interval)

    # This could error if the lock is stolen.  You must catch it.
    def unlock(self):
        """Unlock the lock.

        If we don't already own the lock (either because of unbalanced unlock
        calls, or because the lock was stolen out from under us), raise a
        NotLockedError.
        """
        if not self.locked():
            raise NotLockedError
        now = time.time()
        os.unlink(self.__lockfile)
        os.unlink(self.__tmpfname)
        self.__writelog('unlocked (%f)'%now)

    def locked(self):
        """Returns 1 if we own the lock, 0 if we do not."""
        if not os.path.exists(self.__tmpfname):
            return 0
        if os.stat(self.__tmpfname)[ST_NLINK] <> 2:
            return 0
        # The above check for nlink should be enough to validate our claim on
        # the lock. Nevertheless, we are paranoid.
        try:
            pid, winner, lockrelease = self.__read()
        except ValueError:
            # the contents of the lock file was corrupted, or the global lock
            # is not a link to our lock.
            if os.stat(self.__lockfile)[ST_INO] <> os.stat(self.__tmpfname)[ST_INO]:
                self.__writelog("nlink of 2 but we don't own the lock ? strangeness")
                os.unlink(self.__tmpfname)
                self.__write()
                return 0
            else:
                # Okay, we really do have the lock, but the contents is messed up
                self.__write()
                return 1
        return pid == os.getpid()

    # use with caution!!!
    def steal(self):
        """Explicitly steal the lock.  USE WITH CAUTION!"""
        self.__writelog('explicitly breaking lock')
        if self.locked():
	    return 1
        try:
            pid, winner, timeout = self.__read()
            os.rename(winner, self.__tmpfname)
        except ValueError:
            return 0
        except OSError, err:
            if err.errno == errno.ENOENT:
                return 0
            raise
        self.__write()
        return self.locked()


--oj4kGyHlBMXGt3Le--