[Mailman-Developers] 2.0.8 - qrunner on solaris

Barry A. Warsaw barry@zope.com
Fri, 29 Mar 2002 14:02:42 -0500


>>>>> "RE" == Rob Ellis <rob@web.ca> writes:

    RE> I'm not sure if this is appropriate here, so apologies in
    RE> advance if it isn't...

Yup, this is the right place.

    RE> We're using Mailman 2.0.8 on a sun ultra with solaris 8,
    RE> and we were running into two problems with qrunner...

    RE> The first was a race condition where this code in qrunner

    RE> - # If the .db file has no corresponding .msg file, we might
    RE> as well - # remove the .db file, since there's little we can
    RE> do about it.  - if not os.path.exists(root+'.msg'): -
    RE> syslog('qrunner', 'Unlinking orphaned .db file: %s.db' % root)
    RE> - os.unlink(root+'.db')

    RE> was unlinking .db files before their corresponding .msg files
    RE> were created. We just removed those lines and it seems to be
    RE> ok -- no orphaned .db files and no more orphaned .msg files.

I don't think this is quite the right fix.  In MM2.0.x, qrunner keys
off the .msg file, and Message.Enqueue() is careful to write the .db
file first and then the .msg file.  You're right that there's a race
condition though, because the test in qrunner's main() is backwards!

Since the .db is written first, followed by the .msg, you really want
to look for orphaned .msg files not .db files.  This is because .db
files may legitimately not have a corresponding .msg file if the
enqueing processes hasn't completed yet.  See the attached patch for a
better solution, which also includes some paranoia guards in
Message.Enqueue() so that it atomically makes the .msg files
available.

Note: this patch is untested, and I don't think I'll get to testing it
today.  I will likely test it next week on python.org (yikes!) and if
it looks good, I will issue a MM2.0.9 with this fix (and a few
others).  I'd appreciate it if someone else could verify this works
and doesn't break anything else.

    RE> The second problem was that qrunner was crashing on some kinds
    RE> of bounce messages (?), causing messages to pile up in the
    RE> queue occasionally. It was generating errors like this:

    RE> The patch below seems to fix the problem (?)... we're now 
    RE> seeing some errors like this:

Good patch.  This can happen if a message doesn't have a Content-Type:
header at all.  This will be in MM2.0.9 as well.

-Barry

-------------------- snip snip --------------------
Index: Mailman/Message.py
===================================================================
RCS file: /cvsroot/mailman/mailman/Mailman/Message.py,v
retrieving revision 1.40
diff -u -r1.40 Message.py
--- Mailman/Message.py	2 Aug 2000 02:00:13 -0000	1.40
+++ Mailman/Message.py	29 Mar 2002 19:00:57 -0000
@@ -177,11 +177,14 @@
         marshal.dump(msgdata, dbfp)
         dbfp.close()
         # If it doesn't already exist, or if the text of the message has
-        # changed, write the message file to disk.
+        # changed, write the message file to disk, but do it atomically so as
+        # not to hit a race condition with the qrunner.
         if not existsp or dirty:
-            msgfp = Utils.open_ex(msgfile, 'w')
+            tmpfile = msgfile + '.tmp'
+            msgfp = Utils.open_ex(tmpfile, 'w')
             msgfp.write(text)
             msgfp.close()
+            os.rename(tmpfile, msgfile)
 
 
 
Index: cron/qrunner
===================================================================
RCS file: /cvsroot/mailman/mailman/cron/Attic/qrunner,v
retrieving revision 1.18.2.3
diff -u -r1.18.2.3 qrunner
--- cron/qrunner	18 Apr 2001 03:58:35 -0000	1.18.2.3
+++ cron/qrunner	29 Mar 2002 19:00:58 -0000
@@ -195,13 +195,11 @@
         lock.refresh()
         root, ext = os.path.splitext(os.path.join(mm_cfg.QUEUE_DIR, file))
         if ext == '.db':
-            # If the .db file has no corresponding .msg file, we might as well
-            # remove the .db file, since there's little we can do about it.
-            if not os.path.exists(root+'.msg'):
-                syslog('qrunner', 'Unlinking orphaned .db file: %s.db' % root)
-                os.unlink(root+'.db')
-            # just trigger off the .msg files
+            # Just trigger off the .msg files
             continue
+        if not os.path.exists(root+'.db'):
+            syslog('qrunner', 'Unlinking orphaned .msg file: %s.msg' % root)
+            os.unlink(root+'.msg')
         # Have we exceeded either resource limit?
         if mm_cfg.QRUNNER_PROCESS_LIFETIME is not None and \
                (time.time() - t0) > mm_cfg.QRUNNER_PROCESS_LIFETIME: