[Mailman-Developers] 2.0.x qrunner bug (bad one)

Barry A. Warsaw barry@zope.com
Tue, 2 Apr 2002 17:45:19 -0500


>>>>> "CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:

    CVR> I've just identified a pretty bad bug in mailman 2.0.x
    CVR> qrunner. It can cause messages to get lost, so, I almost hate
    CVR> to say this, Barry, but it might be time for a quick 2.0.9
    CVR> patch. Given the changes to queuing in 2.1, I think this bug
    CVR> isn't relevant to the 2.1 tree.

You know, I think I just fixed this on Friday, although I didn't get a
chance to check everything into cvs before the holiday weekend.  I
definitely didn't get a chance to test it.

It is a valid bug, and it does warrant a 2.0.9 patch.  The basic bug
is caused by a disagreement on the order of .db and .msg file writing
between qrunner and Message.Enqueue(), as Chuq rightly observes.  I
think the fix is simpler than what Chuq outlines, though.

Message.Enqueue() breaks the race by writing the .db file before it
writes the .msg file, but qrunner's logic is backwards!  It ignores
the .msg files but it should be ignoring the .db files, since they're
written first.

The fix is to qrunner, which should ignore .db files, triggering only
on .msg files.  If it finds a .msg file without a corresponding .db
file, then it should unlink the orphaned .msg file.  The final piece
of the puzzle is that Message.Enqueue() should write the .msg file
atomically, meaning, write it to a tmp file and use rename() to move
it into place atomically.

    CVR> Normally, I'd say "on to 2.1", but since this is a fairly
    CVR> serious "silent data loss" bug AND the fix is trivial, I
    CVR> think it might make sense to patch this and roll 2.0.9. At
    CVR> the least, I think a patch needs to be approved by Barry and
    CVR> released and made visible on the lists.org website.

Everything's checked into cvs now, and I'm about to do some testing.
The more eyeballs on this code, the better, since it is so integral to
the proper operation of the system.  After some off-line stress
testing, I'll foist this patch on python.org, watch the logs for a day
or two, and then do the 2.0.9 release.

    CVR> Please don't ask me how I found this. I'd have to kill
    CVR> you. But this is one of the more obscure bugs I've ever
    CVR> found... (grin)

Indeed.  While I've seen the occasional reports of this for a while,
it's nearly impossible to reproduce, and even with the traffic we see
on python.org/zope.org, I've /never/ seen it there.
    
    CVR> The window of opportunity to trigger it is immensely
    CVR> small. You need two programs to be simultaneously updating
    CVR> the same directory inode, and processing the same slot IN
    CVR> that inode, at the exact same time. We're talking about a
    CVR> latency of, as far as I can tell, 5-15 milliseconds every
    CVR> time post writes a message into qfile, but only if qrunner is
    CVR> actively processing. It looks like Barry took care (from
    CVR> reading the source) to avoid this kind of situation -- but
    CVR> didn't quite lock the window closed.

Unless I still haven't recovered from Maryland's glorious and
long-awaited victory last night[*], I'm surprised this one snuck past
us for so long.  It jumped right out at me when I reviewed the code
again.  Sigh.

Fearing-the-turtle-ly y'rs,
-Barry

[*] NCAA (college) men's basketball national champs.