[ python-Bugs-1599254 ] mailbox: other programs' messages can vanish without trace

Fri Jan 19 16:24:41 CET 2007

Bugs item #1599254, was opened at 2006-11-19 11:03
Message generated for change (Comment added) made by akuchling
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599254&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.5
Status: Open
Resolution: None
Priority: 9
Private: No
Submitted By: David Watson (baikie)
Assigned to: A.M. Kuchling (akuchling)
Summary: mailbox: other programs' messages can vanish without trace

Initial Comment:
The mailbox classes based on _singleFileMailbox (mbox, MMDF, Babyl) implement the flush() method by writing the new mailbox contents into a temporary file which is then renamed over the original. Unfortunately, if another program tries to deliver messages while mailbox.py is working, and uses only fcntl() locking, it will have the old file open and be blocked waiting for the lock to become available. Once mailbox.py has replaced the old file and closed it, making the lock available, the other program will write its messages into the now-deleted "old" file, consigning them to oblivion.

I've caused Postfix on Linux to lose mail this way (although I did have to turn off its use of dot-locking to do so).
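
The failure described above can be reproduced without a mail server.  The following is a minimal sketch for a POSIX system, not the code from mailbox.py: the function name and file names are made up for illustration, and the locking itself is left out so that only the effect of the rename is visible.

```python
import os
import tempfile

def demonstrate_lost_delivery(directory):
    """Return the mailbox contents after a rename-based rewrite races a writer."""
    path = os.path.join(directory, "mbox")
    with open(path, "w") as f:
        f.write("old message\n")

    # The delivering program opens the mailbox and would now block on the lock.
    writer = open(path, "a")

    # Meanwhile the mailbox is rewritten into a temporary file, which is
    # then renamed over the original path.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as tmp:
        tmp.write("old message\n")
    os.replace(tmp_path, path)

    # The lock becomes available; the writer appends to the file it opened,
    # which no longer has a name in the directory.
    writer.write("new message\n")
    writer.close()

    with open(path) as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    surviving = demonstrate_lost_delivery(d)
```

On a POSIX filesystem the file at the mailbox path ends up holding only the old message; the delivered one went to the unlinked file.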

A possible fix is attached.  Instead of new_file being renamed, its contents are copied back to the original file.  If file.truncate() is available, the mailbox is then truncated to size.  Otherwise, if truncation is required, it's truncated to zero length beforehand by reopening self._path with mode wb+.  In the latter case, there's a check to see if the mailbox was replaced while we weren't looking, but there's still a race condition.  Any alternative ideas?
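
As a rough illustration of the copy-back idea (a sketch only, assuming file.truncate() is available; copy_back is a hypothetical helper, not the function in the attached patch), the new contents are written into the original file and the leftover tail is cut off, so the path keeps referring to the same file throughout:

```python
import os
import shutil
import tempfile

def copy_back(mailbox_file, new_file):
    """Overwrite mailbox_file in place with the contents of new_file."""
    new_file.seek(0)
    mailbox_file.seek(0)
    shutil.copyfileobj(new_file, mailbox_file)
    # Cut off whatever is left of the old, longer contents.
    mailbox_file.truncate()
    mailbox_file.flush()

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "mbox")
    with open(path, "wb") as f:
        f.write(b"message one\nmessage two\n")
    inode_before = os.stat(path).st_ino
    with open(path, "rb+") as mailbox_file, tempfile.TemporaryFile() as new_file:
        new_file.write(b"message two\n")
        copy_back(mailbox_file, new_file)
    inode_after = os.stat(path).st_ino
    with open(path, "rb") as f:
        contents = f.read()
```

Because the file is never replaced, a program blocked on an fcntl() lock still holds a descriptor for the live mailbox when the lock is released.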

Incidentally, this fixes a problem whereby Postfix wouldn't deliver to the replacement file as it had the execute bit set.

----------------------------------------------------------------------

>Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-19 10:24

Message:
Logged In: YES 
user_id=11375
Originator: NO

After reflection, I don't think the potential for keys changing actually
makes things any worse.  _generate_toc() always renumbers keys from the
beginning, so if a message's key changes because of lock()'s re-reading,
that means someone else has already modified the mailbox.  Without the ToC
clearing, you're already fated to have a corrupted mailbox because the new
mailbox will be written using outdated file offsets.  With the ToC
clearing, you delete the wrong message.  Neither outcome is good, but data
is lost either way.

The new behaviour is maybe a little bit better in that you're losing a
single message but still generating a well-formed mailbox, and not a
randomly jumbled mailbox.
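
To make the "outdated file offsets" case concrete, here is a toy illustration.  The byte strings and the read_messages helper are invented for this example and the table of contents is simplified to (start, stop) pairs; it is not how mailbox.py stores its data.

```python
def read_messages(data, toc):
    """Slice messages out of data using a table of (start, stop) offsets."""
    return {key: data[start:stop] for key, (start, stop) in toc.items()}

original = b"From A\nbody-a\nFrom B\nbody-b\n"
toc = {0: (0, 14), 1: (14, 28)}          # built while the file was unmodified

# Another process deletes the first message while we are not looking.
modified = original[14:]

fresh = read_messages(original, toc)
stale = read_messages(modified, toc)     # same offsets, different file
```

With the stale offsets, key 0 now yields the second message and key 1 yields nothing, which is the kind of jumbling a flush() based on an old table of contents would write out.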

I suggest applying the patch to clear self._toc, and noting in the
documentation that keys might possibly change after doing a lock().

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2007-01-18 13:15

Message:
Logged In: YES 
user_id=1504904
Originator: YES

This version passes the new tests (it fixes the length checking bug,
and no longer clears self._toc), but goes back to failing
test_concurrent_add.

File Added: mailbox-unified2-module.diff

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2007-01-18 13:14

Message:
Logged In: YES 
user_id=1504904
Originator: YES

Unfortunately, there is a problem with clearing _toc: it's basically
the one I alluded to in my comment of 2006-12-16.  Back then I thought
it could be caught by the test you issue the warning for, but the
application may instead do its second remove() *after* the lock() (so
that self._pending is not set at lock() time), using the key carried
over from before it called unlock().  As before, this would result in
the wrong message being deleted.

I've added a test case for this (diff attached), and a bug I found in
the process whereby flush() wasn't updating the length, which could
cause subsequent flushes to fail (I've got a fix for this).  These
seem to have turned up a third bug in the MH class, but I haven't
looked at that yet.

File Added: mailbox-unified2-test.diff

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-17 16:06

Message:
Logged In: YES 
user_id=11375
Originator: NO

Add mailbox-unified-patch.

File Added: mailbox-unified-patch.diff

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-17 16:05

Message:
Logged In: YES 
user_id=11375
Originator: NO

mailbox-pending-lock is the difference between David's copy-back-new +
fcntl-warning and my -unified patch, uploaded so that he can comment on the
changes.

The only change is to make _singleFileMailbox.lock() clear self._toc,
forcing a re-read of the mailbox file.  If _pending is true, the ToC isn't
cleared and a warning is logged.  I think this lets existing code run
(albeit possibly with a warning if the mailbox is modified before .lock()
is called), but fixes the risk of missing changes written by another
process.
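
The behaviour described above might look roughly like the following.  This is a hypothetical stand-in class, not the code in mailbox-pending-lock.diff: ToyMailbox, its scan counter and the warning text are all assumptions made for illustration, and acquiring the actual lock is omitted.

```python
import warnings

class ToyMailbox:
    """Hypothetical stand-in for a single-file mailbox; not the real class."""

    def __init__(self):
        self._toc = None        # table of contents, built lazily
        self._pending = False   # True when there are unflushed changes
        self.scans = 0          # how many times the file was (re)scanned

    def _lookup(self):
        if self._toc is None:
            self.scans += 1
            self._toc = {0: (0, 10)}   # placeholder for scanning the file
        return self._toc

    def lock(self):
        # Acquiring the real lock is omitted in this sketch.
        if self._pending:
            warnings.warn("mailbox modified before lock(); "
                          "table of contents not refreshed")
        else:
            self._toc = None           # force a re-read on next lookup

clean = ToyMailbox()
clean._lookup()
clean.lock()
clean._lookup()                        # rescans after lock()

fresh = ToyMailbox()
fresh.lock()                           # locked before any lookup
fresh._lookup()                        # only one scan in total

dirty = ToyMailbox()
dirty._lookup()
dirty._pending = True
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    dirty.lock()
```

The three objects correspond to the cases discussed in this comment: a redundant rescan, no extra read when locking first, and a warning when changes are pending.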

Triggering a new warning is sort of an API change, but IMHO it's still
worth backporting; code that triggers this warning is at risk of losing
messages or corrupting the mailbox.

Clearing the _toc on .lock() is also sort of an API change, but it's
necessary to avoid the potential data loss.  It may lead to some redundant
scanning of mailbox files, but this price is worth paying, I think; people
probably don't have 2Gb mbox files (I hope not, anyway!) and no extra read
is done if you create the mailbox and immediately lock it before looking
anything up.

Neal: if you want to discuss this patch or want an explanation of
something, feel free to chat with me about it.

I'll wait a day or two and see if David spots any problems.  If nothing
turns up, I'll commit it to both trunk and release25-maint.

File Added: mailbox-pending-lock.diff

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-17 15:53

Message:
Logged In: YES 
user_id=11375
Originator: NO

mailbox-unified-patch contains David's copy-back-new and fcntl-warn
patches, plus the test-mailbox patch and some additional changes to
mailbox.py from me.  (I'll upload a diff to David's diffs in a minute.)

This is the change I'd like to check in.  test_mailbox.py now passes, as
does the mailbox-break.py script I'm using.

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-17 14:56

Message:
Logged In: YES 
user_id=11375
Originator: NO

Committed a modified version of the doc. patch in rev. 53472 (trunk) and
rev. 53474 (release25-maint).

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2007-01-17 01:48

Message:
Logged In: YES 
user_id=33168
Originator: NO

Andrew, do you need any help with this?

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-15 14:01

Message:
Logged In: YES 
user_id=11375
Originator: NO

Comment from Andrew MacIntyre (os2vacpp is the OS/2 port that lacks
ftruncate()):

================

I actively support the OS/2 EMX port (sys.platform == "os2emx"; build
directory is PC/os2emx).  I would like to keep the VAC++ port alive, but
the reality is I don't have the resources to do so.  The VAC++ port was
the subject of discussion about removal of build support from the
source tree for 2.6 - I don't recall there being a definitive outcome,
but if someone does delete the PC/os2vacpp directory, I'm not in a
position to argue.

AMK: (mailbox.py has a separate section of code used when file.truncate()
isn't available, and the existence of this section is bedevilling me.
It would be convenient if platforms without file.truncate() weren't a
factor; then this branch could just be removed.  In your opinion,
would it hurt OS/2 users of mailbox.py if support for platforms
without file.truncate() was removed?)

aimacintyre: No.  From what documentation I can quickly check, ftruncate()
operates on file descriptors rather than FILE pointers.  As such I am sure
that, if it became an issue, it would not be difficult to write an
ftruncate() emulation wrapper for the underlying OS/2 APIs that implement
the required functionality.

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2007-01-13 13:32

Message:
Logged In: YES 
user_id=1504904
Originator: YES

I like the warning idea - it seems appropriate if the problem is
relatively rare.  How about this?

File Added: mailbox-fcntl-warn.diff

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-12 14:41

Message:
Logged In: YES 
user_id=11375
Originator: NO

One OS/2 port lacks truncate(), and so does RISCOS.

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-12 13:41

Message:
Logged In: YES 
user_id=11375
Originator: NO

I realized that making flush() invalidate keys breaks the final example in
the docs, which loops over inbox.iterkeys() and removes messages, doing a
pack() after each message.

Which platforms lack file.truncate()?  Windows has it; POSIX has it, so
modern Unix variants should all have it.  Maybe mailbox should simply raise
an exception (or trigger a warning?) if truncate is missing, and we should
then assume that flush() has no effect upon keys.

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-12 12:12

Message:
Logged In: YES 
user_id=11375
Originator: NO

So shall we document flush() as invalidating keys, then?

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2007-01-06 14:57

Message:
Logged In: YES 
user_id=1504904
Originator: YES

Oops, length checking had made the first two lines of this patch
redundant; update-toc applies OK with fuzz.

File Added: mailbox-copy-back-new.diff

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2007-01-06 10:30

Message:
Logged In: YES 
user_id=1504904
Originator: YES

File Added: mailbox-copy-back-53287.diff

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2007-01-06 10:24

Message:
Logged In: YES 
user_id=1504904
Originator: YES

Aack, yes, that should be _next_user_key.  Attaching a fixed version.

I've been thinking, though: flush() does in fact invalidate the keys
on platforms without a file.truncate(), when the fcntl() lock is
momentarily released afterwards.  It seems hard to avoid this as,
perversely, fcntl() locks are supposed to be released automatically on
all file descriptors referring to the file whenever the process closes
any one of them - even one the lock was never set on.

So, code using mailbox.py on such platforms could inadvertently be
carrying keys across an unlocked period, which is not made safe by the
update-toc patch (as it's only meant to avert disasters resulting from
doing this *and* rebuilding the table of contents, *assuming* that
another process hasn't deleted or rearranged messages).
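
The fcntl() behaviour mentioned above can be observed with a small experiment.  This is a sketch for POSIX systems only (it relies on os.fork() and the fcntl module); other_process_can_lock is a hypothetical helper written for this illustration.

```python
import fcntl
import os
import tempfile

def other_process_can_lock(path):
    """Fork a child and report whether it can take an exclusive fcntl() lock."""
    pid = os.fork()
    if pid == 0:  # child process
        status = 1
        try:
            with open(path, "rb+") as f:
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            status = 0
        except OSError:
            status = 1
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0

with tempfile.NamedTemporaryFile() as locked:
    fcntl.lockf(locked, fcntl.LOCK_EX)
    blocked_before = not other_process_can_lock(locked.name)

    # Open and close a second descriptor for the same file.  The lock was
    # never set on this descriptor, yet closing it drops the lock.
    open(locked.name, "rb").close()
    blocked_after = not other_process_can_lock(locked.name)
```

On systems with POSIX record-lock semantics, the child is refused the lock the first time and granted it the second time, even though the original descriptor is still open.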

File Added: mailbox-update-toc-fixed.diff

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-05 14:51

Message:
Logged In: YES 
user_id=11375
Originator: NO

Question about mailbox-update-toc: the add() method still returns
self._next_key - 1; should this be self._next_user_key - 1?  The keys in
_user_toc are the ones returned to external users of the mailbox, right?

(A good test case would be to initialize _next_key to 0 and _next_user_key
to a different value like 123456.)

I'm still staring at the patch, trying to convince myself that it will
help -- haven't spotted any problems, but this bug is making me nervous...

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-05 14:24

Message:
Logged In: YES 
user_id=11375
Originator: NO

As a step toward improving matters, I've attached the suggested doc patch
(for both 25-maint and trunk).  It encourages people to use Maildir :),
explicitly states that modifications should be bracketed by lock(), and
fixes the examples to match.

It does not say that keys are invalidated by doing a flush(), because
we're going to try to avoid the necessity for that.

File Added: mailbox-docs.diff

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2006-12-20 14:48

Message:
Logged In: YES 
user_id=11375
Originator: NO

Committed length-checking.diff to trunk in rev. 53110.

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2006-12-20 14:19

Message:
Logged In: YES 
user_id=1504904
Originator: YES

File Added: mailbox-test-lock.diff

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2006-12-20 14:17

Message:
Logged In: YES 
user_id=1504904
Originator: YES

Yeah, I think that should definitely go in.  ExternalClashError or a
subclass sounds fine to me (although you could make a whole taxonomy
of these errors, really).  It would be good to have the code actually
keep up with other programs' changes, though; a program might just
want to count the messages at first, say, and not make changes until
much later.

I've been trying out the second option (patch attached, to apply on
top of mailbox-copy-back), regenerating _toc on locking, but
preserving existing keys.  The patch allows existing _generate_toc()s
to work unmodified, but means that _toc now holds the entire last
known contents of the mailbox file, with the 'pending' (user-visible)
mailbox state being held in a new attribute, _user_toc, which is a
mapping from keys issued to the program to the keys of _toc
(i.e. sequence numbers in the file).  When _toc is updated, any new
messages that have appeared are given keys in _user_toc that haven't
been issued before, and any messages that have disappeared are removed
from it.  The code basically assumes that messages with the same
sequence number are the same message, though, so even if most cases
are caught by the length check, programs that make
deletions/replacements before locking could still delete the wrong
messages.  This behaviour could be trapped, though, by raising an
exception in lock() if self._pending is set (after all, code like that
would be incorrect unless it could be assumed that the mailbox module
kept hashes of each message or something).
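
A much simplified sketch of the _user_toc idea follows.  The class name and update() method are hypothetical, and the real patch also has to track file offsets and pending changes; the sketch only shows keys being issued once and never reused, under the same assumption the comment names (that a message with the same sequence number is the same message).

```python
class UserKeyMap:
    """Hypothetical sketch: maps keys issued to the program onto
    sequence numbers in the mailbox file."""

    def __init__(self):
        self._user_toc = {}        # user-visible key -> sequence number
        self._next_user_key = 0    # never decreases, so keys are not reused

    def update(self, message_count):
        """Reconcile with a freshly scanned file holding message_count messages."""
        # Messages that have disappeared lose their keys.
        for key, seq in list(self._user_toc.items()):
            if seq >= message_count:
                del self._user_toc[key]
        # Messages that have appeared get keys never issued before.
        known = set(self._user_toc.values())
        for seq in range(message_count):
            if seq not in known:
                self._user_toc[self._next_user_key] = seq
                self._next_user_key += 1
        return dict(self._user_toc)

keys = UserKeyMap()
first = keys.update(3)      # three messages found on first scan
shrunk = keys.update(2)     # last message vanished
regrown = keys.update(3)    # a new third message was delivered
```

Note that the sketch cannot tell a deleted-and-replaced message from an unchanged one, which is the weakness discussed above.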

Also attached is a patch to the test case, adding a lock/unlock around
the message count to make sure _toc is up-to-date if the parent
process finishes first; without it, there are still intermittent
failures.

File Added: mailbox-update-toc.diff

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2006-12-20 09:46

Message:
Logged In: YES 
user_id=11375
Originator: NO

Attaching a patch that adds length checking: before doing a flush() on a
single-file mailbox, seek to the end and verify its length is unchanged. 
It raises an ExternalClashError if the file's length has changed.  (Should
there be a different exception for this case, perhaps a subclass of
ExternalClashError?)
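
In outline, the check could look like this.  It is a sketch rather than the contents of length-checking.diff: check_length_unchanged is a hypothetical helper, an in-memory file stands in for the mailbox, and only mailbox.ExternalClashError is taken from the real module.

```python
import io
import mailbox

def check_length_unchanged(f, expected_length):
    """Raise ExternalClashError if the file no longer has the recorded length."""
    f.seek(0, 2)                       # seek to the end of the file
    current_length = f.tell()
    if current_length != expected_length:
        raise mailbox.ExternalClashError(
            "Size of mailbox file changed (expected %d, found %d)"
            % (expected_length, current_length))

f = io.BytesIO(b"From A\nbody\n")
recorded_length = len(f.getvalue())
check_length_unchanged(f, recorded_length)     # unchanged: passes silently

f.write(b"From B\nbody\n")                     # another program delivers mail
try:
    check_length_unchanged(f, recorded_length)
    clash_detected = False
except mailbox.ExternalClashError:
    clash_detected = True
```

A length check catches appended or removed data, but not a modification that happens to leave the file the same size.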

I verified that this change works by running a program that added 25
messages, pausing between each one, and then did 'echo "new line" >
/tmp/mbox' from a shell while the program was running.

I also noticed that the self._lookup() call in self.flush() wasn't
necessary, and replaced it by an assertion.

I think this change should go on both the trunk and 25-maint branches.

File Added: length-checking.diff

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2006-12-18 12:43

Message:
Logged In: YES 
user_id=11375
Originator: NO

Eep, correct; changing the key IDs would be a big problem for existing
code.  We could say 'discard all keys' after doing lock() or unlock(), but
this is an API change that means the fix couldn't be backported to
2.5-maint.

We could make generating the ToC more complicated, preserving key IDs when
possible; that may not be too difficult, though the code might be messy.

Maybe it's best to just catch this error condition: save the size of the
mailbox, updating it in _append_message(), and then make .flush() raise an
exception if the mailbox size has unexpectedly changed.

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2006-12-16 14:09

Message:
Logged In: YES 
user_id=1504904
Originator: YES

Yes, I see what you mean.  I had tried multiple flushes, but only
inside a single lock/unlock.  But this means that in the no-truncate()
code path, even this is currently unsafe, as the lock is momentarily
released after flushing.

I think _toc should be regenerated after every lock(), as with the
risk of another process replacing/deleting/rearranging the messages,
it isn't valid to carry sequence numbers from one locked period to
another anyway, or from unlocked to locked.  However, this runs the
risk of dangerously breaking code that thinks it is valid to do so,
even in the case where the mailbox was *not* modified (i.e. turning
possible failure into certain failure).  For instance, if the program
removes message 1, then as things stand, the key "1" is no longer
used, and removing message 2 will remove the message that followed 1.
If _toc is regenerated in between, however (using the current code, so
that the messages are renumbered from 0), then the old message 2
becomes message 1, and removing message 2 will therefore remove the
wrong message.  You'd also have things like pending deletions and
replacements (also unsafe generally) being forgotten.  So it would
take some work to get right, if it's to work at all...
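
The renumbering hazard can be shown with plain dictionaries.  In this toy example generate_toc is a hypothetical stand-in that numbers messages from 0, and the message names are placeholders.

```python
def generate_toc(messages):
    """Number the messages from 0, as a fresh scan of the file would."""
    return dict(enumerate(messages))

toc = generate_toc(["A", "B", "C", "D"])   # keys 0, 1, 2, 3

del toc[1]                  # the program removes message 1 ("B")
intended = toc[2]           # key 2 still refers to "C"

# The table of contents is regenerated after the deletion is written out.
toc = generate_toc(toc.values())
actual = toc[2]             # key 2 now refers to a different message
```

A program that held on to key 2 and removed it after the regeneration would delete "D" instead of "C".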

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2006-12-15 09:06

Message:
Logged In: YES 
user_id=11375
Originator: NO

I'm testing the fix using two Python processes running mailbox.py, and my
test case fails even with your patch.  This is due to another bug, even in
the patched version.  

mbox has a dictionary attribute, _toc, mapping message keys to positions
in the file.  flush() writes out all the messages in self._toc and
constructs a new _toc with the new file offsets.  It doesn't re-read the
file to see if new messages were added by another process.

One fix that seems to work: instead of doing 'self._toc = new_toc' after
flush() has done its work, do self._toc = None.  The ToC will be
regenerated the next time _lookup() is called, causing a re-read of all the
contents of the mbox.  Inefficient, but I see no way around the necessity
for doing this.

It's not clear to me that my suggested fix is enough, though.  Process #1
opens a mailbox, reads the ToC, and the process does something else for 5
minutes.  In the meantime, process #2 adds a message to the mbox.  Process #1
then adds a message to the mbox and writes it out; it never notices process
#2's change.

Maybe the _toc has to be regenerated every time you call lock(), because
at this point you know there will be no further updates to the mbox by any
other process.  Any unlocked usage of _toc should also really be
regenerating _toc every time, because you never know if another process has
added a message... but that would be really inefficient.

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2006-12-15 08:17

Message:
Logged In: YES 
user_id=11375
Originator: NO

The attached patch adds a test case to test_mailbox.py that demonstrates
the problem.  No modifications to mailbox.py are needed to show data loss.

Now looking at the patch...

File Added: mailbox-test.patch

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2006-12-12 16:04

Message:
Logged In: YES 
user_id=11375
Originator: NO

I agree with David's analysis; this is in fact a bug.  I'll try to look at
the patch.

----------------------------------------------------------------------

Comment By: David Watson (baikie)
Date: 2006-11-19 15:44

Message:
Logged In: YES 
user_id=1504904
Originator: YES

This is a bug.  The point is that the code is subverting the protection of
its own fcntl locking.  I should have pointed out that Postfix was still
using fcntl locking, and that should have been sufficient.  (In fact, it
was due to its use of fcntl locking that it chose precisely the wrong
moment to deliver mail.)  Dot-locking does protect against this, but not
every program uses it - which is precisely the reason that the code
implements fcntl locking in the first place.
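
For comparison, dot-locking works on the file name rather than on an open file, which is why it is unaffected by the rename.  The following is a minimal sketch of the general technique, assuming a lock file named after the mailbox with a ".lock" suffix; the dot_lock helper is hypothetical and simpler than what mailbox.py or mail delivery agents actually do.

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def dot_lock(path):
    """Hold path + '.lock' for the duration of the with block."""
    lock_path = path + ".lock"
    # O_EXCL makes creation fail if another holder already made the file.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)

with tempfile.TemporaryDirectory() as d:
    mbox_path = os.path.join(d, "mbox")
    with dot_lock(mbox_path):
        try:
            with dot_lock(mbox_path):
                second_acquired = True
        except FileExistsError:
            second_acquired = False
    released = not os.path.exists(mbox_path + ".lock")
```

The lock only protects against programs that also look for the lock file, which is the gap the fcntl() locking in mailbox.py is meant to cover.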

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-11-19 15:02

Message:
Logged In: YES 
user_id=21627
Originator: NO

Mailbox locking was invented precisely to support this kind of operation.
Why do you complain that things break if you deliberately turn off the
mechanism preventing breakage?

I fail to see a bug here.

----------------------------------------------------------------------
