[Python-Dev] thread semantics for file objects

Fri Mar 18 00:14:46 CET 2005

[Jeremy Hylton]
...
> Universal newline reads and get_line() both lock the stream if the
> platform supports it.  So I expect that they are atomic on those
> platforms.

Well, certainly not get_line().  That locks and unlocks the stream
_inside_ an enclosing for-loop.  Looks quite possible for different
threads to read different parts of "the same line" if multiple threads
are trying to do get_line() simultaneously.  It releases the GIL
inside the for-loop too, so other threads _can_ sneak in.

We put a lot of work into speeding those getc()-in-a-loop functions. 
There was undocumented agreement at the time that they "should be"
thread-safe in this sense:  provided the platform C stdio wasn't
thread-braindead, then if you had N threads all simultaneously reading
a file object containing B bytes, while nobody wrote to that file
object, then the total number of bytes seen by all N threads would sum
to B at the time they all saw EOF.  This was a much stronger guarantee
than Perl provided at the time (and, for all I know, still provides),
and we (at least I) wrote little test programs at the time
demonstrating that the total number of bytes Perl saw in this case was
unpredictable, while Python's did sum to B.
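
For illustration only, here is a minimal sketch of that kind of test
program, written against a modern Python 3 buffered binary file rather
than the 2005 fileobject.c code discussed here.  The names
count_bytes_seen and make_test_file are hypothetical, and the sketch
assumes that the io module's buffered binary objects guard their
internal state with a lock.

```python
import os
import tempfile
import threading


def make_test_file(n_bytes):
    """Create a temporary file holding exactly n_bytes bytes; return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * n_bytes)
    return path


def count_bytes_seen(path, n_threads=4, chunk=97):
    """Total bytes read by n_threads that all share one file object."""
    seen = [0] * n_threads  # one slot per thread, so the tallies need no lock

    with open(path, "rb") as f:  # a single shared buffered binary file object
        def reader(i):
            while True:
                data = f.read(chunk)
                if not data:  # this thread has seen EOF
                    return
                seen[i] += len(data)

        threads = [threading.Thread(target=reader, args=(i,))
                   for i in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    return sum(seen)
```

If the property described above holds, count_bytes_seen(path) equals the
file's size once every thread has seen EOF, however the reads happened
to interleave.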

Of course Perl didn't document any of this either, and in Pythonland it
was clearly specific to the horrid tricks in CPython's fileobject.c.

> But it certainly seems safe to conclude this is a quality of
> implementation issue.

Or a sheer pigheadedness-of-implementor issue <wink>.

>  Otherwise, why bother with the flockfile() at all, right?  Or is there some
> correctness issue I'm not seeing that requires the locking for some basic
> safety in the implementation.

There are correctness issues, but we still ignore them; locking
relieves, but doesn't solve, them.  For example, C doesn't (and POSIX
doesn't either!) define what happens if you mix reads with writes on a
file opened for update unless a file-positioning operation (like seek)
intervenes, and that's pretty easy for threads to run afoul of. 
Python does nothing to stop you from trying, and behavior if you do is
truly all over the map across boxes.  IIRC, one of the multi-threaded
test programs I mentioned above provoked ugly death in the bowels of
MS's I/O libraries when I threw an undisciplined writer thread into
the mix too.  This was reported to MS, and their response was "so
don't do that -- it's undefined".  Locking the stream at least cuts down
the chance of that happening, although that's not the primary reason
for it.
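
As a rough single-threaded sketch of the rule about update streams, the
disciplined pattern puts a positioning call between a write and the
read that follows it, and again before switching back to writing.  The
function name write_then_read is hypothetical.  Python 3's io layer does
its own buffering, so this shows the discipline and does not reproduce
the undefined C stdio behavior.

```python
import os
import tempfile


def write_then_read(payload=b"hello, world\n"):
    """Mix writes and reads on one update-mode file, seeking at each switch."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w+b") as f:  # opened for update (read and write)
            f.write(payload)
            f.seek(0)                    # positioning call before reading
            first = f.read(5)
            f.seek(0, os.SEEK_END)       # positioning call before writing again
            f.write(b"more\n")
            f.seek(0)
            return first, f.read()
    finally:
        os.remove(path)                  # the file is already closed here
```

With threads in the picture, nothing keeps one thread's write from
landing between another thread's seek and read unless the callers
coordinate, which is why this is easy to get wrong.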

Heck, we still have a years-open critical bug against segfaults when
one thread tries to close a file object while another thread is
reading from it, right?

>>>     And even using a lock is stupid.

>> ZODB's FileStorage is bristling with locks protecting multi-threaded
>> access to file objects, therefore that can't be stupid.  QED

> Using a lock seemed like a good idea there and still seems like a good
> idea now :-).

Damn straight, and we're certain it has nothing to do with those large
runs of NUL bytes that sometimes overwrite people's critical data for
no reason at all <wink>.
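
For readers who want to see the shape of the locking being joked about,
here is a minimal sketch.  It is not ZODB's FileStorage code; the class
LockedFile and the function demo are hypothetical names.  The idea is
only that each seek-plus-read or seek-plus-write pair runs as one unit
under a single lock, and that close() takes the same lock so it cannot
run in the middle of another thread's read.

```python
import os
import tempfile
import threading


class LockedFile:
    """Hypothetical wrapper: one lock serializes all access to one file object."""

    def __init__(self, path):
        self._f = open(path, "r+b")
        self._lock = threading.Lock()

    def write_at(self, offset, data):
        with self._lock:  # seek and write happen as one unit
            self._f.seek(offset)
            self._f.write(data)
            self._f.flush()

    def read_at(self, offset, size):
        with self._lock:  # seek and read happen as one unit
            self._f.seek(offset)
            return self._f.read(size)

    def close(self):
        with self._lock:  # never close while a read or write is in progress
            self._f.close()


def demo(n_threads=8, record_size=16):
    """Each thread writes one fixed-size record; return the records read back."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    lf = LockedFile(path)
    try:
        def worker(i):
            lf.write_at(i * record_size, bytes([65 + i]) * record_size)

        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return [lf.read_at(i * record_size, record_size)
                for i in range(n_threads)]
    finally:
        lf.close()
        os.remove(path)
```

Without the lock, one thread's seek could be followed by another
thread's seek before the first thread's write, which would put a record
at the wrong offset.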