mixing for x in file: and file.readline

Fri Sep 12 07:07:39 EDT 2003

On Thu, Sep 11, 2003 at 01:54:53PM -0700, Russell E. Owen wrote:
> At one time, mixing for x in file and readline was dangerous. For 
> example:
> 
> for line in file:
>   # read some lines from a file, then break
> nextline = file.readline() # bad
> 
> would not do what a naive user might expect because the file iterator 
> buffered data and readline did not read from that buffer. Hence the call 
> to readline might unexpectedly skip some lines.
> 
> I stumbled across this the hard way, but am wondering if it's still 
> present in Python 2.3. 

Yes. 

After you start reading a file with 'for' or iter() the current file 
position is undefined unless you continue to the end of the file. This 
means that once you start you shouldn't use the read(), readline() or 
tell() methods unless you first seek() to a well-defined position.
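
A minimal sketch of the "seek() first" rule, assuming the file is opened 
in binary mode so that len(line) counts bytes exactly. The helper name 
readline_after_loop and the stop predicate are made up for illustration. 
The loop counts the bytes the iterator has handed out and seeks to that 
offset before touching readline():

```python
def readline_after_loop(path, stop):
    """Iterate until stop(line) is true, then return the next line via readline()."""
    with open(path, "rb") as f:
        start = f.tell()           # well-defined: nothing has been read yet
        consumed = 0
        for line in f:
            consumed += len(line)  # bytes handed out by the iterator so far
            if stop(line):
                break
        # Reposition explicitly instead of trusting the position left by iteration.
        f.seek(start + consumed)
        return f.readline()
```

If the loop runs to the end of the file, the seek lands at EOF and 
readline() returns an empty string.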

The readline() and read() methods use the buffered I/O operations supplied 
by the underlying C library. You can safely intermix read() and readline() 
as well as tell()ing and seek()ing around without encountering any 
unexpected behavior. You can even mix read operations on the same file 
from Python code and stdio calls from an extension module (after getting
the FILE* object using PyFile_AsFile).  
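
Since readline() calls mix safely with each other, one way to keep a loop 
and still call readline() afterwards is to drive the loop with readline() 
itself. This is only a sketch: next_line_after and the marker argument are 
hypothetical names, and the loop gives up the speed of file iteration.

```python
def next_line_after(path, marker):
    """Return the line that follows the first line starting with marker."""
    with open(path) as f:
        # iter(callable, sentinel) calls f.readline() until it returns "",
        # so this loop never goes through the file's own line iterator.
        for line in iter(f.readline, ""):
            if line.startswith(marker):
                break
        # Still on the readline() side, so this picks up exactly where the loop stopped.
        return f.readline()
```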

File iteration uses its own buffering for performance. Guido has declared
that "for line in fileobj:" should always be the fastest way to read an
entire file line by line. You just can't do that with the crappy stdio
implementations out there without adding your own buffering layer. Once
you do that it is out of sync with the FILE* object's idea of the current 
file position. 

In Python 2.2 if you break in the middle of the loop the temporary 
iterator object (xreadlines) is lost along with its readahead buffer, 
leaving you at an unknown file position. The only things you can do are 
to close the file or seek. In Python 2.3 the file object IS an iterator 
(rather than HAS an iterator), so while the current file position is 
undefined from a read/readline/tell point of view, the iterator state is
still consistent: you can immediately use it in another for loop to 
continue from the same position, or even call its next() method directly.
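
A small sketch of staying on the iterator side after a break. The helper 
split_at and its marker argument are invented for the example. It is 
written with the next() builtin, which Python 2.3 does not have; there the 
equivalent call is the file object's next() method.

```python
def split_at(path, marker):
    """Return (lines before marker, line right after marker, remaining lines)."""
    with open(path) as f:
        head = []
        for line in f:
            if line.startswith(marker):
                break
            head.append(line)
        # The iterator state is still consistent, so keep using the iterator
        # rather than read()/readline().
        following = next(f, "")   # "" if the marker was the last line or absent
        rest = list(f)            # a second pass over f resumes where the first stopped
        return head, following, rest
```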

> Anyone know if it's still an issue? If so, anyone have any idea how hard 
> it would be to fix? I'm willing to work on a patch, but would probably 
> need some help. And if experts have already determined it's too hard, 
> and are willing to explain, I'd love some idea of why that is.

Really fixing it amounts to reimplementing the entire I/O layer of 
Python with a different strategy and thoroughly testing on multiple 
platforms. 

It's possible to hide the problem in most cases by making read and 
readline use the iteration readahead buffer if it's attached to the file
object and stdio if it isn't. I don't think it's a good idea. It will
require some hairy code and seems susceptible to subtle bugs and
corner cases.
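
To give a rough idea of what sharing the readahead buffer means, here is a 
toy reader in which iteration, readline() and read() all draw from one 
buffer. It is not how the file object is implemented. SharedBufferReader 
and chunk_size are hypothetical names, and the sketch covers only the easy 
part: it has no tell() or seek(), and nothing keeps it in step with a 
FILE* that other code might be reading.

```python
import io

class SharedBufferReader:
    """Toy reader: iteration, readline() and read() share one readahead buffer."""

    def __init__(self, raw, chunk_size=8192):
        self._raw = raw                # any object whose read(n) returns bytes
        self._chunk_size = chunk_size
        self._buf = b""                # data read ahead but not yet handed out

    def readline(self):
        while True:
            end = self._buf.find(b"\n")
            if end != -1:
                line, self._buf = self._buf[:end + 1], self._buf[end + 1:]
                return line
            chunk = self._raw.read(self._chunk_size)
            if not chunk:              # EOF: hand out whatever is left
                line, self._buf = self._buf, b""
                return line
            self._buf += chunk

    def read(self, size=-1):
        if size < 0:                   # everything: buffered data first, then the rest
            data, self._buf = self._buf + self._raw.read(), b""
            return data
        while len(self._buf) < size:
            chunk = self._raw.read(self._chunk_size)
            if not chunk:
                break
            self._buf += chunk
        data, self._buf = self._buf[:size], self._buf[size:]
        return data

    def __iter__(self):
        return self

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line
```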

Another alternative is to make read and readline fail noisily after 
iteration starts (unless cleared by seek()).
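
A sketch of that policy as a wrapper class, under the assumption that a 
ValueError is an acceptable way to fail. NoisyFile is a hypothetical name, 
and the wrapper models only the rule: its own iteration goes through 
readline(), so nothing actually gets out of sync here.

```python
import io

class NoisyFile:
    """Wrapper whose read()/readline() raise once iteration has started."""

    def __init__(self, fileobj):
        self._f = fileobj
        self._iterating = False

    def __iter__(self):
        return self

    def __next__(self):
        self._iterating = True
        line = self._f.readline()
        if not line:
            self._iterating = False   # reached EOF: position is well-defined again
            raise StopIteration
        return line

    def _check(self):
        if self._iterating:
            raise ValueError("read method used after iteration started; seek() first")

    def readline(self):
        self._check()
        return self._f.readline()

    def read(self, size=-1):
        self._check()
        return self._f.read(size)

    def seek(self, offset, whence=0):
        self._iterating = False       # an explicit seek clears the condition
        return self._f.seek(offset, whence)
```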

    Oren