Possible read()/readline() bug?

Thu Oct 23 03:20:32 EDT 2008

On Oct 22, 2:54 pm, Mike Kent <mrmak... at cox.net> wrote:
> Before I file a bug report against Python 2.5.2, I want to run this by
> the newsgroup to make sure I'm not being stupid.
>
> I have a text file of fixed-length records I want to read in random
> order.  That file is being changed in real-time by another process,
> and my process wants to see the changes to the file.  What I'm seeing
> is that, once I've opened the file and read a record, all subsequent
> seeks to and reads of that same record will return the same data as
> the first read of the record, so long as I don't close and reopen the
> file.  This indicates some sort of buffering and caching is going on.
>
> Consider the following:
>
> $ echo "hi" >foo.txt  # Create my test file
> $ python2.5              # Run Python
> Python 2.5.2 (r252:60911, Sep 22 2008, 16:13:07)
> [GCC 3.4.6 20060404 (Red Hat 3.4.6-9)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> f = open('foo.txt')  # Open my test file
> >>> f.seek(0)                # Seek to the beginning of the file
> >>> f.readline()             # Read the line, I get the data I expected
> 'hi\n'
> >>> # At this point, in another shell I execute 'echo "bye" >foo.txt'.  'foo.txt' now has been changed

I thought this might be a case where the shell unlinks foo.txt and
creates a new file... but it doesn't for me, and I still get the same
behavior as you.  It is indeed the buffering that's causing this.

> >>> # on the disk, and now contains 'bye\n'.
> >>> f.seek(0)                # Seek to the beginning of the still-open file
> >>> f.readline()             # Read the line, I don't get 'bye\n', I get the original data, which is no longer there.
> 'hi\n'
> >>> f.close()                 # Now I close the file...
> >>> f = open('foo.txt') # ... and reopen it
> >>> f.seek(0)               # Seek to the beginning of the file
> >>> f.readline()            # Read the line, I get the expected 'bye\n'
> 'bye\n'
>
> It seems pretty clear to me that this is wrong.  If there is any
> caching going on, it should clearly be discarded if I do a seek.

I totally disagree.  If you need to discard the buffers, there's a way
to do it: flush().  If you force seek() to discard perfectly good
buffers you will hurt performance when not dealing with volatile data.

Anyway, in Python 2.x, the behavior of the various file methods is
documented as reflecting the underlying C stdio library.  In fact, the
documentation for seek() specifically says it sets the file's current
position "like stdio's fseek()".  Whatever stdio does is what Python
does.  So even if this behavior were a bug, it would be a bug in
stdio, not in Python.

> Note
> that it's not just readline() that's returning me the wrong, cached
> data, as I've also tried this with read(), and I get the same
> results.  It's not acceptable that I have to close and reopen the file
> before every read when I'm doing random record access.

You can call f.flush() to force it to discard the cache.  Or use
unbuffered I/O.  Better yet, get rid of file I/O altogether and use a
memory-mapped file.
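
A minimal sketch of the last two suggestions, written for Python 3 rather
than the 2.5 shown in the session above (that choice is an assumption, as
are RECORD_SIZE, the helper names, and the scratch file, all made up for
illustration).  It overwrites a fixed-length record in place through a
second handle, standing in for the other process, and re-reads it without
closing anything:

```python
import mmap
import os
import tempfile

RECORD_SIZE = 4  # hypothetical fixed record length, in bytes


def read_record_unbuffered(f, index):
    """Read record `index` from a file opened with buffering=0."""
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE)


def read_record_mmap(m, index):
    """Slice record `index` out of a memory map of the file."""
    start = index * RECORD_SIZE
    return m[start:start + RECORD_SIZE]


def demo():
    # Scratch file standing in for foo.txt, holding one 4-byte record.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as w:
        w.write(b"hi!\n")

    results = []
    # buffering=0 is only allowed in binary mode; every read() goes to the OS.
    with open(path, "rb", buffering=0) as f, \
            open(path, "r+b", buffering=0) as writer:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        results.append((read_record_unbuffered(f, 0), read_record_mmap(m, 0)))

        # Simulate the other process overwriting the record in place.
        writer.seek(0)
        writer.write(b"bye\n")

        # Same open file and same map, no close/reopen in between.
        results.append((read_record_unbuffered(f, 0), read_record_mmap(m, 0)))
        m.close()

    os.remove(path)
    return results


if __name__ == "__main__":
    for unbuffered, mapped in demo():
        print(unbuffered, mapped)
```

Note that this only covers in-place updates of same-sized records.  A
writer that truncates the file, as 'echo "bye" >foo.txt' does, is a
different situation for a memory map, since the mapped length no longer
matches the file.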

> So, is this a bug, or am I being stupid?

Well, it's not a bug, so....

Seriously, I advise you not to submit a bug report.  That doesn't mean
you're stupid; maybe you didn't know about unbuffered I/O or the
flush() method.  That just means you're uneducated. :)   But please
leave seek() out of it.

Carl Banks