Puzzled: Y am I ending up with extra bytes?

Tim Peters tim.one at comcast.net
Sat Feb 23 17:31:29 EST 2002


[A.Newby]
> Why is this happening? I read large chunks of data from a text file,
> according to byte locations specified on another file, and for
> some reason, this function (below), spits out a few extra bytes.
>
> Here's the code, as entered into the Python shell......
>
> ...
> 	    	chat = open('D:\cgi-bin\log.txt', 'r')

Bingo.  From the file name, you're running on Windows.  Change 'r' to 'rb'
and see what happens.

Usenet doesn't have enough space <wink> to explain what Windows does with
seeks for files opened in text mode.  seek() and tell() act very strangely
on text-mode files.

Note that, after opening it in binary mode instead, you're going to have to
deal with Windows '\r\n' line endings yourself (indeed, the magical
transformation of \r\n to \n when you open it in text mode is the cause of
your problems (well, it's a primary cause, anyway)).
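
A minimal sketch of that difference, assuming a current Python 3 (where text
mode translates '\r\n' to '\n' on every platform, not just Windows).  The
sample data and the temporary file are made up for illustration; they are not
the original poster's log:

```python
import os
import tempfile

# Hypothetical sample data with Windows line endings, written as raw bytes.
raw = b"first line\r\nsecond line\r\n"

path = os.path.join(tempfile.mkdtemp(), "log.txt")
with open(path, "wb") as f:
    f.write(raw)

# Text mode: each '\r\n' comes back as a single '\n', so what you read
# is shorter than what is stored on disk.
with open(path, "r", encoding="ascii") as f:
    text = f.read()

# Binary mode: the bytes exactly as stored on disk.
with open(path, "rb") as f:
    data = f.read()

disk_size = os.path.getsize(path)

# In binary mode, stripping the line endings is your job.
lines = [line.rstrip(b"\r\n") for line in data.splitlines(keepends=True)]

print(len(text), len(data), disk_size)
print(lines)
```

The text-mode read comes up one character short per line, which is the kind
of mismatch that makes byte offsets drift.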

Another caution:

>    	#this opens the index file, which has precise byte locations of each
>    	#chunk of data I want to extract from the log.txt file,

You're going to have to think carefully about what "precise" means,
precisely.  If you didn't mean "precise" in the sense of the actual number
of bytes in the file *as stored on disk* (including 2 bytes for each line
end), then opening in binary mode is going to screw up too, because the
index is in fact lying about the disk file then.  The combination of "actual
byte counts" and "binary mode" works fine.
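
Here's one way the "actual byte counts" plus "binary mode" combination might
look, as a sketch in current Python 3.  build_index() and read_chunk() are
hypothetical helpers, and treating each line as one chunk is an assumption
made only to keep the example small; the real index format wasn't shown:

```python
import os
import tempfile

def build_index(path):
    """Return (offset, length) pairs, one per line, in real on-disk bytes."""
    index = []
    with open(path, "rb") as f:          # binary mode: tell() is a byte count
        while True:
            start = f.tell()
            line = f.readline()
            if not line:                 # empty bytes object means end of file
                break
            index.append((start, len(line)))
    return index

def read_chunk(path, offset, length):
    """Read exactly `length` bytes starting at byte `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Hypothetical log file with Windows line endings.
path = os.path.join(tempfile.mkdtemp(), "log.txt")
with open(path, "wb") as f:
    f.write(b"alpha\r\nbeta\r\ngamma\r\n")

index = build_index(path)
offset, length = index[1]
chunk = read_chunk(path, offset, length)
print(index)
print(chunk)
```

Because both the index and the reads use binary mode, the offsets count the
'\r' bytes too, and the chunk comes back with no extras and nothing missing.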

If you try to stick to text mode, you can get *parts* of it to work, by
ensuring the index contains only numbers returned directly by tell() when
the file was open in text mode.  But doing arithmetic on such numbers won't
always act the way you hope.  This isn't Python screwing you, it's how C
defines fseek and ftell in text mode:  tell doesn't necessarily return a
physical byte count for files opened in text mode, and on Windows it
actually doesn't.  Likewise for a file opened in text mode, the *only*
arguments to seek() that are guaranteed to work are

    seek(0, SEEK_SET)  # start of file
    seek(0, SEEK_END)  # end of file
    seek(0, SEEK_CUR)  # don't move at all
    seek(i, SEEK_SET)  # restore previous position, where i is exactly
                       # the value returned by a previous tell
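
For instance, a sketch of the safe text-mode pattern, assuming current
Python 3 (whose text-mode seek() documents the same restrictions).  The file
contents and the name "bookmark" are invented for the example:

```python
import os
import tempfile

# Hypothetical file with Windows line endings.
path = os.path.join(tempfile.mkdtemp(), "log.txt")
with open(path, "wb") as f:
    f.write(b"one\r\ntwo\r\nthree\r\n")

with open(path, "r", encoding="ascii") as f:
    f.readline()                    # skip the first line
    bookmark = f.tell()             # opaque value: store it, do no math on it
    first_read = f.readline()
    f.seek(bookmark, os.SEEK_SET)   # restore exactly the remembered position
    second_read = f.readline()
    f.seek(0, os.SEEK_END)          # offset 0 from the end is also allowed
    at_end = f.read()

print(repr(first_read), repr(second_read), repr(at_end))
```

The bookmark is only ever handed back to seek() unchanged; nothing here
depends on what number it happens to be.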

Note too that if i and j are two values returned by a text-mode tell,
there's no guarantee that i-j (or j-i) bears any relation to the number of
physical bytes between the file positions they represent (and, on Windows,
they don't).  If you want to do arithmetic on tell() results, it will screw
you over and over unless you use binary mode (both when obtaining offsets
via tell(), and when using offsets with seek()).




