file.readline() after a seek() breaking up lines
Jeff Epler
jepler at unpythonic.net
Fri Mar 5 16:08:36 EST 2004
When you open a file in text mode, the only offsets that are valid for
'seek()' are ones returned by 'tell()' (or 0, presumably). In practice,
you can seek to arbitrary offsets on most operating systems, though the
results on Windows are confused by the fact that text files store each
'\n' as the two-byte sequence '\r\n'. This is what the library reference
means when it says:
    If the file is opened in text mode (mode 't'), only offsets returned
    by tell() are legal. Use of other offsets causes undefined behavior.
    http://python.org/doc/lib/bltin-file-objects.html
When you open a file in binary mode, all offsets less than the file
length are valid, but in a text file most of them will be in the middle
of a line. (They're byte offsets into a file you think of as being made
of individual lines.)
So when you seek to a random offset, you are usually in the middle of a
line, and the first readline() returns only the rest of that line.
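
To make the partial-line effect concrete, here is a minimal sketch. The helper name first_line_after_seek and the three-line sample file are made up for illustration; the file is opened in binary mode so that an arbitrary offset is legal.

```python
import os
import tempfile

def first_line_after_seek(path, offset):
    """Seek to a byte offset and return whatever readline() gives back."""
    with open(path, "rb") as f:   # binary mode: arbitrary offsets are allowed
        f.seek(offset)
        return f.readline()

# Throwaway sample file (hypothetical contents, one word per line).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"apple\nbanana\ncherry\n")
    demo_path = tmp.name

# Offset 8 falls inside "banana", so only the tail of that line comes back.
partial = first_line_after_seek(demo_path, 8)
print(partial)   # b'nana\n'
os.remove(demo_path)
```
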
You can do one of several things:
* Read the file and gather all line offsets, then pick one of them
(requires reading the whole file each time)
* Read the file a line at a time and pick the line as you go: when you
read the n'th line, replace the "line to be printed" with it with
probability 1/n. At the end of the file, print the line to be printed.
Every line ends up equally likely to be chosen.
* Read the file once and write an index of line offsets. Then pick a
random offset from the index, seek to it in the original file, and read
a line.
* Pick a byte offset, and discard the first line read. You'll never
use the very first line of the file, and longer lines are preferred
over shorter lines (actually, lines *following* longer lines are
preferred...)
* Pick a byte offset and scan backwards until you get to the start of
the file or the start of a line, then readline. Again, longer lines
are preferred over shorter lines by this method
* Create a record-oriented format, so that you can seek to a multiple
of the record length and read a word. All words must be shorter
than reclen.
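
Two of the options above, the offset index and the backwards scan, might look roughly like this. It is only a sketch: the function names are invented here, the index is kept in memory rather than written to a separate file, the scan steps back one byte at a time for simplicity, and '\n' is assumed to be the line terminator.

```python
import os
import random
import tempfile

def build_line_index(path):
    """Read the file once and return the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():      # empty result means end of file
                break
            offsets.append(pos)
    return offsets

def line_from_index(path, offsets, rng=random):
    """Pick a random entry from the index, seek to it, and read that line."""
    with open(path, "rb") as f:
        f.seek(rng.choice(offsets))
        return f.readline()

def line_containing(path, offset):
    """Scan backwards from a byte offset to the start of its line, then read it."""
    with open(path, "rb") as f:
        pos = offset
        # Stop when the previous byte is a newline or we hit the start of the file.
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                break
            pos -= 1
        f.seek(pos)
        return f.readline()

# Throwaway sample file (hypothetical contents, one word per line).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"apple\nbanana\ncherry\n")
    demo_path = tmp.name

index = build_line_index(demo_path)              # line start offsets
via_index = line_from_index(demo_path, index)    # uniform over lines
via_scan = line_containing(demo_path, 8)         # whole line around byte 8
os.remove(demo_path)
```

The index version picks every line with equal probability. The scan version, fed a random byte offset, still favors longer lines, as noted in the list.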
The old Unix "fortune" program used the second method. I'm sure there
are other things you could do as well.
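
For what it's worth, the second method (replace the kept line 1/n of the time) can be sketched in a few lines. The name pick_random_line is made up for this example, and it accepts any iterable of lines, an open file included.

```python
import random

def pick_random_line(lines, rng=random):
    """Return one item chosen uniformly from an iterable, in a single pass."""
    chosen = None
    for n, line in enumerate(lines, start=1):
        # randrange(n) is 0 with probability 1/n, so the n'th line
        # replaces the current choice 1/n of the time.
        if rng.randrange(n) == 0:
            chosen = line
    return chosen   # None if the iterable was empty
```

With a file it would be used along the lines of pick_random_line(open(path)), where path is whatever word list you have. Only one line is held in memory at a time, and no seek() is needed at all.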
Jeff