file.readline() after a seek() breaking up lines

Fri Mar 5 16:08:36 EST 2004

When you open a file in text mode, the only offsets that are valid for
'seek()' are ones returned by 'tell()' (or 0, presumably).  In practice,
you can seek to arbitrary offsets on most operating systems, though the
results on Windows are confused by the fact that text files store '\n'
as the two-byte sequence '\r\n'.  This is what the library reference
means when it says:
    If the file is opened in text mode (mode 't'), only offsets returned
    by tell() are legal. Use of other offsets causes undefined behavior.
        http://python.org/doc/lib/bltin-file-objects.html

When you open a file in binary mode, all offsets less than the file
length are valid, but in a text file most of them will be in the middle
of a line.  (They're byte offsets into a file you think of as being made
of individual lines.)

So when you seek to a random offset, you usually land in the middle of a
line, and the first readline() returns only the rest of that line.
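
Here is a small sketch of that effect.  It assumes Python 3 and a
throwaway temporary file; the sample data and the offset 8 are made up
purely for illustration.

```python
import os
import tempfile

# Hypothetical sample data: three short lines.
data = b"alpha\nbravo\ncharlie\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # Binary mode, so any byte offset below the file length is valid.
    with open(path, "rb") as f:
        f.seek(8)                 # lands inside "bravo"
        partial = f.readline()    # only the rest of that line
        following = f.readline()  # the next line is whole again
finally:
    os.remove(path)

print(partial)    # b'avo\n'
print(following)  # b'charlie\n'
```

The second readline() is back on a line boundary, which is what the
"discard the first line read" approach relies on.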

You can do one of several things:
* Read the file and gather all line offsets, then pick one of them
  (requires reading the whole file each time)
* Read the file a line at a time and pick the line as you go (if
  this is the nth line, then 1/n of the time replace the "line to be
  printed" with this line.  At the end of the file, print the line to be
  printed)
* Read the file once and write an index of line offsets.  Then, pick a
  random offset from the index, seek to it in the original file, and
  read a line
* Pick a byte offset, and discard the first line read.  You'll never
  use the very first line of the file, and longer lines are preferred
  over shorter lines (actually, lines *following* longer lines are
  preferred...)
* Pick a byte offset and scan backwards until you get to the start of
  the file or the start of a line, then readline.  Again, longer lines
  are preferred over shorter lines by this method
* Create a record-oriented format, so that you can seek to a multiple
  of the record length and read a word.  All words must be shorter
  than reclen.
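
As a rough sketch of two of these (the one-pass 1/n selection and the
offset index), something like the following might do.  It assumes
Python 3 and binary mode so that the offsets are plain byte counts; the
function names and the 'rng' parameter are my own inventions, not
anything from the library.

```python
import random

def pick_line_streaming(path, rng=random):
    """One pass over the file: keep line n with probability 1/n."""
    chosen = None
    with open(path, "rb") as f:
        for n, line in enumerate(f, start=1):
            # randrange(n) is 0 exactly 1/n of the time.
            if rng.randrange(n) == 0:
                chosen = line
    return chosen  # None if the file is empty

def build_line_index(path):
    """Return the byte offset at which each line starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def pick_line_indexed(path, offsets, rng=random):
    """Seek straight to a randomly chosen line start and read that line."""
    with open(path, "rb") as f:
        f.seek(rng.choice(offsets))  # raises IndexError if offsets is empty
        return f.readline()
```

Both give every line the same chance of being picked.  The index
version only pays for the full read once, provided you keep the offsets
around (in memory here; writing them to a separate file is left out of
this sketch).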

The old unix "fortune" program used the second method.  I'm sure there
are other things you could do as well.

Jeff