[Q] File Object -- function 'read'

John Machin machin_john_888 at hotmail.com
Fri Jul 20 05:39:59 EDT 2001


Peter Hansen <peter at engcorp.com> wrote in message news:<3B579358.DCF39F49 at engcorp.com>...
> 
> The main difference between the two is that on Windows
> (and DOS, and anything else in that ugly realm), "text" files
> have 'newlines' (represented as \n but really just LF or 
> linefeed characters, or \x0a) converted to pairs of bytes,
> specifically a Carriage Return (CR) followed by a Line Feed
> (LF).  
> 
You didn't mention the Macintosh realm (no pejorative supplied), which
uses a lone CR, or the IBM mainframe realm, which is something else
again. There's also Unicode, which would like you to use LS (line
separator) and PS (paragraph separator) to say what you really mean
... see Unicode Technical Report #13 for a long list of instructions like
"if you get byte X off system Y and the month has an 'r' in it and the
programmer's name was Fred, then map X into Z; else if ...".

... and in a previous job there was a data supplier who, reliably every
month, gave us records separated not by CRLF but by LFCR; this tended to
send Windows software into gibbering heaps.
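
To make the separator zoo concrete, here is a small sketch in present-day Python 3 (not the Python of this thread). The sample byte strings and the helper name split_report are hypothetical, made up for illustration. It shows how bytes.splitlines() treats each convention, and why LFCR is trouble: the LF and the CR are each counted as a line break of their own, so phantom empty records appear.

```python
# Hypothetical sample data: the same two records under four conventions.
SAMPLES = {
    "LF (Unix)": b"one\ntwo\n",
    "CRLF (DOS/Windows)": b"one\r\ntwo\r\n",
    "CR (classic Mac)": b"one\rtwo\r",
    "LFCR (that supplier)": b"one\n\rtwo\n\r",
}


def split_report(samples=SAMPLES):
    """Split each sample with bytes.splitlines(), which breaks on
    LF, CR, and the CRLF pair (in that exact order only)."""
    return {name: data.splitlines() for name, data in samples.items()}


if __name__ == "__main__":
    for name, lines in split_report().items():
        print(f"{name:22} {lines}")
```

The first three conventions each come back as two records. The LFCR sample comes back as four, two of them empty, because only CR followed by LF is recognised as a single separator.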

There is a much more hideous convention in DOS/Windows: an embedded
control-Z is taken to mean end-of-file. This doesn't suck, it bites,
and bites badly, and it calls for something like the following (where the
docstring is lightly edited to protect the guilty):

def cleaned_lines(fh):
   """
   Return a list of lines from the file object fh
   (which must have been opened in binary mode).
   CR (0x0D) and LF (0x0A) will be used to split the file
   into lines. Any other characters outside the range
   0x20 to 0x7E (both inclusive) (i.e. characters that are not
   printable ASCII characters) will be changed into "?".
   
   This is all because XXXXXXX don't validate their input. One
   employee has a ctrl-Z character in his YYYYY field. Ctrl-Z
   is the EOF (end-of-file) marker in the old CP/M operating
   system. This nasty idea of a data character meaning EOF was
   inherited by MS-DOS & MS Windows and makes much software
   stop reading a file prematurely in text mode if there is a
   ctrl-Z incorrectly embedded in the data.
   """
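   # translate_string is a 256-character translation table defined
   # elsewhere (not shown here); it maps every unwanted character to "?".
   return fh.read().translate(translate_string).splitlines()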
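
The function above relies on a translation table, translate_string, whose construction is not shown. What follows is a minimal sketch of one way to build such a table, written for Python 3 and bytes rather than the 256-character string the original would have used. The names KEEP and TRANSLATE_TABLE are hypothetical, and the rule (keep CR, LF and printable ASCII, turn everything else into "?") is taken from the docstring, not from the original table.

```python
import io

# Assumption: bytes to keep are printable ASCII (0x20..0x7E) plus CR and LF,
# as the docstring describes. Everything else, ctrl-Z (0x1A) included,
# becomes "?".
KEEP = set(range(0x20, 0x7F)) | {0x0D, 0x0A}

# A 256-entry table, as bytes.translate() requires: entry i is the
# replacement for byte value i.
TRANSLATE_TABLE = bytes(b if b in KEEP else ord("?") for b in range(256))


def cleaned_lines(fh):
    """Return the lines of a binary-mode file object as a list of bytes,
    with anything other than CR, LF and printable ASCII replaced by "?"."""
    return fh.read().translate(TRANSLATE_TABLE).splitlines()


if __name__ == "__main__":
    # An in-memory stand-in for a file opened with mode "rb".
    data = io.BytesIO(b"abc\x1adef\r\nghi\n")
    print(cleaned_lines(data))  # [b'abc?def', b'ghi']
```

Because the file is read in binary mode, the embedded ctrl-Z arrives as an ordinary byte and is replaced like any other unwanted character, so nothing after it is lost.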


By the way, is there a prize for largest ratio of doc to code?


