[I18n-sig] codecs module, readlines and xreadlines

Scott David Daniels Scott.Daniels@Acm.Org
Thu, 16 Jan 2003 13:45:52 -0800


M.-A. Lemburg wrote:
> Poor Yorick wrote:
>> so to be consistent, codecs.open.readlines should do the same thing 
>> and remove \x00\r when the file is opened in text mode.
> But only on Windows, right ? (On Unix text mode and binary mode
> behave identically)

Actually, on Apple's systems, lines are delimited with \r alone,
with no \n.  As painful as it is for me to acknowledge this, Microsoft
is actually the most standards-compliant of the three major
interpretations.  C (and hence Unix) considered that it was redundant
to have two distinct characters indicating end-of-line.

The Unix choice was the only irreversible character of the pair
(the line feed).  For a while, MIT had a non-standard control
character that they called the "line-starve" which reversed the
effect of the line feed.  On the old Teletype Model 33s, though,
the line feed was irreversible, while the carriage return was
simple horizontal positioning (and equivalent to the appropriate
number of backspaces).

Apple, I suspect, was thinking of the analogue to the keyboard.
Very few typists ever type the line feed character; they type
a return which emits the \r character.  Unix solves this by
conversion if the terminal is not in "raw" mode; Apple doesn't
have to make a distinction.

The least reasonable (but most standard-conforming) choice is \r\n, 
which (if you interpret the early ASCII standards literally),
should be interpreted the same as \n\r.  It is also uncomfortably
true that \r\n\n should be exactly equivalent to \r\n\r\n.  So, a
lot of code is simplified if there is a single EOL (End-Of-Line)
character.  C declared this so, and anyone who does not use LF (\n)
as a line delimiter in the environment where their C runtimes work
is supposed to translate their local convention to the C-standard
in the I/O runtimes.
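
A toy model may make the literal reading concrete.  This is only
a sketch: teletype_position is a hypothetical name, and it models
nothing but a print head where \r resets the column, \n advances
the row, and \b steps back one column.

```python
# Hypothetical toy model of a teletype print head (not a real
# terminal emulator).  Returns the final (row, column).
def teletype_position(chars, start=(0, 0)):
    row, col = start
    for ch in chars:
        if ch == "\r":
            col = 0                 # carriage return: horizontal only
        elif ch == "\n":
            row += 1                # line feed: advance one line
        elif ch == "\b":
            col = max(col - 1, 0)   # backspace: one column left
        else:
            col += 1                # any other character prints
    return row, col

# Under this model the order of the pair does not matter ...
same_pair = teletype_position("abc\r\n") == teletype_position("abc\n\r")
# ... and a repeated \r after the first one changes nothing.
same_double = teletype_position("\r\n\n") == teletype_position("\r\n\r\n")
```

Both comparisons hold in this model, which is why code that must
honour the literal interpretation has so many equivalent sequences
to recognise.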

To summarize briefly, after being hopelessly long-winded, Apple
non-raw should probably convert \r to \n, Microsoft non-raw
should similarly convert \r\n to \n.  What should be done in
non-binary mode for the other line terminators in Unicode (I
_think_ some exist) might be a source of hopelessly long-winded
debate.
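
The conversions in the summary might be sketched roughly as
follows.  to_lf is a hypothetical helper, not anything in the
codecs module, and the snippet assumes a Python 3 interpreter,
whose str.splitlines() recognises some of the extra Unicode
terminators.

```python
# Hypothetical helper: translate one platform's text-mode line
# terminator to the single "\n" that C (and Python) code expects.
def to_lf(text, local_eol):
    if local_eol == "\n":        # Unix: nothing to translate
        return text
    return text.replace(local_eol, "\n")

apple = to_lf("one\rtwo\r", "\r")            # classic Mac OS convention
microsoft = to_lf("one\r\ntwo\r\n", "\r\n")  # DOS/Windows convention

# Unicode defines further terminators: NEL (U+0085), LINE SEPARATOR
# (U+2028) and PARAGRAPH SEPARATOR (U+2029).  Python 3's
# str.splitlines() treats all of them as line boundaries.
pieces = "a\x85b\u2028c\u2029d".splitlines()
```

The caller names the local convention explicitly, so "\r\n" is
replaced as a unit and never turns into two line ends.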