Writing a Carriage Return in Unicode

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Nov 21 03:12:51 EST 2009


On Thu, 19 Nov 2009 23:22:22 -0800, Scott David Daniels wrote:

> MRAB wrote:
>> u'\u240D' isn't a carriage return (that's u'\r') but a symbol (a
>> visible "CR" graphic) for carriage return. Windows programs normally
>> expect lines to end with '\r\n'; just use u'\n' in programs and open
>> the text files in text mode ('r' or 'w').
> 
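
To make MRAB's point concrete, here is a minimal sketch in Python 3 syntax (the thread itself is about Python 2's u'' literals, so treat the details as an illustration rather than what anyone here actually ran). It checks what the two characters are, then writes with u'\n' in text mode and looks at the bytes on disk. The helper name write_lines is made up for the example, and newline='\r\n' is passed explicitly only so the result is the same on every platform; on Windows the default text mode does this translation for you.

```python
import os
import tempfile
import unicodedata

# U+240D is a printable picture of "CR"; U+000D is the real control character.
symbol = "\u240d"
carriage_return = "\r"
symbol_name = unicodedata.name(symbol)
symbol_category = unicodedata.category(symbol)        # "So": Symbol, other
cr_category = unicodedata.category(carriage_return)   # "Cc": Control


def write_lines(path, lines, newline=None):
    """Write each line followed by '\\n', letting text mode translate it."""
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        for line in lines:
            f.write(line + "\n")


# Create a scratch file, write Windows-style line endings, read raw bytes.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    write_lines(path, ["first", "second"], newline="\r\n")
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(symbol_name, symbol_category, cr_category)
print(raw)
```

The program only ever writes "\n"; the "\r\n" in the file comes from the text layer.
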
> <rant>
> This is the one thing from standards that I believe Microsoft got right
> where others did not.

Oh please, that's historical revisionism -- \r\n wasn't invented by 
Microsoft. Microsoft didn't "get it right", they simply copied what CP/M 
did, on account of the original MS-DOS being essentially a clone of CP/M.

And of course the use of \r\n predates computers -- CR+LF (Carriage 
Return + LineFeed) were necessary to instruct the print head on teletype 
printers to move down one line and return to the left. It was a physical 
necessity for the oldest computer operating systems, because the only 
printers available were teletypes.


> The ASCII (American Standard Code for Information
> Interchange) standard end of line is _both_ carriage return (\r) _and_
> line feed (\n)

I doubt that very much. Do you have a reference for this?

It is true that the predecessor to ANSI (not ASCII), ASA, specified \r\n 
as the line terminator, but ISO specified that both \n and \r\n should be 
accepted.


> I believe in that order.

You "believe" in that order? But you're not sure?

That's the trouble with \r\n, or \n\r -- it's an arbitrary choice, and 
therefore hard to remember which it is. I've even seen proprietary 
business-to-business software where the developers (apparently) couldn't 
remember which was the standard, so when exporting data to text, you had 
to choose which to use for line breaks.

Of course, being Windows software, they didn't think that you might want 
to transfer the text file to a Unix system, or a Mac, and so didn't offer 
\n or \r alone as line terminators.
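
As an aside, a reader that accepts all the common conventions is not hard to write. The sketch below (Python 3 syntax; read_lines is a name invented for the example) uses the universal newlines mode of Python's text layer, newline=None, on in-memory bytes. It also shows why the order matters: \n\r is not treated as one line break, so a file written that way grows a blank line after every real one.

```python
import io


def read_lines(raw_bytes):
    """Decode bytes as text with universal newlines (newline=None)."""
    stream = io.TextIOWrapper(
        io.BytesIO(raw_bytes), encoding="ascii", newline=None
    )
    return stream.readlines()


unix_lines = read_lines(b"one\ntwo\n")         # Unix: LF
dos_lines = read_lines(b"one\r\ntwo\r\n")      # DOS/Windows: CR+LF
old_mac_lines = read_lines(b"one\rtwo\r")      # classic Mac OS: CR
reversed_lines = read_lines(b"one\n\rtwo\n")   # LF+CR: seen as two breaks

print(unix_lines, dos_lines, old_mac_lines, reversed_lines)
```
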


> The Unix operating system, in its enthusiasm to make _everything_
> simpler (against Einstein's advice, "Everything should be made as simple
> as possible, but not simpler.") decided that end-of-line should be a
> simple line feed and not carriage return line feed.

Why is it "too simple" to have line breaks be a single character? What is 
the downside of the Unix way? Why is \r\n "better"? We're not using 
teletypes any more.

Or for that matter, what about classic Mac OS, which used a single \r as 
newline?

Other OSes also managed with a single character, such as Commodore (\r), 
Amiga and Multics (both \n)...


> Before they made
> that decision, there was debate about the order of cr-lf or lf-cr, or
> inventing a new EOL character ('\037' == '\x1F' was the candidate).

IBM operating systems that use EBCDIC use the NEL (NExt Line) character 
for line breaks, keeping CR and LF for other uses.

The Unicode standard also specifies that any of the following be 
recognised as line separators or terminators:

LF, CR, CR+LF, NEL, FF (FormFeed, \f), LS (LineSeparator, U+2028) and PS 
(ParagraphSeparator, U+2029).
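
For what it's worth, Python's str.splitlines() on Unicode strings behaves this way. A minimal sketch, in Python 3 syntax, with the sample strings made up for the example; note that splitlines() also splits on a few characters not in the list above (VT and the FS/GS/RS separators), while bytes.splitlines() only knows about \n, \r and \r\n.

```python
# Each separator named above, keyed by the abbreviation used in the list.
separators = {
    "LF": "\n",
    "CR": "\r",
    "CR+LF": "\r\n",
    "NEL": "\x85",
    "FF": "\x0c",
    "LS": "\u2028",
    "PS": "\u2029",
}

# Join two words with each separator and split the result back into lines.
results = {
    name: ("first" + sep + "second").splitlines()
    for name, sep in separators.items()
}

for name, lines in results.items():
    print(name, lines)
```

Every entry comes back as two lines, and CR+LF counts as a single break rather than two.
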


> If you've actually typed on a physical typewriter, you know that moving
> the carriage back is a distinct operation from rolling the platen
> forward; 

I haven't typed on a physical typewriter for nearly a quarter of a 
century.

If you've typed on a physical typewriter, you'll know that to start a new 
page, you have to roll the platen forward until the page ejects, then 
move the typewriter guide forward to leave space, then feed a new piece 
of paper into the typewriter by hand, then roll the platen again until 
the page is under the guide, then push the guide back down again. That's 
FIVE distinct actions, and if you failed to do them, you would type but 
no letters would appear on the (non-existent) page. Perhaps we should 
specify that text files need a five-character sequence to specify a new 
page too?


> both operations are accomplished when you push the carriage
> back using the bar, but you know they are distinct.  Hell, MIT even had
> "line starve" character that moved the cursor up (or rolled the platen
> back).
> </rant>
> 
> Lots of people talk about "dos-mode files" and "windows files" as if
> Microsoft got it wrong; it did not -- Unix made up a convenient fiction
> and people went along with it. (And, yes, if Unix had been there first,
> their convention was, in fact, better).

This makes zero sense. If Microsoft "got it right", then why is the Unix 
convention "convenient" and "better"? Since we're not using teletype 
machines, I would say Microsoft is now using an *inconvenient* fiction.




-- 
Steven


