Reading in cooked mode (was Re: Python MSI not installing, log file showing name of a Vietnamese communist revolutionary)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Mar 23 22:37:32 EDT 2014


On Sun, 23 Mar 2014 12:37:43 +1100, Chris Angelico wrote:

> On Sun, Mar 23, 2014 at 12:07 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Sun, 23 Mar 2014 02:09:20 +1100, Chris Angelico wrote:
>>
>>> On Sun, Mar 23, 2014 at 1:50 AM, Steven D'Aprano
>>> <steve+comp.lang.python at pearwood.info> wrote:
>>>> Line endings are terminators: they end the line. Whether you consider
>>>> the terminator part of the line or not is a matter of opinion (is the
>>>> cover of a book part of the book?) but consider this:
>>>>
>>>>     If you say that the end of lines are *not* part of the line, then
>>>>     that implies that some parts of the file are not inside any line
>>>>     at all. And that would be just weird.
>>>
>>> Not so weird IMO. A file is not a concatenation of lines; it is a
>>> stream of bytes.
>>
>> But a *text file* is a concatenation of lines. The "text file" model is
>> important enough that nearly all programming languages offer a
>> line-based interface to files, and some (Python at least, possibly
>> others) make it the default interface so that iterating over the file
>> gives you lines rather than bytes -- even in "binary" mode.
> 
> And lines are delimited entities. A text file is a sequence of lines,
> separated by certain characters.

Are they really separated, or are they terminated?

    a\nb\n

Three lines or two? If you say three, then you consider \n to be a 
separator; if you say two, you consider it a terminator.

The thing is, both points of view are valid. If \n is a terminator, then 
the above is valid text, but this may not be:

    a\nb\nc

since the last line is unterminated. (You might be generous and allow 
that every line must be terminated except possibly the last. Or you might 
be strict and consider the last line to be broken.)
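Python's own string methods embody both points of view, which makes for a 
quick demonstration:

```python
text = "a\nb\n"

# Separator view: str.split treats \n as a divider, so a trailing
# newline yields a final empty piece -- three pieces here.
print(text.split("\n"))        # ['a', 'b', '']

# Terminator view: str.splitlines treats \n as ending a line, so
# the same string has exactly two lines.
print(text.splitlines())       # ['a', 'b']

# An unterminated final line still counts as a line to splitlines:
print("a\nb\nc".splitlines())  # ['a', 'b', 'c']
```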

In practice, most people swap between one point of view and the other 
without warning: I might say that "a\nb\n" has two lines terminated with 
\n, and then an instant later say that the file ends with a blank line, 
which means it has three lines, not two. Or you might say that "a\nb\n" 
has three lines separated by \n, and an instant later claim that the last 
line contains the letter "b". So common language about text files tends 
to be inconsistent and flip-flop between the two points of view, a bit 
like the Necker Cube optical illusion.

Given that the two points of view are legitimate and useful, how should a 
programming language treat lines? If the language treats the newline as 
separator, and strips it, then those who want to treat it as terminator 
are screwed -- you cannot tell if the last line is terminated or not. But 
if the language treats the newline as a terminator, and so part of the 
line, it is easy for the caller to remove it. The decision ought to be a 
no-brainer: keep the newline in place, let the user strip it if they 
don't want it.
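Python takes that approach, and the difference is easy to see with an 
in-memory file:

```python
import io

data = "spam\neggs"          # note: the last line is unterminated
f = io.StringIO(data)
lines = list(f)              # Python keeps the terminators
print(lines)                 # ['spam\n', 'eggs']

# Because the newline is kept, the original is trivially recovered:
assert "".join(lines) == data

# Strip the newlines and that information is gone:
stripped = [line.rstrip("\n") for line in lines]
# "spam\neggs" and "spam\neggs\n" now look identical.
print(stripped)              # ['spam', 'eggs']
```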

Here's another thought for you: words are separated by spaces. Nobody 
ever considers the space to be part of the word[1]. I think that nearly 
everyone agrees that both "spam eggs" and "spam      eggs" contain two 
words, "spam" and "eggs". I don't think anyone would say that the second 
example includes seven words, five of which are blank. Would we like to 
say that "spam\n\n\n\n\n\neggs" contains two lines rather than seven?


 
>>> (Both interpretations make sense. I just wish the most obvious form of
>>> iteration gave the cleaner/tidier version, or at very least that there
>>> be some really obvious way to ask for lines-without-endings.)
>>
>> There is: call strip('\n') on the line after reading it. Perl and Ruby
>> spell it chomp(). Other languages may spell it differently. I don't
>> know of any language that automatically strips newlines, probably
>> because you can easily strip the newline from the line, but if the
>> language did it for you, you cannot reliably reverse it.
> 
> That's not a tidy way to iterate, that's a way to iterate and then do
> stuff. Compare:
> 
> for line in f:
>     # process line with newline
> 
> for line in f:
>     line = line.strip("\n")
>     # process line without newline, as long as it doesn't have \r\n or
>     something

With universal newline support, you can completely ignore the difference 
in platform-specific end-of-line markers. By default, Python will convert 
them to and from \n when you read or write a text file, and you'll never 
see any difference. Just program using \n in your source code, and let 
Python do the right thing. (If you need to handle end of line markers 
yourself, you can easily disable universal newline support.)
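For example (a throwaway demonstration using a temp file; passing 
newline='' to open() is how you disable the translation and see the raw 
end-of-line markers):

```python
import os
import tempfile

# Write a file with Windows-style line endings, in binary mode so
# nothing is translated on the way out.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write(b"spam\r\neggs\r\n")

# Default text mode: universal newlines -- \r\n comes back as \n.
with open(path) as f:
    print(f.readlines())     # ['spam\n', 'eggs\n']

# newline='' disables translation, exposing the raw markers.
with open(path, newline="") as f:
    print(f.readlines())     # ['spam\r\n', 'eggs\r\n']
```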

f = (line.rstrip('\n') for line in f)
for line in f:
    # process line

Everything[2] in computer science can be solved by an additional layer of 
indirection :-)


[...]
> So why is the delimiter excluded when you treat the file as CSV, but
> included when you treat the file as lines of text?

Because reading lines of text is more general than reading CSV records. 
Therefore it has to make fewer modifications to the raw content.

I once had a Pascal compiler that would insert spaces, indentation, even 
change the case of words. Regardless of what you actually typed, it would 
pretty-print your code, then write the pretty-printed output when you 
saved. Likewise, if you read in a Pascal source file from an external 
editor, then saved it, it would overwrite the original with its pretty-
printed version. That sort of thing may or may not be appropriate for a
high-level tool which is allowed to impose whatever structure it likes on 
its data files, but it would be completely inappropriate for a low-level 
almost-raw process (more like lightly blanched than cooked) like reading 
from a text file in Python.


>> Reading from a file is considered a low-level operation. Reading
>> individual bytes in binary mode is the lowest level; reading lines in
>> text mode is the next level, built on top of the lower binary mode. You
>> build higher protocols on top of one or the other of that mode, e.g.
>> "read a zip file" would be built on top of binary mode, "read a csv
>> file" would be built on top of text mode.
> 
> I agree that reading a binary file is the lowest level. Reading a text
> file is higher level, but to me "reading a text file" means "reading a
> binary file and decoding it into Unicode text", and not "... and
> dividing it into lines". Bear in mind that reading a CSV file can be
> built on top of a Unicode decode, but not on a line-based iteration (in
> case there are newlines inside quotes).

Of course you can build a CSV reader on top of line-based iteration. You 
just need an accumulator inside your parser: if, at the end of the line, 
you are still inside a quoted field, keep processing over the next line.
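A minimal sketch of that accumulator idea -- it handles only plain double 
quotes, with no escaping, so it is an illustration rather than a 
replacement for the csv module:

```python
def csv_records(lines):
    """Yield logical CSV records, joining physical lines for as long
    as an opening quote is still unclosed. Hypothetical sketch:
    handles only bare double quotes, no escaped quotes."""
    buffer = ""
    for line in lines:
        buffer += line
        # An odd number of quotes means we're inside a quoted field,
        # so the newline belongs to the field, not the record.
        if buffer.count('"') % 2 == 0:
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # unterminated final record, pass it through

lines = ['a,"start\n', 'end",b\n', 'c,d\n']
print(list(csv_records(lines)))
# ['a,"start\nend",b\n', 'c,d\n']
```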


>> As a low-level protocol, you ought to be able to copy a file without
>> changing it by reading it in then writing it out:
>>
>> for blob in infile:
>>     outfile.write(blob)
>>
>>
>> ought to work whether you are in text mode or binary mode, so long as
>> the infile and outfile are opened in the same mode. If Python were to
>> strip newlines, that would no longer be the case.
> 
> All you need is a "writeln" method that re-adds the newline, and then
> it's correctly round-tripping, based on what you've already stated about
> the file: that it's a series of lines of text.

No, that can't work. If the last line of the input file lacks a line 
terminator, the writeln will add one. Let's make it simple: if your data 
file consists of only a single line, "spam", the first blob you receive 
will be "spam". If it consists of "spam\n" instead, the first blob you 
receive will also be "spam". Should you call write() or writeln()? 
Whichever you choose, you will get it wrong for some files.
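To make that concrete, here is a sketch (using in-memory streams rather 
than real files) of why keeping the terminator makes copying trivial, 
while a hypothetical writeln() cannot round-trip:

```python
import io

def copy(infile, outfile):
    # Newlines are part of each line, so a plain write round-trips
    # the file exactly, terminated final line or not.
    for line in infile:
        outfile.write(line)

for data in ("spam", "spam\n", "a\nb"):
    out = io.StringIO()
    copy(io.StringIO(data), out)
    assert out.getvalue() == data

# A hypothetical writeln() that always appends '\n' cannot do this:
# both "spam" and "spam\n" arrive as the line "spam" once stripped,
# and whichever choice writeln makes is wrong for one of the two.
```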


> It might not be a
> byte-equivalent round-trip if you're changing newline style, any more
> than it already won't be for other reasons (file encoding, for
> instance). 

Ignore encodings and newline style; they are irrelevant here. So long as 
the reader and the writer use the same settings, the input will be copied 
unchanged.


> By reading the file as a series of Unicode lines, you're
> declaring that it contains lines of Unicode text, not arbitrary bytes,
> and so a valid representation of those lines of Unicode text is a
> faithful reproduction of the file. If you want a byte-for-byte identical
> file, open it in binary mode to do the copy; that's what we learn from
> FTPing files between Linux and Windows.

Both "spam" and "spam\n" are valid Unicode. By stripping the newline, you 
make it impossible to distinguish them on the last line.




[1] For some definition of "nobody". Linguists consider that some words 
contain a space, e.g. "lawn tennis", "science fiction". This is called 
the open or spaced form of compound words. However, the trailing space at 
the end of the word is never considered part of the word.

[2] Except efficiency.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


