Reading in cooked mode (was Re: Python MSI not installing, log file showing name of a Vietnamese communist revolutionary)

Sun Mar 23 23:17:11 EDT 2014

On Mon, Mar 24, 2014 at 1:37 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Sun, 23 Mar 2014 12:37:43 +1100, Chris Angelico wrote:
>> And lines are delimited entities. A text file is a sequence of lines,
>> separated by certain characters.
>
> Are they really separated, or are they terminated?
>
>     a\nb\n
>
> Three lines or two? If you say three, then you consider \n to be a
> separator; if you say two, you consider it a terminator.
>
> The thing is, both points of view are valid. If \n is a terminator, then
> the above is valid text, but this may not be:
>
>     a\nb\nc
>
> since the last line is unterminated. (You might be generous and allow
> that every line must be terminated except possibly the last. Or you might
> be strict and consider the last line to be broken.)

It is a problem, and the correct usage depends on context.

I'd normally say that the first consists of two lines, the first being
"a" and the second being "b", and there is no third blank line. The
first line still doesn't consist of "a\n", though. It's more like how
environment variables are provided to a C program: separated by \0 and
the last one has to be terminated too.

In some situations, you would completely ignore the "c" in the last
example. When you're watching a growing log file, buffering might mean
that you see half of a line. When you're reading MUD text from a
socket, a partial line probably means it's broken across two packets,
and the rest of the line is coming. Either way, you don't process the
"c" in case it's the beginning of a line; you wait till you see the
"\n" separator that says that you now have a complete line. Got some
out-of-band indication that there won't be any more (like an EOF
signal)? Assume that "c" is the whole line, or assume the file is
damaged, and proceed accordingly.
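The "hold back the partial line until you see the \n" approach can be
sketched roughly like this. LineBuffer and its method names are made up
for illustration (they are not from any library), and the strict flag is
one assumed way of choosing between "c is the whole line" and "the file
is damaged" once EOF is known:

```python
class LineBuffer:
    """Accumulate chunks of text and release only complete lines."""

    def __init__(self):
        # Text received so far that has not yet been followed by "\n".
        self._pending = ""

    def feed(self, chunk):
        """Add a chunk; return the lines that this chunk completed."""
        self._pending += chunk
        # Everything before the final "\n" is complete; the remainder
        # (possibly empty) stays buffered until more data arrives.
        *complete, self._pending = self._pending.split("\n")
        return complete

    def close(self, strict=False):
        """Call once EOF is known. Return the leftover line, if any."""
        leftover, self._pending = self._pending, ""
        if leftover and strict:
            raise ValueError("unterminated final line: %r" % leftover)
        return [leftover] if leftover else []


buf = LineBuffer()
first = buf.feed("a\nb\nc")   # "c" is held back, not processed
second = buf.feed("d\n")      # the rest of the line arrives
tail = buf.close()            # nothing left over at EOF
```

Here first is ["a", "b"], second is ["cd"], and tail is empty.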

> Given that the two points of view are legitimate and useful, how should a
> programming language treat lines? If the language treats the newline as
> separator, and strips it, then those who want to treat it as terminator
> are screwed -- you cannot tell if the last line is terminated or not.

That's my point, though. If you want to treat a file as lines, you
usually won't care whether the last one is terminated or not. You'll
have some means of defining lines, which might mean discarding the
last, or whatever it is, but the stream "a\nb\nc" will either become
["a", "b", "c"] or ["a", "b"] or ValueError or something, and that
list of lines is really all you care about. Universal newlines, as you
mention, means that "a\r\nb\r\n" will become the exact same thing as
"a\nb\n", and there's no way to recreate that difference - because it
*does not matter*.
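A minimal sketch of those three outcomes, assuming a hypothetical
split_lines() helper with an "unterminated" policy argument (both names
are invented here for illustration):

```python
def split_lines(text, unterminated="keep"):
    """Turn text into a list of lines, without the line endings.

    unterminated says what to do with a final line lacking "\\n":
    "keep" it, "drop" it, or (any other value) raise ValueError.
    """
    # Fold \r\n and bare \r down to \n, as universal newlines would.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    *lines, last = text.split("\n")
    if last == "":
        # The text ended with a newline (or was empty): nothing dangling.
        return lines
    if unterminated == "keep":
        return lines + [last]
    if unterminated == "drop":
        return lines
    raise ValueError("unterminated final line: %r" % last)


kept = split_lines("a\nb\nc")                          # ['a', 'b', 'c']
dropped = split_lines("a\nb\nc", unterminated="drop")  # ['a', 'b']
same = split_lines("a\r\nb\r\n") == split_lines("a\nb\n")  # True
```

Whichever policy is chosen, the caller only ever sees the list of lines.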

> Here's another thought for you: words are separated by spaces. Nobody
> ever considers the space to be part of the word[1]. I think that nearly
> everyone agrees that both "spam eggs" and "spam      eggs" contain two
> words, "spam" and "eggs". I don't think anyone would say that the second
> example includes seven words, five of which are blank. Would we like to
> say that "spam\n\n\n\n\n\neggs" contains two lines rather than seven?

Ahh, that's a tricky one. For the simple concept of iterating over the
lines in a file, I would have to say that it's seven lines, five of
which are blank, same as "spam      eggs".split(" ") returns a
seven-element list. The tricky bit is that the term "word" means
"*non-empty* sequence of characters", which means that after splitting
on spaces, you discard all empty tokens in the list; but normally
"line" does NOT have that non-empty qualifier. However, a double
newline often means "paragraph break" as opposed to "line break", so
there's additional meaning applied there; that might be four
paragraphs, the last one unterminated (and a paragraph might well be
terminated by a single newline rather than two), and in some cases
might be squished to just two paragraphs because the paragraph itself
is required to be non-empty.
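For concreteness, here is how the word, line and paragraph readings come
out with plain str.split (only built-in string methods are used; the
variable names are just for illustration):

```python
words_text = "spam" + " " * 6 + "eggs"
lines_text = "spam" + "\n" * 6 + "eggs"

# Splitting on a single space keeps the empty tokens between spaces.
tokens = words_text.split(" ")     # 7 items, 5 of them empty
# split() with no argument applies the "non-empty" rule for words.
words = words_text.split()         # ['spam', 'eggs']

# Lines have no non-empty qualifier, so the blank ones count.
lines = lines_text.split("\n")     # 7 items, 5 of them empty

# Treating a double newline as a paragraph break gives four pieces...
paragraphs = lines_text.split("\n\n")          # ['spam', '', '', 'eggs']
# ...or two, if a paragraph is required to be non-empty.
non_empty = [p for p in paragraphs if p]       # ['spam', 'eggs']
```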

> With universal newline support, you can completely ignore the difference
> in platform-specific end-of-line markers. By default, Python will convert
> them to and from \n when you read or write a text file, and you'll never
> see any difference. Just program using \n in your source code, and let
> Python do the right thing. (If you need to handle end of line markers
> yourself, you can easily disable universal newline support.)

So why should we have to explicitly disable universal newlines to undo
the folding of \r\n and \n down to a single "end of line" indication,
but by default still have to deal with the difference between a final
\n and its absence at the end of the file? Surely those are parallel.
In each case, you're taking the set of lines as your important content,
and folding together distinctions that don't matter.
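The asymmetry can be seen with io.TextIOWrapper, which takes the same
newline argument as open(). This is only a sketch over an in-memory
byte string; read_lines() is a throwaway helper, not a standard
function:

```python
import io

RAW = b"a\r\nb\nc"   # mixed line endings, last line unterminated


def read_lines(raw, newline):
    """Decode raw bytes and iterate lines with the given newline mode."""
    wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii",
                               newline=newline)
    return list(wrapper)


# Default (newline=None): \r\n and \n are folded together, but the
# missing final newline is still visible on the last line.
translated = read_lines(RAW, None)    # ['a\n', 'b\n', 'c']

# newline="": lines are still recognised, but endings are left alone.
untranslated = read_lines(RAW, "")    # ['a\r\n', 'b\n', 'c']
```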

> I once had a Pascal compiler that would insert spaces, indentation, even
> change the case of words. Regardless of what you actually typed, it would
> pretty-print your code, then write the pretty-printed output when you
> saved. Likewise, if you read in a Pascal source file from an external
> editor, then saved it, it would overwrite the original with its
> pretty-printed version. That sort of thing may or may not be appropriate for a
> high-level tool which is allowed to impose whatever structure it likes on
> its data files, but it would be completely inappropriate for a low-level
> almost-raw process (more like lightly blanched than cooked) like reading
> from a text file in Python.

GW-BASIC used to do something similar, always upper-casing keywords
like "print" and "goto", and putting exactly one space between the
line number and the code; in the file that it stored on the disk, and
probably what it stored in memory, those were stored as single tokens.
Obviously the process of turning "print" into a one-byte marker and
then back into a word is lossy, so the result comes out as "PRINT"
regardless of how you typed it. Not quite the same, but it does give a
justification for the conversion (hey, it was designed so you could
work off floppy disks, so space was important), and of course the
program would run just the same.

>> I agree that reading a binary file is the lowest level. Reading a text
>> file is higher level, but to me "reading a text file" means "reading a
>> binary file and decoding it into Unicode text", and not "... and
>> dividing it into lines". Bear in mind that reading a CSV file can be
>> built on top of a Unicode decode, but not on a line-based iteration (in
>> case there are newlines inside quotes).
>
> Of course you can build a CSV reader on top of line-based iteration. You
> just need an accumulator inside your parser: if, at the end of the line,
> you are still inside a quoted field, keep processing over the next line.

Sure, but that's reaching past the line-based iteration. You can't
give it a single line and get back the split version; it has to be a
stateful parser that comprehends the whole file. But you can give it
Unicode data and have it completely ignore the byte stream that
produced it - which you can't do with, say, a zip reader.
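A small illustration using the standard csv module on already-decoded
text. Note that csv.reader is itself fed line by line and keeps state
across lines, so it fits the "accumulator" description too; what does
not work is splitting each line on its own. The sample data is made up:

```python
import csv
import io

# One header row and one record whose second field contains a newline.
data = 'name,comment\r\nalice,"line one\nline two"\r\n'

# Treating each line independently breaks the quoted field apart.
naive = [line.split(",") for line in data.splitlines()]

# The csv reader carries state across lines and keeps the field whole.
# newline="" follows the csv docs' advice to leave line endings alone.
rows = list(csv.reader(io.StringIO(data, newline="")))
```

Here naive has three "rows" and rows has two, the second being
['alice', 'line one\nline two'].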

>> All you need is a "writeln" method that re-adds the newline, and then
>> it's correctly round-tripping, based on what you've already stated about
>> the file: that it's a series of lines of text.
>
> No, that can't work. If the last line of the input file lacks a line
> terminator, the writeln will add one. Let's make it simple: if your data
> file consists of only a single line, "spam", the first blob you receive
> will be "spam". If it consists of "spam\n" instead, the first blob you
> receive will also be "spam". Should you call write() or writeln()?
> Whichever you choose, you will get it wrong for some files.

But you'll produce a file full of lines. You might not have something
perfectly identical, byte for byte, but it will have the same lines,
and the process will be idempotent.
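A sketch of that claim, working on strings rather than files.
read_lines(), write_lines() and rewrite() are hypothetical stand-ins for
a newline-stripping iterator and a writeln-style writer:

```python
def read_lines(text):
    """Yield-style reader: lines without their trailing newline."""
    *lines, last = text.split("\n")
    # An unterminated final line still counts as a line.
    return lines + ([last] if last else [])


def write_lines(lines):
    """writeln-style writer: every line gets a newline re-added."""
    return "".join(line + "\n" for line in lines)


def rewrite(text):
    """Round-trip text through the line-based reader and writer."""
    return write_lines(read_lines(text))


once = rewrite("spam")          # 'spam\n', not byte-identical to input
twice = rewrite(once)           # 'spam\n' again: no further change
same_lines = read_lines("spam") == read_lines("spam\n")   # True
```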

>> It might not be a
>> byte-equivalent round-trip if you're changing newline style, any more
>> than it already won't be for other reasons (file encoding, for
>> instance).
>
> Ignore encodings and newline style. They are irrelevant. So long as the
> input and output writer use the same settings, the input will be copied
> unchanged.

Newline style IS relevant. You're saying that this will copy a file perfectly:

out = open("out", "w")
for line in open("in"):
    out.write(line)

but it wouldn't if the iteration and write stripped and recreated
newlines? Incorrect, because with universal newlines (the default in
Python 3) even this version normalizes line endings; on Unix it will
collapse \r\n into \n. It's still a *text file copy*. (And yes, I know
about 'with'. Shut up.) It's idempotent, not byte-for-byte perfect.
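This can be checked with a throwaway file in a temporary directory. The
sketch assumes Python 3 defaults for open(); text_copy() is just the
loop above wrapped in a function, and the file names are arbitrary. The
exact bytes written depend on os.linesep, so the expected value is
computed rather than hard-coded:

```python
import os
import tempfile


def text_copy(src, dst):
    """Copy src to dst line by line, in default text mode."""
    with open(src) as infile, open(dst, "w") as outfile:
        for line in infile:
            outfile.write(line)


original = b"a\r\nb\nc"   # mixed line endings, no final newline

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in")
    dst = os.path.join(tmp, "out")
    again = os.path.join(tmp, "out2")
    with open(src, "wb") as f:
        f.write(original)
    text_copy(src, dst)      # first copy normalizes line endings
    text_copy(dst, again)    # copying the copy changes nothing more
    with open(dst, "rb") as f:
        first = f.read()
    with open(again, "rb") as f:
        second = f.read()

# Every line ending comes out as the platform's own.
expected = b"a\nb\nc".replace(b"\n", os.linesep.encode("ascii"))
```

With mixed endings in the input, first differs from original on any
platform, while second is identical to first.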

ChrisA