Reading in cooked mode (was Re: Python MSI not installing, log file showing name of a Vietnamese communist revolutionary)

Chris Angelico rosuav at gmail.com
Sat Mar 22 21:37:43 EDT 2014


On Sun, Mar 23, 2014 at 12:07 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Sun, 23 Mar 2014 02:09:20 +1100, Chris Angelico wrote:
>
>> On Sun, Mar 23, 2014 at 1:50 AM, Steven D'Aprano
>> <steve+comp.lang.python at pearwood.info> wrote:
>>> Line endings are terminators: they end the line. Whether you consider
>>> the terminator part of the line or not is a matter of opinion (is the
>>> cover of a book part of the book?) but consider this:
>>>
>>>     If you say that the end of lines are *not* part of the line, then
>>>     that implies that some parts of the file are not inside any line at
>>>     all. And that would be just weird.
>>
>> Not so weird IMO. A file is not a concatenation of lines; it is a stream
>> of bytes.
>
> But a *text file* is a concatenation of lines. The "text file" model is
> important enough that nearly all programming languages offer a line-based
> interface to files, and some (Python at least, possibly others) make it
> the default interface so that iterating over the file gives you lines
> rather than bytes -- even in "binary" mode.

And lines are delimited entities. A text file is a sequence of lines,
separated by certain characters.

>> (Both interpretations make sense. I just wish the
>> most obvious form of iteration gave the cleaner/tidier version, or at
>> very least that there be some really obvious way to ask for
>> lines-without-endings.)
>
> There is: call strip('\n') on the line after reading it. Perl and Ruby
> spell it chomp(). Other languages may spell it differently. I don't know
> of any language that automatically strips newlines, probably because you
> can easily strip the newline from the line, but if the language did it
> for you, you cannot reliably reverse it.

That's not a tidy way to iterate, that's a way to iterate and then do
stuff. Compare:

for line in f:
    # process line with newline

for line in f:
    line = line.strip("\n")
    # process line without newline, as long as it doesn't have \r\n or something

for line in f:
    line = line.split("$")
    # process line as a series of dollar-delimited fields

The second one is more like the third than the first. Python does not
offer a tidy way to do the common thing, which is reading the content
of the line without its terminator.
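For what it's worth, the tidy spelling is easy enough to build yourself, even though Python doesn't ship it; here's one possible helper (the name `lines` is mine, not anything standard):

```python
import io

def lines(f):
    """Yield each line of f with its trailing newline removed."""
    for line in f:
        yield line.rstrip("\n")

# Demo on an in-memory file standing in for a real one.
f = io.StringIO("alpha\nbeta\ngamma\n")
result = [line for line in lines(f)]
# result == ["alpha", "beta", "gamma"]
```

With that, the loop body stays as clean as the plain "for line in f" form, which is the whole complaint above.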

>> Imagine the output of GNU find as a series of
>> records. You can ask for those to be separated by newlines (the default,
>> or -print), or by NULs (with the -print0 command). In either case, the
>> records do not *contain* that value, they're separated by it; the
>> records consist of file names.
>
> I have no problem with that: when interpreting text as a record with
> delimiters, e.g. from a CSV file, you normally exclude the delimiter.
> Sometimes the line terminator does double-duty as a record delimiter as
> well.

So why is the delimiter excluded when you treat the file as CSV, but
included when you treat the file as lines of text?
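The asymmetry is easy to see with the standard library; a minimal illustration on the same one-line input:

```python
import csv
import io

text = "a,b,c\n"

# Iterating the data as lines of text: the newline terminator is included.
line = next(iter(io.StringIO(text)))
# line == "a,b,c\n"

# Reading the same data as CSV: the comma delimiters (and the newline,
# doing double duty as record separator) are excluded from the fields.
row = next(csv.reader(io.StringIO(text)))
# row == ["a", "b", "c"]
```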

> Reading from a file is considered a low-level operation. Reading
> individual bytes in binary mode is the lowest level; reading lines in
> text mode is the next level, built on top of the lower binary mode. You
> build higher protocols on top of one or the other of that mode, e.g.
> "read a zip file" would be built on top of binary mode, "read a csv file"
> would be built on top of text mode.

I agree that reading a binary file is the lowest level. Reading a text
file is higher level, but to me "reading a text file" means "reading a
binary file and decoding it into Unicode text", and not "... and
dividing it into lines". Bear in mind that reading a CSV file can be
built on top of a Unicode decode, but not on a line-based iteration
(in case there are newlines inside quotes).
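The quoted-newline point is worth demonstrating: the csv module consumes a character stream and keeps an embedded newline as part of a field, where naive line iteration would split the record in two. A small sketch:

```python
import csv
import io

# A CSV record whose quoted second field contains a newline.
data = 'name,notes\nwidget,"line one\nline two"\n'

rows = list(csv.reader(io.StringIO(data)))
# rows == [["name", "notes"], ["widget", "line one\nline two"]]
# Plain line iteration over the same data would yield THREE "lines",
# breaking the second record apart.
```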

> As a low-level protocol, you ought to be able to copy a file without
> changing it by reading it in then writing it out:
>
> for blob in infile:
>     outfile.write(blob)
>
>
> ought to work whether you are in text mode or binary mode, so long as the
> infile and outfile are opened in the same mode. If Python were to strip
> newlines, that would no longer be the case.

All you need is a "writeln" method that re-adds the newline, and then
it's correctly round-tripping, based on what you've already stated
about the file: that it's a series of lines of text. It might not be a
byte-equivalent round-trip if you're changing newline style, any more
than it already won't be for other reasons (file encoding, for
instance). By reading the file as a series of Unicode lines, you're
declaring that it contains lines of Unicode text, not arbitrary bytes,
and so a valid representation of those lines of Unicode text is a
faithful reproduction of the file. If you want a byte-for-byte
identical file, open it in binary mode to do the copy; that's what we
learn from FTPing files between Linux and Windows.
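The "writeln" idea sketched above might look like this (`writeln` is a hypothetical helper, not a real file method):

```python
import io

def writeln(outfile, line):
    """Write one line of text, re-adding the terminator."""
    outfile.write(line + "\n")

infile = io.StringIO("first\nsecond\nthird\n")
outfile = io.StringIO()
for line in infile:
    writeln(outfile, line.rstrip("\n"))
# outfile.getvalue() == "first\nsecond\nthird\n"
```

Note that an input file lacking a final newline would gain one on the way out, which is exactly the kind of non-byte-equivalence conceded above: the round trip is faithful at the lines-of-text level, not the byte level.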

> (Even high-level protocols should avoid unnecessary modifications to
> files. One of the more annoying, if not crippling, limitations to the
> configparser module is that reading an INI file in, then writing it out
> again destroys the high-level structure of the file: comments and blank
> lines are stripped, and records may be re-ordered.)

Precisely. If you read it as an INI file and then rewrite it as an INI
file, you risk damaging that sort of thing. If you parse a file as a
Python script, and then reconstitute it from the AST (with one of the
unparsers available), you have a guarantee that the result will
execute the exact same code. But it won't be the same file (although
Python's AST does guarantee order, unlike your INI file example).
Actually, this might be a useful transformation to do sometimes, as
part of a diff suite: if the old and new versions are identical after
an AST parse/unparse transformation, you don't need to re-run tests,
because no code bug can have been introduced.
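That diff-suite check is a few lines with the stdlib unparser (ast.unparse, Python 3.9+); `same_code` is my name for it:

```python
import ast

def same_code(src_a, src_b):
    """True if the two sources unparse to the same program text
    after an AST round trip (comments and formatting ignored)."""
    return ast.unparse(ast.parse(src_a)) == ast.unparse(ast.parse(src_b))

old = "x=1 # set x\ny = x+2\n"
new = "x = 1\ny = x + 2\n"
# same_code(old, new) is True: only comments and whitespace differ.
```

One caveat: "identical AST" guarantees identical behaviour of the code itself, but a reformat could still matter to anything that inspects the source text, such as tracebacks or doctest output.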

ChrisA


