[Python-ideas] Iterating non-newline-separated files should be easier

Nick Coghlan ncoghlan at gmail.com
Sat Jul 19 11:27:49 CEST 2014


On 19 July 2014 05:01, Steven D'Aprano <steve at pearwood.info> wrote:
> On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote:
> But in a case like that, the function is already buggy. I can see at
> least two problems with such an assumption:
>
> - what if universal newlines has been turned off and you're reading
>   a file created under (e.g.) classic Mac OS or RISC OS?

That's exactly the point though - people *do* assume "\n", and we've
gone to great lengths to make that assumption *more correct* (even
though it's still wrong sometimes).

We can't reverse course on that, and expect the outcome to make sense
to *people*. When using a configurable line endings feature breaks
their code (and it will), they're going to be confused, and the docs
likely aren't going to help much.

> - what if the file contains a single line which does not end with an
>   end of line character at all?
>
>    open('/tmp/junk', 'wb').write(b"hello world!")
>    next(open('/tmp/junk', 'r'))
>
> Have I missed something?
>
>
> Although I don't mind whether files grow a readrecords() method, or
> re-use the readlines() method, I'm not convinced that API decisions
> should be driven solely by the needs of programs which are already
> buggy.

It's not being driven by the needs of programs that are already buggy
- my preferences are driven by the fact that line endings and record
separators are *not the same thing*.  Thinking that they are is a
matter of confusing the conceptual data model with the implementation
of the framing at the serialisation layer. If we *do* try to treat
them as the same thing, then we have to go find *every single
reference* to line endings in the documentation and add a caveat about
it being configurable at file object creation time, so it might
actually be based on something completely arbitrary.

Line endings are *already* confusing enough that the "universal
newlines" mechanism was added to make it so that Python level code
could mostly ignore the whole "\n" vs "\r" vs "\r\n" distinction, and
just assume "\n" everywhere.
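
As a rough illustration of what universal newlines does for text
streams, here is a minimal sketch using io.TextIOWrapper over an
in-memory buffer. The sample bytes and variable names are made up for
the example. The default newline=None translates every ending to "\n",
while newline="" still splits on all three endings but passes them
through untranslated:

```python
import io

# Sample data mixing Windows, classic Mac OS and Unix line endings.
raw = b"alpha\r\nbeta\rgamma\n"

# newline=None (the default): universal newlines translates every
# recognised line ending to "\n" before the caller sees it.
translated = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=None)
lines_translated = list(translated)

# newline="": the same endings are recognised for splitting, but they
# are returned to the caller untranslated.
untranslated = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline="")
lines_untranslated = list(untranslated)

print(lines_translated)    # ['alpha\n', 'beta\n', 'gamma\n']
print(lines_untranslated)  # ['alpha\r\n', 'beta\r', 'gamma\n']
```

Code that checks line.endswith("\n") only holds for every line in the
first case.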

This is why I'm a fan of keeping things comparatively simple, and just
adding a new method (if we only add an iterator version) or two (if we
add a list version as well) specifically for this use case.
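
To make the "separate method" idea concrete, here is a sketch of what
an iterator version might do, written as a standalone helper rather
than a file method. The name iter_records, its parameters, and the
choice to strip the separator from each record are all assumptions
made for illustration; none of them comes from an agreed API:

```python
import io

def iter_records(fileobj, separator, chunk_size=8192):
    """Yield records from fileobj split on separator (hypothetical helper).

    Works for text or binary files as long as separator has the same
    type as the data returned by fileobj.read(). The separator itself
    is not included in the yielded records.
    """
    if not separator:
        raise ValueError("separator must be non-empty")
    buffer = separator[:0]  # empty str or bytes, matching the separator
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(separator)
        # The last piece may be an incomplete record (or hold part of a
        # separator split across chunks), so keep it for the next round.
        buffer = parts.pop()
        for part in parts:
            yield part
    if buffer:
        # Final record with no trailing separator.
        yield buffer

# NUL-separated binary data, e.g. the output of "find -print0".
records = list(iter_records(io.BytesIO(b"a\0bb\0ccc"), b"\0", chunk_size=2))
print(records)  # [b'a', b'bb', b'ccc']
```

Keeping this apart from line iteration means readline(), readlines()
and "for line in f" retain their documented line-ending behaviour.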

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

