[Python-ideas] Iterating non-newline-separated files should be easier

Andrew Barnert abarnert at yahoo.com
Sun Jul 20 01:28:55 CEST 2014


(replies to multiple messages here)

On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:


>On 19 July 2014 03:32, Chris Angelico <rosuav at gmail.com> wrote:
>> On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>> I still favour my proposal there to add a separate "readrecords()"
>>> method, rather than reusing the line based iteration methods - lines
>>> and arbitrary records *aren't* the same thing
>>
>> But they might well be the same thing. Look at all the Unix commands
>> that usually separate output with \n, but can be told to separate with
>> \0 instead. If you're reading from something like that, it should be
>> just as easy to split on \n as on \0.
>
>Python isn't Unix, and Python has never supported \0 as a "line
>ending".

Well, yeah, but Python is used on Unix, and it's used to write scripts that interoperate with other Unix command-line tools.

For the record, the reason this came up is that someone was trying to use one of my scripts in a pipeline with find -print0, and he had no problem adapting the Perl scripts he was using to handle the NUL-separated output, but had no clue how to do the same with my Python script. 

In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.
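For comparison, here is roughly what such a script has to do today to consume NUL-separated input from stdin. This is only a minimal sketch: handle() stands in for whatever per-path work the script does, and slurping the whole stream obviously doesn't scale to huge input:

    import os, sys

    # Read the whole NUL-separated stream at once and split it ourselves,
    # because plain file iteration only knows how to split on newlines.
    data = sys.stdin.buffer.read()
    for raw in data.split(b'\0'):
        if raw:                         # skip the empty field after the trailing \0
            handle(os.fsdecode(raw))    # handle() stands in for the per-path work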

> Changing the meaning of existing constructs is fraught with
>complexity, and should only be done when there is absolutely no
>alternative. In this case, there's an alternative: a new method,
>specifically for reading arbitrary records.

This was basically my original suggestion, so obviously I don't think it's a terrible idea. But I don't think it's as good.

First, which of these is more readable, easier for novices to figure out how to write, etc.:

    with open(path, newline='\0') as f:
        for line in f:
            handle(line.rstrip('\0'))

    with open(path) as f:
        for line in iter(lambda: f.readrecord('\0'), ''):
            handle(line.rstrip('\0'))

Second, as Guido mentioned at the start of this thread, existing file-like object types (whether they implement BufferedIOBase or TextIOBase, or just duck-type the interfaces) are not going to have the new functionality. Construction has never been part of the file-like object API; opening a real file has always looked different from opening a member file in a zip archive or making a file-like wrapper around a socket transport or whatever. But using the resulting object has always been the same. Adding a readrecord method, or changing the interface of readline, means that's no longer true.

There might be a good argument for making the change more visible—that is, using a different parameter on the open call instead of reusing the existing newline. (And that's what Alexander originally suggested as an alternative to my readrecord idea.) That way, it's much more obvious that spam.open or eggs.makefile or whatever doesn't support alternate line endings, without having to read its documentation on what newline means. But either way, I think it should go in the open function, not the file-object API.


On Saturday, July 19, 2014 2:28 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> - my preferences are driven by the fact that line endings and record
> separators are *not the same thing*.  Thinking that they are is a
> matter of confusing the conceptual data model with the implementation
> of the framing at the serialisation layer. 

Yes, using lines implicitly as records can lead to confusion—but people actually do that all the time; this isn't a new problem, and it's exactly the same problem with \r\n, or even \n, as with \0. When you open up TextEdit and write a grocery list with one item on each line, those newlines are not part of the items. When you pipe the output of find to a script, the newlines are not part of the filenames. When you pipe the output of find -print0 to a script, the \0 terminators are not part of the filenames.

> Line endings are *already* confusing enough that the "universal
> newlines" mechanism was added to make it so that Python level code
> could mostly ignore the whole "\n" vs "\r" vs "\r\n" distinction, and
> just assume "\n" everywhere.

I understand the point here. There are cases where universal newlines let you successfully ignore the confusion rather than dealing with it, and newline='\0' will not be useful in those cases.

But then newline='\r' is also never useful in those cases. The new behavior will be useful in exactly the cases where '\r' already is—no more, but no less.

> This is why I'm a fan of keeping things comparatively simple, and just
> adding a new method (if we only add an iterator version) or two (if we
> add a list version as well) specifically for this use case.

Actually, the obvious new method is neither the iterator version nor the list version, but a single-record version, readrecord. Sometimes you need to read a single record, just as you sometimes need readline, and a single-record method is conceptually simpler for the user. The implementation is also a lot simpler; readrecord doesn't need to build a new iterator object that holds a reference to the file, the way iterrecords does. And finally, if you only have one of the two, as bad as iter(lambda: f.readrecord('\0'), '') may look to novices, next(f.iterrecords('\0')) would probably be even more confusing.

But we could also add an iterrecords, for two methods.

And as for the list-based version… well, I don't even understand why readlines still exists in 3.x (much less why the tutorial suggests it), so I'd be fine not having a readrecords, but I don't have any real objection.
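To make the parallel concrete, here's how the hypothetical record methods would line up against the existing line-based API (none of these methods exist today; this is just the shape of the proposal):

    # single shot, analogous to readline()
    rec = f.readrecord('\0')

    # lazy iteration, analogous to iterating the file itself
    for rec in f.iterrecords('\0'):
        handle(rec)

    # eager list, analogous to readlines()
    recs = f.readrecords('\0')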

On Saturday, July 19, 2014 1:06 PM, Guido van Rossum <guido at python.org> wrote:

>I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).

I get the feeling either there's a much simpler way to wrap a file object that I'm missing, or that you think there is.

In order to do the equivalent of readrecord, you have to do one of three things:

1. Read character by character, which can be incredibly slow.

2. Peek or push back on the buffer, as the io classes' readline methods do.

3. Put another buffer in front of the file, which means you have two objects both sharing the same file but with effective file pointers out of sync. And you have to reproduce all of the file-like-object API methods for your new buffered object (a lot more work, and a lot more to get wrong—effectively, it means you have to write all of BufferedReader or TextIOWrapper, but modified to wrap another buffered file instead of wrapping the lower-level thing). And no matter how you do it, it's obviously going to be less efficient.

If there's a lighter version of #3 that makes sense, I'm not seeing it. Which is probably a problem with my lack of insight, but I'd appreciate a pointer in the right direction.
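For concreteness, option 1 could look something like this (a free function rather than a method, written for clarity rather than speed, which is exactly the problem with it):

    def readrecord(f, sep='\0'):
        """Read one sep-terminated record from a text file, a character at a time."""
        chars = []
        while True:
            c = f.read(1)
            if not c:          # EOF: return whatever we have, like readline does
                break
            chars.append(c)
            if c == sep:
                break
        return ''.join(chars)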

>I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).


Maybe using a different argument is a better answer. (That's what Alexander suggested originally.)

The reason both I and people on the bug thread suggested using newline instead is that the behavior you want from sep='\0' happens to be identical to the behavior you get from newline='\r', except with '\0' instead of '\r'.

And that's the best argument I have for reusing newline: someone has already worked out and documented all the implications of newline, and people have already learned them, so if we really want the same functionality, it makes sense to reuse it. 
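The equivalence is easy to check with the one alternative separator Python already supports; the pattern below is exactly what newline='\0' would enable, just with a different character:

    with open('groceries.txt', 'w', newline='') as f:
        f.write('spam\reggs\rbeans\r')        # old-Mac-style \r terminators

    with open('groceries.txt', newline='\r') as f:
        for line in f:                        # splits only on \r, returned untranslated
            handle(line.rstrip('\r'))         # 'spam', 'eggs', 'beans'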

But I realize that argument only goes so far. It wasn't obvious, until I looked into it, that I wanted the exact same functionality.

>I personally think it's preposterous to use \0 as a separator for text files (nothing screams binary data like a null byte :-).

Sure, it would have been a lot better for find and friends to grow a --escape parameter instead of -0, but I think that ship has sailed.

>I don't think it's a big deal if a method named readline() returns a record that doesn't end in a \n character.
>
>I value the equivalence of __next__() and readline().
>
>I still think you should solve this using a wrapper class (that does its own buffering if necessary, and implements the rest of the stream protocol for the benefit of other consumers of some of the data).

Again, I don't see any way to do this sensibly that wouldn't be a whole lot more work than just forking the io package.
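The closest thing to a lightweight wrapper I can see is a plain generator that does its own chunked buffering, something like the sketch below. But that only gives you iteration; it doesn't give you back a file-like object, and implementing the rest of the stream protocol on top of it is exactly the work I'm trying to avoid:

    def iterrecords(f, sep='\0', chunksize=8192):
        """Yield sep-terminated records from f, buffering one chunk at a time."""
        buf = ''
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            buf += chunk
            pieces = buf.split(sep)
            buf = pieces.pop()          # the last piece may be an incomplete record
            for piece in pieces:
                yield piece + sep       # keep the separator, like readline does
        if buf:
            yield buf                   # trailing record with no terminator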

But maybe that's the answer: I can write _io2 as a fork of _io with my changes, the same for _pyio2 (for PyPy), and then the only thing left to write is an __init__ for the package that wraps up _io2/_pyio2 in the io ABCs (and re-exports those ABCs).

