[Python-ideas] Iterating non-newline-separated files should be easier

Fri Jul 25 20:29:11 CEST 2014

On Thursday, July 24, 2014 2:08 AM, Akira Li <4kir4.1i at gmail.com> wrote:

> Andrew Barnert <abarnert at yahoo.com> writes:
> 
>>  On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i at gmail.com> wrote:
>>>  In order to newline="\0" case to work, it should behave similar to
>>>  newline='' or newline='\n' case instead i.e., no translation should take
>>>  place, to avoid corrupting embed "\n\r" characters.
>> 
>>  The draft PEP discusses this. I think it would be more consistent to
>>  translate for \0, just like \r and \r\n.
> 
> I read the [draft]. No translation is a better choice here. Otherwise
> (at the very least) it breaks `find -print0` use case.

No, it doesn't. The only reason it breaks your code is that you pass newline='\0' to your stdout wrapper as well as your stdin wrapper. If you passed newline='' for stdout instead, it would not translate anything. And this is exactly parallel with the existing case of, e.g., trying to pass through a classic-Mac file full of '\r'-delimited strings that might contain embedded '\n' characters that you don't want to translate.
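
To illustrate the existing behavior this parallel relies on, here is a minimal sketch (not code from the thread; `written_bytes` is a hypothetical helper made up for this example). It writes the same '\r'-terminated record, with an embedded '\n', through an io.TextIOWrapper opened with newline='' and with newline='\r':

```python
import io


def written_bytes(text, newline):
    """Write text through a TextIOWrapper and return the raw bytes produced."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="ascii", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    return raw.getvalue()


# A classic-Mac style record: terminated by '\r', with an embedded '\n'.
record = "name with\nembedded newline\r"

# newline='' disables output translation, so the embedded '\n' survives.
passthrough = written_bytes(record, newline="")

# newline='\r' translates every '\n' written to '\r', including the embedded one.
translated = written_bytes(record, newline="\r")

print(passthrough)
print(translated)
```

With newline='' the bytes come out exactly as written, while newline='\r' rewrites the embedded '\n'. Under the draft, newline='\0' on output would behave like the second case, which is why the output wrapper should get '' instead.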

As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that.

(It's less of an issue for binary files, because binary files can't take a newline parameter at all today, and because "no output translation" has been part of the definition of what "binary file" means all the way back to Python 1.x.)

> Backwards compatibility is preserved except that newline parameter
> accepts more values.

The same is true with the draft proposal. You've basically copied the exact same thing, except for what happens on output for newlines other than None, '', '\n', '\r', and '\r\n' in text files. Since that case cannot arise today, there are no backward compatibility issues. Your version is only a small change to the documentation and a small change to the code, but my version is an even smaller change to the documentation and no change to the code, so you can't argue this from a conservative point of view.
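
As a quick check that the case really cannot arise today, the sketch below (an illustration only; `newline_is_accepted` is a hypothetical helper) tries each newline value on a text wrapper and also tries to pass newline to a binary open():

```python
import io
import os


def newline_is_accepted(newline):
    """Return True if TextIOWrapper accepts this newline value today."""
    try:
        io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline=newline)
    except (ValueError, TypeError):
        # Current CPython raises ValueError for an unsupported value;
        # TypeError is caught too in case an implementation reports the
        # embedded NUL that way.
        return False
    return True


candidates = [None, "", "\n", "\r", "\r\n", "\0"]
accepted = {nl: newline_is_accepted(nl) for nl in candidates}

# Binary mode rejects the newline argument before the file is even opened.
try:
    open(os.devnull, "rb", newline="")
    binary_rejected = False
except ValueError:
    binary_rejected = True

for nl, ok in accepted.items():
    print(repr(nl), "accepted" if ok else "rejected")
print("binary open with newline rejected:", binary_rejected)
```

Only the five existing values are accepted for text streams, so whatever either proposal specifies for newline='\0' on output cannot change the behavior of code that runs today.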

> 
>>  For your script, there is no reason to pass newline=nl to the
>>  stdout replacement. The only effect that has on output is \n
>>  replacement, which you don't want. And if we removed that effect from
>>  the proposal, it would have no effect at all on output, so why pass
>>  it?
> 
> Keep in mind, I expect that newline='\0' does *not* translate '\n' to
> '\0'. If you remove newline=nl then embed \n might be corrupted

No, it's only corrupted if you _pass_ newline=nl. If you instead passed, e.g., newline='', nothing could possibly be corrupted.

> i.e., it breaks `find -print0` use-case. Both newline=nl for stdout and end=nl
> are required here. Though (optionally) it would be nice to change
> `print()` so that it would use `end=file.newline or '\n'` by default
> instead.

That might be a nice change; I'll mention it in the next draft. But I think it's better to keep the changes as small and conservative as possible, so unless there's an upswell of support for it, I think anything that isn't actually necessary to solving the problem should be left out.
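
For concreteness, here is a rough sketch of what that suggested default could look like if emulated outside print(). Everything here is hypothetical: text streams have no `newline` attribute today (only `newlines`), so `NulStream` fakes one and `print_record` is an invented helper, not a proposed API.

```python
import io


class NulStream(io.TextIOWrapper):
    """Text wrapper carrying a hypothetical `newline` attribute."""
    newline = "\0"  # assumption: the proposal would expose something like this


def print_record(*values, file, sep=" "):
    """Emulate `end=file.newline or '\\n'` as a default for print()."""
    end = getattr(file, "newline", None) or "\n"
    print(*values, sep=sep, end=end, file=file)


raw = io.BytesIO()
# newline='' on the real wrapper: no output translation at all.
out = NulStream(raw, encoding="ascii", newline="")
print_record("a\nb", file=out)  # embedded '\n' must survive
print_record("c", file=out)
out.flush()
result = raw.getvalue()
print(result)

# A stream without the attribute falls back to '\n'.
plain = io.StringIO()
print_record("x", file=plain)
print(repr(plain.getvalue()))
```

Note that the wrapper itself is still opened with newline='', so the terminator comes only from `end` and embedded '\n' characters pass through untouched.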

> There is also line_buffering parameter. From the docs:
> 
>   If line_buffering is True, flush() is implied when a call to write
>   contains a newline character.

The way this is actually defined seems broken to me; IIRC (I'll check the code later) it flushes on any '\r', and on any translated '\n'. So, it's doing the wrong thing with '\r' in most modes, and with '\n' in '' mode on non-Unix systems. So my thought was, just leave it broken.

But now that I think about it, the existing code can only flush excessively, never insufficiently, and that's probably a property worth preserving. So maybe there _is_ a reason to pass newline for output without translation after all. In other words, the parameter may actually conflate _four_ things, not just three...
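
A small experiment along these lines (a sketch for illustration, not a reading of the implementation) shows the "excessive" flush on '\r' in '' mode: nothing reaches the raw stream until a write contains '\r', even though '\r' is not being treated as a line ending for output.

```python
import io

raw = io.BytesIO()
buffered = io.BufferedWriter(raw)
out = io.TextIOWrapper(buffered, encoding="ascii", newline="",
                       line_buffering=True)

out.write("abc")            # no '\n' or '\r': stays buffered
before = raw.getvalue()

out.write("def\r")          # contains '\r': triggers an implied flush()
after = raw.getvalue()

print(before)
print(after)
```

A write containing '\0' would not trigger a flush here, which is the gap that passing a custom newline for output (without translation) could close.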

I'll need to think this through (and reread the code) this weekend; thanks for bringing it up.

>>  Do you have a use case where you need to pass a non-standard newline
>>  to a text file/stream, but don't want newline replacement?
> 
> `find -print0` use case that my code implements above.
> 
>>  Or is it just a matter of avoiding confusion if people accidentally
>>  pass it for stdout when they didn't want it?
> 
> See the explanation above that starts with "Simple things should be simple."

I still don't understand your point here, and just repeating it isn't helping. You're making simple things _less_ simple than they are in the draft, requiring slightly more change to the documentation and to the code and slightly more for people to understand just to allow them to pass an unnecessary parameter. That doesn't sound like an argument from simplicity to me.

But line_buffering definitely might be a good argument, in which case it doesn't matter how good this one is.