[Python-ideas] Iterating non-newline-separated files should be easier

Andrew Barnert abarnert at yahoo.com
Fri Jul 18 08:26:28 CEST 2014


On Jul 17, 2014, at 21:47, Guido van Rossum <guido at python.org> wrote:

> Well, I had to look up the newline option for open(), even though I probably invented it. :-)

While we're at it: apart from open itself, I think most places in the documentation and docstrings that refer to the parameter call it newlines (e.g., io.IOBase.readline). As far as I can tell it's been like that from day one, which shows just how much attention people pay to the current feature. :)

> Would it still apply only to text files?

I think it makes sense to apply it to binary files as well. Splitting binary files on \0 (or, for that matter, \r\n...) is probably at least as common a use case as splitting text files.
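
For comparison, this is roughly what you have to write by hand today to get that behavior on a binary file. It's only a sketch: iter_records, its sep parameter, and the chunk size are names I made up for illustration, not anything in the stdlib, and unlike readline it drops the separator from each record.

```python
import io

def iter_records(f, sep=b"\0", chunk_size=8192):
    """Yield sep-delimited records from a binary file object (hypothetical helper)."""
    if not sep:
        raise ValueError("empty separator")
    buf = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last separator is complete; keep the tail,
        # which may hold a partial record or a partial multi-byte separator.
        *records, buf = buf.split(sep)
        yield from records
    if buf:
        # Trailing data with no final separator.
        yield buf

# Tiny chunk size so that records straddle chunk boundaries.
data = io.BytesIO(b"spam\0eggs\0ham")
records = list(iter_records(data, chunk_size=4))
```

With the proposed change, the same loop would presumably collapse to iterating over open(path, "rb", newline=b"\0"), assuming binary files end up accepting the parameter at all.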

Obviously the special treatment for "" (which, with text files, is more a universal-newlines flag than a newline value) wouldn't carry over to b"". That might as well just be an error (which is what bytes.split does with an empty separator), although I suppose it could also mean to split on every byte. Also, I'm not sure whether the write behavior (translating "\n" to newline) should carry over from text to binary, or whether binary files should just ignore newline on write.
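
To make the current text-mode behavior concrete, here is a small sketch using io.TextIOWrapper over in-memory buffers (the sample data and variable names are just for illustration). It shows newline="" acting as a flag rather than a value, an explicit newline on read, the translation on write, and what bytes.split does with an empty separator.

```python
import io

raw = b"one\r\ntwo\rthree\n"

# newline="" splits on any of "\n", "\r", "\r\n" and leaves the endings untranslated.
universal = list(io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=""))

# newline="\r" splits on "\r" only; other endings stay inside the lines.
cr_only = list(io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline="\r"))

# On write, each "\n" is translated to the given newline string.
out = io.BytesIO()
writer = io.TextIOWrapper(out, encoding="ascii", newline="\r\n")
writer.write("a\nb\n")
writer.flush()
written = out.getvalue()

# bytes.split rejects an empty separator rather than splitting on every byte.
try:
    b"abc".split(b"")
    empty_sep_rejected = False
except ValueError:
    empty_sep_rejected = True
```

None of this says what binary mode should do; it only pins down the text-mode semantics that would or wouldn't carry over.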

> On Thursday, July 17, 2014, Andrew Barnert <abarnert at yahoo.com.dmarc.invalid> wrote:
>> On Jul 17, 2014, at 20:36, Chris Angelico <rosuav at gmail.com> wrote:
>> 
>> > On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>> >> You seem to be talking about the implementation of the change, but what
>> >> is the interface? Having made all these changes, how does it affect
>> >> Python code? You have a use-case of splitting on something other than
>> >> the standard newlines, so how does one do that? E.g. suppose I have a
>> >> file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line
>> >> character. How would I iterate over lines in this file?
>> >
>> > The way I understand it is this:
>> >
>> > for line in open("spam.txt", newline="\u0085"):
>> >    process(line)
>> >
>> > If that's the case, I would be strongly in favour of this. Nice and
>> > clean, and should break nothing; there'll be special cases for
>> > newline=None and newline='', and the only change is that, instead of a
>> > small number of permitted values ('\n', '\r', '\r\n'), any string (or
>> > maybe any one-character string plus '\r\n'?) would be permitted.
>> >
>> > Effectively, it's not "iterate over this file, divided by \0 instead
>> > of newlines", but it's "this file uses the unusual encoding of
>> > newline=\0, now iterate over lines in the file". Seems a smart way to
>> > do it IMO.
>> 
>> Exactly. As soon as Alexander suggested it, I immediately knew it was much better than my original idea.
>> 
>> (Apologies for overestimating the obviousness of that.)
>> 
>> 
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas at python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/
> 
> 
> -- 
> --Guido van Rossum (on iPad)