[Python-ideas] Iterating non-newline-separated files should be easier

Fri Jul 18 00:37:58 CEST 2014

On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert <
abarnert at yahoo.com.dmarc.invalid> wrote:

> On Thursday, July 17, 2014 1:48 PM, Guido van Rossum <guido at python.org>
> wrote:
>
>
> >I think it's fine to add something to stdlib that encapsulates your
> >example. (TBD: where?)
>
> Good question about the where.
>
> The resplit function seems like it could be of more general use than just
> this case, but I'm not sure where it belongs. Maybe itertools?
>
> The iter(lambda: f.read(bufsize), b'') part seems too trivial to put
> anywhere, even just as an example in the docs—but given that it probably
> looks like a magic incantation to anyone who's a Python novice (even if
> they're a C or JS or whatever expert), maybe it is worth putting somewhere.
> Maybe io.iterchunks(f, 4096)?
>
> If so, the combination of the two into something like iterlines(f, b'\0')
> seems like it should go right alongside iterchunks.
>
>
> However…
>
>
> >I don't think it is reasonable to add a new parameter to readline()
>
> The problem is that my code has significant problems for many use cases,
> and I don't think they can be solved.
>
> Calling readline (or iterating the file) uses the underlying buffer (and
> stream decoder, for text files), keeps the file pointer in the same place,
> etc. My code doesn't, and no external code can. So, besides being less
> efficient, it leaves the file pointer in the wrong place (imagine using it
> to parse an RFC822 header then read() the body), doesn't properly decode
> files where the separator can be ambiguous with other bytes (try separating
> on '\0' in a UTF-16 file), etc.
>
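
For readers who have not seen the code under discussion, here is a minimal
sketch of the three helpers named above. It is a reconstruction under
assumptions, not Andrew's original code: the names iterchunks, resplit and
iterlines come from the message, but the bodies and the bufsize default are
illustrative. The last lines show the file-pointer problem: after pulling a
single record, the underlying file is already positioned past data the
caller has not seen.

```python
import io
from functools import partial


def iterchunks(f, bufsize=4096):
    """Yield fixed-size chunks from a binary file until EOF.

    Equivalent to the iter(lambda: f.read(bufsize), b'') idiom.
    """
    return iter(partial(f.read, bufsize), b'')


def resplit(chunks, sep):
    """Re-split an iterable of bytes chunks on sep (separator is dropped)."""
    buf = b''
    for chunk in chunks:
        buf += chunk
        # Everything except the last piece is a complete record; the last
        # piece may be incomplete, so carry it over to the next chunk.
        *records, buf = buf.split(sep)
        yield from records
    if buf:
        yield buf


def iterlines(f, sep, bufsize=4096):
    """Iterate over sep-separated records of a binary file."""
    return resplit(iterchunks(f, bufsize), sep)


# Hypothetical NUL-separated "headers", an empty record, then a body.
f = io.BytesIO(b'From: a\0To: b\0\0body bytes')
first = next(iterlines(f, b'\0', bufsize=16))
position = f.tell()  # 16, although the first record ended at offset 8
```

Calling f.read() at this point would return only the tail of the body,
which is the "wrong place" problem described in the quoted text.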

You can implement a subclass of io.BufferedIOBase that wraps an instance of
io.RawIOBase (I think those are the right classes) where the wrapper adds a
readuntil(separator) method. Whichever thing then wants to read the rest of
the data should call read() on the wrapper object.
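
One way this suggestion might look in practice is sketched below. It is an
illustration under assumptions, not an existing stdlib API: the class name
SeparatorReader is made up, readuntil() is the hypothetical method named
above, and the sketch handles only single-byte separators to keep the
buffer logic short. It subclasses io.BufferedReader (itself a
BufferedIOBase) and uses peek() so that nothing past the separator is
consumed.

```python
import io
import os
import tempfile


class SeparatorReader(io.BufferedReader):
    """Buffered reader with a hypothetical readuntil(sep) method."""

    def readuntil(self, sep=b'\n'):
        """Read up to and including sep; return b'' at EOF."""
        if len(sep) != 1:
            raise ValueError('this sketch supports single-byte separators only')
        parts = []
        while True:
            avail = self.peek(1)  # buffered bytes; position is not advanced
            if not avail:         # EOF
                break
            i = avail.find(sep)
            if i >= 0:
                parts.append(self.read(i + 1))  # consume through the separator
                break
            parts.append(self.read(len(avail)))  # consume what we inspected
        return b''.join(parts)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'msg.bin')
    with open(path, 'wb') as out:
        out.write(b'From: a\0To: b\0\0body bytes')

    # buffering=0 gives a raw FileIO object (an io.RawIOBase) to wrap.
    with SeparatorReader(open(path, 'rb', buffering=0)) as f:
        headers = []
        while True:
            record = f.readuntil(b'\0')
            if record in (b'\0', b''):  # empty record or EOF ends the headers
                break
            headers.append(record)
        body = f.read()  # read the rest through the same wrapper
```

Because the headers and the body are read through the same wrapper, the
body read starts exactly where the last readuntil() stopped.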

This still sounds a lot better to me than asking everyone to add a new
parameter to their readline() (and the implementation).

> Maybe if we had more powerful adapters or wrappers so I could just say
> "here's a pre-existing buffer plus a text-file-like object, now wrap that
> up as a real TextIOBase for me" it would be possible to write something
> that worked from outside without these problems, but as things stand, I
> don't see an answer.
>

You probably have to do a separate wrapper for text streams; the types and
buffering implementations are just too different.
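
A small illustration of why the byte-level approach does not carry over to
text, using the UTF-16 case mentioned in the quoted message. The sample
string is made up for the example. Splitting the encoded bytes on a NUL
byte cuts through the middle of code units, while splitting after decoding
gives the intended records.

```python
text = 'ab\0cd'
data = text.encode('utf-16-le')  # every ASCII character gains a 0x00 byte

# Byte-level split: the NUL bytes inside 'a', 'b', 'c', 'd' also match.
byte_pieces = data.split(b'\0')

# Text-level split: decode first, then split on the NUL character.
text_pieces = data.decode('utf-16-le').split('\0')
```

Here text_pieces holds the two intended records, while byte_pieces holds
seven fragments that no longer decode as UTF-16.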

> Maybe put resplit in the stdlib, then just give iterlines as a 2-liner
> example (in the itertools recipes, or the file-I/O section of the
> tutorial?) where all these problems can be raised and not answered?
>

(Sorry, in a hurry / terribly distracted.)

-- 
--Guido van Rossum (python.org/~guido)