[Python-ideas] Iterating non-newline-separated files should be easier

Fri Jul 18 06:40:11 CEST 2014

On Jul 17, 2014, at 15:37, Guido van Rossum <guido at python.org> wrote:

> On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert <abarnert at yahoo.com.dmarc.invalid> wrote:
>> >I don't think it is reasonable to add a new parameter to readline()
>> 
>> The problem is that my code has significant problems for many use cases, and I don't think they can be solved.
>> 
>> Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc.
> 
> You can implement a subclass of io.BufferedIOBase that wraps an instance of io.RawIOBase (I think those are the right classes) where the wrapper adds a readuntil(separator) method. Whichever thing then wants to read the rest of the data should call read() on the wrapper object.
> 
> This still sounds a lot better to me than asking everyone to add a new parameter to their readline() (and the implementation).

[snip]

> You probably have to do a separate wrapper for text streams, the types and buffering implementation are just too different.

The problem isn't needing two separate wrappers; it's that the text wrapper is effectively impossible.

For binary files, MyBufferedReader.readuntil is a slightly modified version of _pyio.RawIOBase.readline, which only needs to access the public interface of io.BufferedReader (peek and read).
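Something along these lines shows the idea. This is only a sketch, not the code posted earlier in the thread: the class name MyBufferedReader comes from the paragraph above, but the readuntil signature and the handling of multi-byte separators are assumptions made for illustration. It uses nothing beyond the public peek() and read() methods of io.BufferedReader.

```python
import io


class MyBufferedReader(io.BufferedReader):
    """Illustrative sketch: BufferedReader plus a hypothetical readuntil()."""

    def readuntil(self, separator=b"\n"):
        """Read up to and including separator, or to EOF if it never appears."""
        if not separator:
            raise ValueError("separator must be non-empty")
        seplen = len(separator)
        result = bytearray()
        while True:
            # peek() returns already-buffered bytes without consuming them.
            window = self.peek(1)
            if not window:
                break  # EOF
            # Keep the last seplen-1 bytes already read, so a separator
            # that straddles two buffer refills is still found.
            tail = result[max(0, len(result) - (seplen - 1)):]
            idx = (tail + window).find(separator)
            if idx >= 0:
                # Consume only the bytes up to the end of the separator.
                result += self.read(idx + seplen - len(tail))
                break
            result += self.read(len(window))
        return bytes(result)


f = MyBufferedReader(io.BytesIO(b"Header: x\0body bytes"))
header = f.readuntil(b"\0")
body = f.read()  # continues exactly where readuntil() stopped
```

Because only the wrapper's own buffer is consumed, a later read() on the same wrapper picks up right after the separator, which is the RFC822-style use case.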

For text files, however, it needs to access private information from TextIOWrapper that isn't exposed from C to Python. And, unlike BufferedReader, TextIOWrapper has no way to peek ahead, or push data back onto the buffer, or anything else usable as a workaround, so even if you wanted to try to take care of the decoding state problems manually, you can't, except by reading one character at a time.
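For comparison, here is roughly what the one-character-at-a-time workaround looks like. It is a sketch only: readuntil_text is a hypothetical helper name, not anything in the io module. It stays correct because every character goes through TextIOWrapper's own decoder, but it pays for one read(1) call per character.

```python
import io


def readuntil_text(f, separator="\n"):
    """Hypothetical helper: read from text file f through separator or EOF."""
    if not separator:
        raise ValueError("separator must be non-empty")
    seplen = len(separator)
    chars = []
    while True:
        ch = f.read(1)  # one decoded character; the decoder state stays in f
        if not ch:
            break  # EOF
        chars.append(ch)
        # Only compare when the last character could end the separator.
        if ch == separator[-1] and "".join(chars[-seplen:]) == separator:
            break
    return "".join(chars)


# A UTF-16 file separated on NUL characters, where a byte-level search
# for b"\0" would hit the padding bytes inside ordinary characters.
data = "a\0bc\0d".encode("utf-16")
f = io.TextIOWrapper(io.BytesIO(data), encoding="utf-16", newline="")
first = readuntil_text(f, "\0")
second = readuntil_text(f, "\0")
rest = f.read()
```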

There are also some minor problems even for binary files (e.g., MyBufferedReader(f.raw) has a different file position from f, so if you switch between them you'll end up skipping part of the file), but these won't affect most use cases; the text file problem is the big one.
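A small demonstration of the file position problem, using a plain io.BufferedReader in place of MyBufferedReader and an in-memory io.BytesIO standing in for the raw file (both are substitutions for illustration):

```python
import io

raw = io.BytesIO(b"line one\nline two\n")
f = io.BufferedReader(raw)

first = f.readline()    # f fills its buffer from raw, reading ahead
logical_pos = f.tell()  # where f thinks it is
raw_pos = raw.tell()    # where the underlying raw stream actually is

# A second wrapper over the same raw stream starts from the raw position,
# so everything f had buffered but not yet returned is skipped.
g = io.BufferedReader(f.raw)
second = g.readline()
```

Here f has only handed back the first line, but the raw stream is already at the end of the data, so the second wrapper never sees "line two".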