[Python-ideas] Iterating non-newline-separated files should be easier

Thu Jul 17 22:48:28 CEST 2014

I think it's fine to add something to stdlib that encapsulates your
example. (TBD: where?)

I don't think it is reasonable to add a new parameter to readline(),
because streams are widely implemented using duck typing -- every
implementation would have to be updated to support this.
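
To make that concrete, here is a minimal sketch of what such a standalone helper could look like. The name iter_separated and its chunk_size parameter are hypothetical, not an existing stdlib API. It only assumes the stream has read(size), so duck-typed file-like objects work unchanged, and it accepts either bytes or str depending on the type of sep.

```python
def iter_separated(stream, sep, chunk_size=4096):
    """Yield sep-delimited records from any object with read(size).

    Hypothetical helper, not an existing stdlib function. Works for
    binary streams (sep is bytes) and text streams (sep is str).
    """
    if not sep:
        raise ValueError("empty separator")
    buf = sep[:0]  # empty bytes or str, matching the separator's type
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        parts = (buf + chunk).split(sep)
        # Every part but the last is a complete record; the last may be
        # cut off mid-record, so hold it back until more data arrives.
        yield from parts[:-1]
        buf = parts[-1]
    if buf:
        yield buf  # final record with no trailing separator
```

Used as "for name in iter_separated(f, b'\0'): ...", this keeps the original loop shape. It still reads ahead of the record being yielded, so it does not address the file-pointer concern in the quoted message.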

On Thu, Jul 17, 2014 at 12:53 PM, Andrew Barnert <
abarnert at yahoo.com.dmarc.invalid> wrote:

> tl;dr: readline and friends should take an optional sep parameter (which
> also means adding an iterlines method).
>
> Recently, I was trying to add -0 support to a command-line tool, which
> means that it reads filenames out of stdin and/or a text file with \0
> separators instead of \n.
>
> This means that my code that looked like this:
>
>     with open(path, encoding=sys.getfilesystemencoding()) as f:
>         for filename in f:
>             do_stuff(filename)
>
> … turned into this (from memory, not the exact code):
>
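>     def resplit(chunks, sep):
>         buf = b''  # tail of the previous chunk, possibly a partial record
>         for chunk in chunks:
>             parts = (buf+chunk).split(sep)
>             # All but the last part are complete; the last may be cut short.
>             yield from parts[:-1]
>             buf = parts[-1]
>         if buf:
>             yield buf  # final record with no trailing separator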
>
>     with open(path, 'rb') as f:
>         chunks = iter(lambda: f.read(4096), b'')
>         for line in resplit(chunks, b'\0'):
>             filename = line.decode(sys.getfilesystemencoding())
>             do_stuff(filename)
>
> Besides being a lot more code (and involving things that a novice might
> have problems reading, like that two-argument iter), this also means that
> the file pointer is way ahead of the line that's just been iterated, that
> I'm inefficiently buffering everything twice, etc.
>
> The problem is that readline is hardcoded to look for b'\n' in binary
> files and to do its smart-universal-newline-thingy in text files; there's
> no way to reuse its machinery if you want to look for something different,
> and there's no way to access the internals that it uses if you want to
> reimplement it.
>
> While it might be possible to fix the latter problems in some generic and
> flexible way, that doesn't seem all that useful; really, other than
> changing the way readline splits, I don't think anyone wants to hook
> anything else about file objects. (On the other hand, people might want to
> hook it in more complex ways—e.g., pass a separator function instead of a
> separator string? I'm probably reaching there…)
>
> If I'm right, all that's needed is an extra sep=None keyword-only
> parameter to readline and friends (where None means the existing newline
> behavior), along with an iterlines method that's identical to __iter__
> except that it has room for that new parameter.
>
> One minor side problem: Sometimes you don't actually have a file, but some
> kind of file-like object. I realize that as of 3.1 or so, this is supposed
> to mean it actually is an io.BufferedIOBase or the like, but there are still
> plenty of third-party modules that just demand and/or provide "something
> with read(size)" or the like. In fact, that's the case with the problem I
> ran into above; another feature uses a third-party module to provide
> file-like objects for members of all kinds of uncommon archive types, and
> unlike zipfile, that module wasn't changed to provide io subclasses when it
> was ported to 3.x. So, it might be worth having adapters that make it
> easier (or just possible…) to wrap such a thing in the actual io
> interfaces. (The existing wrappers aren't adapters—BufferedReader demands
> readinto(buf), not read(size); TextIOWrapper can only wrap a
> BufferedIOBase.) But that's really a separate issue (and the answer to that
> one may just be to hold firm with "file-like object means IOBase", so that
> eventually every library you care about works that way, even if you
> occasionally have to fix it yourself).

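On the adapter point in the quoted message: one way to bridge a "something with read(size)" object into the io hierarchy is a small raw-stream shim. This is only a sketch, and the class name ReadAdapter is hypothetical. It assumes the wrapped object's read(size) returns bytes, never more than size of them, and returns b'' at end of data.

```python
import io


class ReadAdapter(io.RawIOBase):
    """Expose an object that only has read(size) as a raw io stream.

    Hypothetical adapter, not part of the stdlib.
    """

    def __init__(self, obj):
        super().__init__()
        self._obj = obj

    def readable(self):
        return True

    def readinto(self, b):
        # BufferedReader calls readinto() with a writable buffer; fill it
        # from the wrapped object's read(size) and report the byte count.
        data = self._obj.read(len(b))
        n = len(data)
        b[:n] = data
        return n  # 0 signals end of stream
```

With that in place, io.BufferedReader(ReadAdapter(obj)) gives a buffered binary stream, and io.TextIOWrapper can be layered on top of it for text. The result is not seekable, since the shim does not implement seek.
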
-- 
--Guido van Rossum (python.org/~guido)