[Python-3000] Draft PEP for New IO system

Giovanni Bajo rasky at develer.com
Wed Feb 28 09:20:01 CET 2007


[reposting since the first time it didn't get through...]


On 26/02/2007 22.35, Mike Verdone wrote:

 > Daniel Stutzbach and I have prepared a draft PEP for the new IO system
 > for Python 3000. This document is, hopefully, true to the info that
 > Guido wrote on the whiteboards here at PyCon. This is still a draft
 > and there's quite a few decisions that need to be made. Feedback is
 > welcomed.

Thanks for this!


 > Raw I/O
 > The abstract base class for raw I/O is RawIOBase.  It has several
 > methods which are wrappers around the appropriate operating system
 > call.  If one of these functions would not make sense on the object,
 > the implementation must raise an IOError exception.  For example, if a
 > file is opened read-only, the .write() method will raise an IOError.
 > As another example, if the object represents a socket, then .seek(),
 > .tell(), and .truncate() will raise an IOError.
 >
 >    .read(n: int) -> bytes
 >    .readinto(b: bytes) -> int
 >    .write(b: bytes) -> int

What are the requirements here?

- Can read()/readinto() return *fewer* bytes than specified?
- Can read() return a 0-sized bytes object (= no data available)?
- Can read() return *more* bytes than specified (think of a datagram socket or 
a decompressing stream)?
- Can readinto() read *fewer* bytes than specified?
- Can readinto() read zero bytes?
- Should read()/readinto() raise EOFError?
- Can write() write fewer bytes than specified?
- Can write() write zero bytes?

Please, see also the examples at the end of the mail before providing an answer :)
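
For instance, if read() is allowed to return fewer bytes than requested, every 
layer above the raw object has to loop. A rough sketch of what that would force 
on callers (read_exactly and its details are just an illustration, not part of 
the draft):

def read_exactly(raw, n):
     # Hypothetical helper: what a caller has to do if RawIOBase.read(n)
     # may return fewer than n bytes. End-of-stream is signalled here by
     # an empty return; it could also be EOFError, which is exactly one
     # of the questions above.
     chunks = []
     remaining = n
     while remaining > 0:
         chunk = raw.read(remaining)
         if not chunk:
             break
         chunks.append(chunk)
         remaining -= len(chunk)
     return b"".join(chunks)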

 >    .seek(pos: int, whence: int = 0) -> None
 >    .tell() -> int
 >    .truncate(n: int = None) -> None
 >    .close() -> None

Why should this very low-level basic type define *two* read methods? Assuming 
that readinto() is the more primitive one, can the ABC RawIOBase provide a 
default read() method implemented in terms of readinto()?

Consider providing more ABCs/mixins to help implementations. 
ReadIOBase/WriteIOBase are pretty obvious:

class RawIOBase:
     def readable(self): return False
     def writeable(self): return False
     def seekable(self): return False

     def read(self,n): raise IOError
     def readinto(self,b): raise IOError
     def write(self,b): raise IOError
     def seek(self,pos,wh): raise IOError
     def tell(self): raise IOError
     def truncate(self,n=None): raise IOError


class ReadIOBase(RawIOBase):
     def readable(self): return True
     def read(self, n):
         b = bytes(n)  # or whatever the mutable buffer type ends up being
         got = self.readinto(b)
         return b[:got]


class MySpecialReader(ReadIOBase):
     def readinto(self, b):
         # ....
         # must implement only this and nothing else

class MySpecialReaderWriter(ReadIOBase, WriteIOBase):
     def readinto(self, b):
         # ....
     def write(self, b):
         # ....


 >     (should these "is_" functions be attributes instead?
 > "file.readable == True")

Yes, I think readable/writeable/seekable/fileno are a *perfect* fit for 
attributes/properties: each one yields a value with no side effects, and the 
value can be computed without any O(n)-style work.
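
For example, the property form might be spelled like this (just a sketch of the 
spelling, nothing more):

class RawIOBase:
     @property
     def readable(self):
         # as a property: callers write "f.readable" instead of "f.readable()"
         return False

     @property
     def seekable(self):
         return False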


 > Buffered I/O
 > The next layer is the Buffer I/O layer which provides more efficient
 > access to file-like objects. The abstract base class for all Buffered

I think you probably want the buffer size to be optionally specified by the 
user, for the four standard implementations.
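
Something along these lines, say (the parameter name and default are just 
placeholders of mine):

class BufferedReader:
     # sketch only: the point is the optional, user-tunable buffer size
     def __init__(self, raw, buffer_size=8192):
         self.raw = raw
         self.buffer_size = buffer_size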

 > Q: Do we want to mandate in the specification that switching between
 > reading to writing on a read-write object implies a .flush()?  Or is
 > that an implementation convenience that users should not rely on?

I'd be glad if calling flush() weren't a requirement for users of the class; it 
always strikes me as an abstraction leak.
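
In other words, I'd rather the buffered object handled the switch itself, along 
these lines (all names here are invented):

class AutoFlushingBuffer:
     # Sketch: the buffered object flushes its own write buffer when the
     # caller switches to reading, so user code never needs to call flush().
     def __init__(self, raw):
         self.raw = raw
         self._write_buf = bytes()

     def write(self, b):
         self._write_buf += b
         return len(b)

     def flush(self):
         if self._write_buf:
             self.raw.write(self._write_buf)
             self._write_buf = bytes()

     def read(self, n):
         self.flush()   # the write-to-read switch happens here, invisibly
         return self.raw.read(n)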

 > TextIOBase class implementations additionally provide the following methods:
 >
 >     .readline(self)
 >
 >        Read until newline or EOF and return the line.
 >
 >     .readlinesiter()
 >
 >        Returns an iterator that returns lines from the file (which
 > happens to be 'self').
 >
 >     .next()
 >
 >        Same as readline()
 >
 >     .__iter__()
 >
 >        Same as readlinesiter()

Not sure why you need "readlinesiter()" at all. I thought Py3k was disposing of 
most of the "fooiter()" functions (thinking of dicts...).


 > Another way to do it is as follows (we should pick one or the other):
 >
 >     .__init__(self, buffer, encoding=None, newline=None)

I think this is clearer. I can't find a good real-world use case that would 
require the two-parameter version.
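
Hypothetical usage of the keyword form (where "buffered" stands for whatever 
BufferedReader/BufferedWriter object you already have):

f = TextIOWrapper(buffered, encoding="utf-8", newline="\n")
for line in f:
     print(line)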

==========================================================================

Now for a real example. Let's say I'm given a readable RawIOBase object, and 
I'm told that it's a foobar-compressed UTF-8 text file. I have this API available:

     class Foobar:
        # initialize decompressor
        __init__()

        # feed compressed bytes and get uncompressed bytes.
        # The uncompressed data can be smaller, equal or larger
        # than the compressed data
        decompress(bytes) -> bytes

        # finish decompression and get tail
        flush() -> bytes


This is basically how zlib's decompressobj (decompress()/flush()) works. I would 
like to wrap the readable RawIOBase object so that I end up with a textual 
file-like object offering readline() etc.

This is pretty hard to do with the current I/O library (you need to write a 
lot of code). It'd be good if the new I/O library made it easier to achieve.
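
As an aside, a stand-in with exactly this shape can be built on top of 
zlib.decompressobj (assuming the real foobar codec exposes the same two calls), 
just to have something concrete to play with:

import zlib

class Foobar:
     # stand-in decompressor with the decompress()/flush() API sketched above
     def __init__(self):
         self._d = zlib.decompressobj()

     def decompress(self, data):
         return self._d.decompress(data)

     def flush(self):
         return self._d.flush()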

Let's see. I start with a raw I/O reader:

class FoobarRaw(RawIOBase):
     def __init__(self, raw):
         self.raw = raw
         self._d = Foobar()
          self._buf = bytes()  # assumes the draft's mutable bytes (the del below relies on it)

     def readable(self):
         return True

     # I assume RawIOBase.read() must return the
     #   exact number of bytes (unless at the end).
     # I assume RawIOBase.read() raises EOFError when done
     # I assume readinto() does not exist...
     def read(self, n):
         try:
             while len(self._buf) < n:
                 b = self.raw.read(n)
                 self._buf += self._d.decompress(b)
         except EOFError:
             self._buf += self._d.flush()

         d = self._buf[:n]
         del self._buf[:n]
         if not d:
             raise EOFError
         return d

and complete the job:

def foobar_open(raw):
     return TextIOWrapper(BufferedReader(FoobarRaw(raw)), encoding="utf-8")

for L in foobar_open(sock):
     print(L)


Uhm, looks great!

==========================================================================

Now, it might be interesting to play with the different semantics for 
RawIOBase.read() that I proposed above, and see how the implementation of 
FoobarRaw.read() changes.

For instance (now being radical): why don't we drop the "n" argument 
altogether? We could just define it like this:

     # Returns a block of data, whose size is implementation-defined
     # and may vary between calls. It never returns a zero-sized block.
     # Raises EOFError when done.
     read() -> bytes

After all, there's a BufferedIO layer to handle buffering and exact-size 
reads/writes. If we go this way, the above example is even easier:

     def read(self):
         try:
            b = self.raw.read() # any size!
            return self._d.decompress(b)
         except EOFError:
            b = self._d.flush()
            if not b:
               raise EOFError
            return b

It would also work well for sockets, since they would return exactly the chunk 
of data that arrived from the network, and simply block (at most once) if no 
data is available.
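
For example, a socket wrapper could then be as simple as this (sketch; 
SocketRaw and the recv() size are my own choices, and sock is any connected 
socket object):

class SocketRaw(RawIOBase):
     def __init__(self, sock):
         self.sock = sock

     def readable(self):
         return True

     def read(self):
         # return whatever one recv() delivers; block at most once
         data = self.sock.recv(65536)
         if not data:
             raise EOFError
         return data
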
-- 
Giovanni Bajo


