[Python-3000] Draft PEP for New IO system

Thu Mar 1 00:12:07 CET 2007

I just uploaded Patch 1671314 to SourceForge with a C implementation
of a Raw File I/O type, along with unit tests.  It still needs work
(especially for supporting very large files and non-unixy systems),
but should serve as a very good starting point.

On 2/26/07, Mike Verdone <mike.verdone at gmail.com> wrote:
> Hi all,
>
> Daniel Stutzbach and I have prepared a draft PEP for the new IO system
> for Python 3000. This document is, hopefully, true to the info that
> Guido wrote on the whiteboards here at PyCon. This is still a draft
> and there are quite a few decisions that need to be made. Feedback is
> welcomed.
>
> We've published it on Google Docs here:
> http://docs.google.com/Doc?id=dfksfvqd_1cn5g5m
>
> What follows is a plaintext version.
>
> Thanks,
>
> Mike.
>
>
> PEP: XXX
> Title: New IO
> Version:
> Last-Modified:
> Authors: Daniel Stutzbach, Mike Verdone
> Status: Draft
> Type:
> Created: 26-Feb-2007
>
> Rationale and Goals
> Python allows for a variety of file-like objects that can be worked
> with via bare read() and write() calls using duck typing. Anything
> that provides read() and write() is stream-like. However, more exotic
> and extremely useful functions like readline() or seek() may or may
> not be available on a file-like object. Python needs a specification
> for basic byte-based IO streams to which we can add buffering and
> text-handling features.
>
> Once we have a defined raw byte-based IO interface, we can add
> buffering and text-handling layers on top of any byte-based IO class.
> The same buffering and text handling logic can be used for files,
> sockets, byte arrays, or custom IO classes developed by Python
> programmers. Developing a standard definition of a stream lets us
> separate stream-based operations like read() and write() from
> implementation specific operations like fileno() and isatty(). It
> encourages programmers to write code that uses streams as streams and
> not require that all streams support file-specific or socket-specific
> operations.
>
> The new IO spec is intended to be similar to the Java IO libraries,
> but generally less confusing. Programmers who don't want to muck about
> in the new IO world can expect that the open() factory method will
> produce an object backwards-compatible with old-style file objects.
>
> Specification
> The Python I/O Library will consist of three layers: a raw I/O layer,
> a buffered I/O layer, and a text I/O layer.  Each layer is defined by an
> abstract base class, which may have multiple implementations.  The raw
> I/O and buffered I/O layers deal with units of bytes, while the text I/O
> layer deals with units of characters.
>
> Raw I/O
> The abstract base class for raw I/O is RawIOBase.  It has several
> methods which are wrappers around the appropriate operating system
> call.  If one of these functions would not make sense on the object,
> the implementation must raise an IOError exception.  For example, if a
> file is opened read-only, the .write() method will raise an IOError.
> As another example, if the object represents a socket, then .seek(),
> .tell(), and .truncate() will raise an IOError.
>
>     .read()
>     .write()
>     .seek()
>     .tell()
>     .truncate()
>     .close()
>
> Additionally, it defines a few other methods:
>
>     (should these "is_" functions be attributes instead?
> "file.readable == True")
>
>     .is_readable()
>
>        Returns True if the object was opened for reading, False
> otherwise.  If False, .read() will raise an IOError if called.
>
>     .is_writable()
>
>        Returns True if the object was opened for writing, False
> otherwise.  If False, .write() and .truncate() will raise an IOError
> if called.
>
>     .is_seekable()  (Should this be called .is_random()?  or
> .is_sequential() with opposite return values?)
>
>        Returns True if the object supports random-access (such as disk
> files), or False if the object only supports sequential access (such
> as sockets, pipes, and ttys).  If False, .seek(), .tell(), and
> .truncate() will raise an IOError if called.
>
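
To make the raw-layer contract above concrete, here is a minimal sketch of an
object that follows it.  SketchRawIO and its constructor arguments are
hypothetical names invented for illustration; the draft does not define them,
and a real RawIOBase implementation would wrap operating system calls rather
than a bytearray.

```python
class SketchRawIO:
    """Hypothetical in-memory raw object (not the PEP's RawIOBase)."""

    def __init__(self, data=b"", readable=True, writable=False):
        self._data = bytearray(data)
        self._pos = 0
        self._readable = readable
        self._writable = writable

    def is_readable(self):
        return self._readable

    def is_writable(self):
        return self._writable

    def is_seekable(self):
        # An in-memory buffer supports random access.
        return True

    def read(self, n):
        # Unsupported operations raise IOError, as the draft requires.
        if not self._readable:
            raise IOError("not opened for reading")
        chunk = bytes(self._data[self._pos:self._pos + n])
        self._pos += len(chunk)
        return chunk

    def write(self, b):
        if not self._writable:
            raise IOError("not opened for writing")
        end = self._pos + len(b)
        # Overwrite in place, extending the buffer when writing at the end.
        self._data[self._pos:end] = b
        self._pos = end
        return len(b)
```

A caller can check is_writable() before calling .write(), or simply call it
and handle the IOError.
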
> Iff a RawIOBase implementation operates on an underlying file
> descriptor, it must additionally provide a .fileno() member function.
> This could be defined specifically by the implementation, or a mix-in
> class could be used (Need to decide about this).
>
>     .fileno()
>
>        Returns the underlying file descriptor (an integer)
>
> Initially, three implementations will be provided that implement the
> RawIOBase interface: FileIO, SocketIO, and ByteIO (also MMapIO?).
> Each implementation must determine whether the object supports random
> access as the information provided by the user may not be sufficient
> (consider open("/dev/tty", "rw") or open("/tmp/named-pipe", "rw")).  As
> an example, FileIO can determine this by calling the lseek() system
> call; if it returns an error, the object does not support random
> access.  Each implementation may provide additional methods
> appropriate to its type.  The ByteIO object is analogous to Python 2's
> cStringIO library, but operating on the new bytes type instead of
> strings.
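
The random-access probe described above can be sketched with the os module.
This is only an illustration under the assumption of a POSIX-like system;
fd_is_seekable and demo are hypothetical helper names, not part of the draft
or of the patch.

```python
import os
import tempfile


def fd_is_seekable(fd):
    """Return True if the OS accepts a seek on fd, False if it rejects it."""
    try:
        # Ask for the current position without moving it.
        os.lseek(fd, 0, os.SEEK_CUR)
    except OSError:
        return False
    return True


def demo():
    """Probe a regular file and a pipe; returns (file_result, pipe_result)."""
    with tempfile.TemporaryFile() as f:
        file_result = fd_is_seekable(f.fileno())
    read_fd, write_fd = os.pipe()
    try:
        pipe_result = fd_is_seekable(read_fd)
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return file_result, pipe_result
```

On such a system the regular file should be reported as seekable and the pipe
as sequential-only.
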
>
> Buffered I/O
> The next layer is the Buffered I/O layer which provides more efficient
> access to file-like objects.  The abstract base class for all Buffered
> I/O implementations is BufferedIOBase, which provides similar methods
> to RawIOBase:
>
>     .read()
>     .write()
>     .seek()
>     .tell()
>     .truncate()
>     .close()
>     .is_readable()
>     .is_writable()
>     .is_seekable()
>
> Additionally, the abstract base class provides one member variable:
>
>     .raw
>
>        Provides a reference to the underlying RawIOBase object.
>
> The BufferedIOBase methods' syntax is identical to that of RawIOBase,
> but may have different semantics.  In particular, BufferedIOBase
> implementations may read more data than requested or delay writing
> data using buffers.  For the most part, this will be transparent to
> the user (unless, for example, they open the same file through a
> different descriptor).
>
> There are four implementations of the BufferedIOBase abstract base
> class, described below.
>
> BufferedReader
> The BufferedReader implementation is for sequential-access read-only
> objects.  It does not provide a .flush() method, since there is no
> sensible circumstance where the user would want to discard the read
> buffer.
>
> BufferedWriter
> The BufferedWriter implementation is for sequential-access write-only
> objects.  It provides a .flush() method, which forces all cached data
> to be written to the underlying RawIOBase object.
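
As a rough sketch of the write-side buffering described above, assuming a raw
object whose .write() always accepts the whole byte string: SketchBufferedWriter
and its buffer_size parameter are hypothetical names, not the draft's
BufferedWriter.

```python
class SketchBufferedWriter:
    """Hypothetical write buffer; `raw` only needs a write(bytes) method."""

    def __init__(self, raw, buffer_size=8192):
        self.raw = raw
        self._buffer_size = buffer_size
        self._pending = bytearray()

    def write(self, b):
        # Delay the real write until enough data has accumulated.
        self._pending += b
        if len(self._pending) >= self._buffer_size:
            self.flush()
        return len(b)

    def flush(self):
        # Assumption: raw.write() takes everything; a real raw write may be
        # partial and would need a retry loop.
        if self._pending:
            self.raw.write(bytes(self._pending))
            self._pending.clear()
```

Small writes stay in memory until the threshold is reached or .flush() is
called explicitly.
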
>
> BufferedRWPair
> The BufferedRWPair implementation is for sequential-access read-write
> objects such as sockets and ttys.  As the read and write streams of
> these objects are completely independent, it could be implemented by
> simply incorporating a BufferedReader and BufferedWriter instance.  It
> provides a .flush() method that has the same semantics as a
> BufferedWriter's .flush() method.
>
> BufferedRandom
> The BufferedRandom implementation is for all random-access objects,
> whether they are read-only, write-only, or read-write.  Compared to
> the previous classes that operate on sequential-access objects, the
> BufferedRandom class must contend with the user calling .seek() to
> reposition the stream.  Therefore, an instance of BufferedRandom must
> keep track of both the logical and true position within the object.
> It provides a .flush() method that forces all cached write data to be
> written to the underlying RawIOBase object and all cached read data to
> be forgotten (so that future reads are forced to go back to the disk).
>
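
The distinction between the logical and the true position can be sketched for
the read side only.  SketchBufferedRandomReader is a hypothetical name, the
raw object is assumed to provide .seek(offset) and .read(n), and io.BytesIO
from today's standard library merely stands in for a raw file here.

```python
import io


class SketchBufferedRandomReader:
    """Hypothetical read side of a random-access buffer."""

    def __init__(self, raw, buffer_size=4096):
        self.raw = raw
        self._buffer_size = buffer_size
        self._buf = b""       # bytes read ahead from raw
        self._buf_start = 0   # raw offset of self._buf[0]
        self._pos = 0         # logical position seen by the caller

    def tell(self):
        # Report the logical position, not the raw object's position.
        return self._pos

    def seek(self, offset):
        # Absolute seeks only, for brevity; no raw call is needed yet.
        self._pos = offset

    def read(self, n):
        buf_end = self._buf_start + len(self._buf)
        if not (self._buf_start <= self._pos < buf_end):
            # Logical position is outside the buffer: refill from raw.
            self.raw.seek(self._pos)
            self._buf = self.raw.read(max(n, self._buffer_size))
            self._buf_start = self._pos
        start = self._pos - self._buf_start
        # May return fewer than n bytes when the buffer runs out.
        chunk = self._buf[start:start + n]
        self._pos += len(chunk)
        return chunk

    def flush(self):
        # Forget cached read data so the next read goes back to raw.
        self._buf = b""
        self._buf_start = self._pos


raw = io.BytesIO(b"0123456789")
buffered = SketchBufferedRandomReader(raw, buffer_size=8)
first = buffered.read(2)
```

After the single two-byte read, the caller's position and the raw object's
position differ because the buffer has read ahead.
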
> Q: Do we want to mandate in the specification that switching between
> reading and writing on a read-write object implies a .flush()?  Or is
> that an implementation convenience that users should not rely on?
>
> For a read-only BufferedRandom object, .is_writable() returns False and
> the .write() and .truncate() methods raise IOError.
>
> For a write-only BufferedRandom object, .is_readable() returns False and
> the .read() method raises IOError.
>
> Text I/O
> The text I/O layer provides functions to read and write strings from
> streams. Some new features include universal newlines and character
> set encoding and decoding.  The Text I/O layer is defined by a
> TextIOBase abstract base class.  It provides several methods that are
> similar to the BufferedIOBase methods, but operate on a per-character
> basis instead of a per-byte basis.  These methods are:
>
>     .read()
>     .write()
>     .seek()
>     .tell()
>     .truncate()
>
> TextIOBase implementations also provide several methods that are
> pass-throughs to the underlying BufferedIOBase object:
>
>     .close()
>     .is_readable()
>     .is_writable()
>     .is_seekable()
>
> TextIOBase class implementations additionally provide the following methods:
>
>     .readline()
>
>        Read until newline or EOF and return the line.
>
>     .readlinesiter()
>
>        Returns an iterator that returns lines from the file (which
> happens to be 'self').
>
>     .next()
>
>        Same as readline()
>
>     .__iter__()
>
>        Same as readlinesiter()
>
>     .__enter__()
>
>        Context management protocol. Returns self.
>
>     .__exit__()
>
>        Context management protocol. No-op.
>
> Two implementations will be provided by the Python library.  The
> primary implementation, TextIOWrapper, wraps a Buffered I/O object.
> Each TextIOWrapper object has a property named ".buffer" that provides
> a reference to the underlying BufferedIOBase object.  Its initializer
> has the following signature:
>
>     .__init__(self, buffer, encoding=None, universal_newlines=True, crlf=None)
>
>        "Buffer" is a reference to the BufferedIOBase object to be wrapped
> with the TextIOWrapper.  "Encoding" refers to an encoding to be used
> for translating between the byte-representation and
> character-representation.  If "None", then the system's locale setting
> will be used as the default.  If "universal_newlines" is true, then
> the TextIOWrapper will automatically translate the bytes "\r\n" into a
> single newline character during reads.  If "crlf" is True, then a
> newline will be written as "\r\n".  If "crlf" is False, then a newline
> will be written as "\n".  If "crlf" is None, then a system-specific
> default will be used.
>
> Another way to do it is as follows (we should pick one or the other):
>
>     .__init__(self, buffer, encoding=None, newline=None)
>
>        Same as above, but if newline is not None, use it as the
> newline pattern (for reading and writing).  If newline is not set,
> attempt to detect the newline pattern from the file; if that fails,
> fall back to the system default newline pattern.
>
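
The explicit-newline variant can be sketched as a pair of helper functions.
decode_text and encode_text are hypothetical names, the UTF-8 default is an
assumption, and the sketch works on a whole buffer at once; a streaming
wrapper would also have to handle a newline pattern split across two reads.

```python
def decode_text(data, encoding="utf-8", newline="\n"):
    # Decode bytes, then map the file's newline pattern to "\n".
    text = data.decode(encoding)
    if newline != "\n":
        text = text.replace(newline, "\n")
    return text


def encode_text(text, encoding="utf-8", newline="\n"):
    # Map "\n" to the file's newline pattern, then encode to bytes.
    if newline != "\n":
        text = text.replace("\n", newline)
    return text.encode(encoding)
```

With newline="\r\n", a round trip through encode_text and decode_text returns
the original text.
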
> Another implementation, StringIO, creates a file-like TextIO
> implementation without an underlying Buffer I/O object.  While similar
> functionality could be provided by wrapping a BytesIO object in a
> Buffered I/O object in a TextIOWrapper, the String I/O object allows
> for much greater efficiency as it does not need to actually perform
> encoding and decoding.  A String I/O object can just store the encoded
> string as-is.  The String I/O object's __init__ signature is similar
> to the TextIOWrapper, but without the "buffer" parameter.
>
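
The two routes compared above can be illustrated with the io module that
Python later shipped.  That module postdates this draft, so its io.StringIO,
io.BytesIO, and io.TextIOWrapper are used here only as stand-ins for the
classes being proposed; the helper function names are hypothetical.

```python
import io


def first_line_direct(text):
    # Text is stored as-is; nothing is encoded or decoded.
    return io.StringIO(text).readline()


def first_line_layered(text, encoding="utf-8"):
    # The same text, but encoded to bytes and decoded again on read.
    raw = io.BytesIO(text.encode(encoding))
    return io.TextIOWrapper(raw, encoding=encoding).readline()
```

Both functions return the same line; the layered version pays for an encode
and a decode that the direct version avoids.
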
> END OF PEP

-- 
Daniel Stutzbach, Ph.D.             President, Stutzbach Enterprises LLC