[Python-3000] Google Sprint Ideas

Mon Aug 21 06:32:18 CEST 2006

On 8/20/06, Talin <talin at acm.org> wrote:
> Guido van Rossum wrote:
> > On 8/20/06, Talin <talin at acm.org> wrote:
> >> Guido van Rossum wrote:
> >> > On 8/20/06, Paul Moore <p.f.moore at gmail.com> wrote:
> >>
> >> > Without endorsing every detail of his design, tomer filiba has written
> >> > several blog (?) entries about this, the latest being
> >> > http://sebulba.wikispaces.com/project+iostack+v2 . You can also look
> >> > at sandbox/sio/sio.py in svn.
> >>
> >> One comment after reading this: If we're going to re-invent the Java/C#
> >> i/o library, could we at least use the same terminology? In particular,
> >> the term "Layer" has connotations which may be confusing in this context
> >> - I would prefer something like "Adapter" or "Filter".
> >
> > That's an example of what I meant when I said "without endorsing every
> > detail".
> >
> > I don't know which terminology C++ uses beyond streams. I think Java
> > uses Streams for the lower-level stuff and Reader/Writer for the
> > higher-level stuff -- or is it the other way around?
>
> Well, the situation with Java is kind of complex. There are two sets of
> stream classes, but rather than classifying them as "low-level" and
> "high-level", a better classification is "old" and "new". The old
> classes (InputStream/OutputStream) are byte-oriented, whereas the newer
> ones (Reader/Writer) are character-oriented. It is not the case,
> however, that the character-oriented interface sits on top of the
> byte-oriented interface - rather, both interfaces are implemented by a
> number of different back ends.

How sure are you of all that? I always thought that these have about
the same age, and that the main distinction is byte vs. char
orientation. Also, the InputStreamReader class clearly sits on top of
the InputStream class (but surprisingly recommends that for efficiency
you do buffering on the reader side instead of on the stream side --
should we consider this for Python too?). And FileReader is a subclass
of InputStreamReader. (OK, further investigation does show that
FileInputStream exists since JDK 1.0 while InputStreamReader exists
since JDK 1.1. But there's much newer Java I/O in the "nio" package,
and there's work going on for "nio2", JSR 203.)

> For purposes of Python, it probably makes more sense to look at the .Net
> System.IO.Stream. (As a general rule, the .Net classes are refactored
> versions of the Java classes, which is both good and bad. It's best to
> study both if one is looking for inspiration.)

Perhaps you can tell us more about that? I've used the Java I/O system
sufficiently to have a feel for how it is actually used, which helps
me find my way in the docs; but for .NET I fear that I would have to
go on a sabbatical to make sense of it. And I don't have time for
that.

> Hmmm, apparently the .Net documentation *does* use the term 'layer' to
> describe one stream wrapping another - which I still find strange. To my
> mind, the term 'layer' can either describe a particular design stratum
> within an architecture - such as the 'device layer' of an operating
> system - or it can describe a portion of a document, such as a drawing
> layer in a CAD program.

It's used whenever you could draw a diagram of several layers of
software sitting on top of each other. Perhaps usually layers are
bigger (like device layers) but I see nothing wrong with declaring
that Python I/O consists of three layers.

> I don't normally think of a single instance of a
> class wrapping another instance as constituting a "layer" - I usually
> use the term "adapter" or "proxy" to describe that case.
>
> (OK, so I'm pedantic about naming. Now you know why one of my side
> projects is writing an online programmer's thesaurus -- using
> Python/TurboGears of course!)

Wouldn't it make more sense to contribute to Wikipedia at this point?

> >> Also, I notice that this proposal removes what I consider to be a nice
> >> feature of Python, which is that you can take a plain file object and
> >> iterate over the lines of the file -- it would require a separate line
> >> buffering adapter to be created. I think I understand the reasoning
> >> behind this - in a world with multiple text encodings, the definition of
> >> "line" may not be so simple. However, I would assume that the "built-in"
> >> streams would support the most basic, least-common-denominator encodings
> >> for convenience.
> >
> > First time I noticed that. But perhaps it's the concept of "plain file
> > object" that changed? My own hierarchy (which I arrived at without
> > reading tomer's proposal) is something like this:
> >
> > (1) Basic level (implemented in C) -- open, close, read, write, seek,
> > tell. Completely unbuffered, maps directly to system calls. Does
> > binary I/O only.
> >
> > (2) Buffering. Implements the same API as (1) but adds buffering. This
> > is what one normally uses for binary file I/O. It builds on (1), but
> > can also be built on raw sockets instead. It adds an API to inquire
> > about the amount of buffered data, a flush() method, and ways to
> > change the buffer size.
> >
> > (3) Encoding and line endings. Implements a somewhat different API,
> > for reading/writing text files; the API resembles Python 2's I/O
> > library more. This is where readline() and next() giving the next line
> > are implemented. It also does newline translation to/from the
> > platform's native convention (CRLF or LF, or perhaps CR if anyone
> > still cares about Mac OS <= 9) and Python's convention (always \n). I
> > think I want to put these two features (encoding and line endings) in
> > the same layer because they are both text related. Of course you can
> > specify ASCII or Latin-1 to effectively disable the encoding part.
> >
> > Does this make more sense?
>
> I understood that much -- this is pretty much the way everyone does
> things these days (our own custom stream library at work looks pretty
> much like this too.)

So you have the buffering between the binary I/O and the text I/O too?

> The question I was wondering is, will the built-in 'file' function
> return an object of level 3?

I am hoping to get rid of 'file' altogether. Instead, I want to go
back to 'open'. Calling open() with a binary mode argument would
return a layer 2 or layer 1 (if unbuffered) object; calling it with a
text mode would return a layer 3 object. open() would grow additional
keyword parameters to specify the encoding, the desired newline
translation, and perhaps other aspects of the layering that might need
control.
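
A rough sketch of that dispatch, not a settled design. The name
sketch_open and its buffering/newline parameters are made up for
illustration, and the three layers are stood in for by io.FileIO,
io.BufferedReader/io.BufferedWriter and io.TextIOWrapper from the io
module of present-day Python 3. Only the "r", "w" and "a" modes (with
an optional "b") are handled.

```python
import io
import os
import tempfile


def sketch_open(path, mode="r", buffering=True, encoding="utf-8", newline=None):
    """Hypothetical open(): choose the layer from the mode string."""
    binary = "b" in mode
    raw_mode = mode.replace("b", "").replace("t", "") or "r"
    if raw_mode not in ("r", "w", "a"):
        raise ValueError("this sketch only handles r, w and a")

    # Layer 1: unbuffered, binary only, close to the system calls.
    raw = io.FileIO(path, raw_mode)
    if binary and not buffering:
        return raw

    # Layer 2: same kind of API as layer 1, plus a buffer and flush().
    if raw_mode == "r":
        buffered = io.BufferedReader(raw)
    else:
        buffered = io.BufferedWriter(raw)
    if binary:
        return buffered

    # Layer 3: encoding and newline translation; readline() and line
    # iteration live here.
    return io.TextIOWrapper(buffered, encoding=encoding, newline=newline)


with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.txt")
    with sketch_open(demo_path, "w", newline="\n") as f:
        f.write("one\ntwo\n")
    with sketch_open(demo_path, "rb", buffering=False) as f:
        raw_type, raw_bytes = type(f), f.read()
    with sketch_open(demo_path, "rb") as f:
        buffered_type = type(f)
    with sketch_open(demo_path, "r") as f:
        text_type, lines = type(f), list(f)
```

In this sketch, iterating over lines works only on the object returned
for a text mode; the binary objects hand back bytes and leave line
splitting to the layer above.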

BTW in response to Alexander Belopolsky: yes, I would like to continue
support for something like readinto() by layer 1 and maybe 2 (perhaps
even more flexible, e.g. specifying a buffer and optional start and
end indices). I don't think it makes sense for layer 3 since strings
are immutable. I agree with Martin von Loewis that a readv() style API
would be impractical (and I note that Alexander doesn't provide any
use case beyond "it's more efficient").
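
One way the "buffer plus optional start and end indices" variant could
look, as a sketch only. The helper name readinto_range is hypothetical;
it relies on the readinto() method that binary streams in Python 3's io
module provide, and on memoryview slicing to select the target range
without copying.

```python
import io


def readinto_range(stream, buf, start=0, end=None):
    """Hypothetical helper: fill buf[start:end] in place from stream.

    Returns the number of bytes actually read, which may be smaller
    than the size of the range.
    """
    view = memoryview(buf)[start:end]  # writable window onto buf, no copy
    return stream.readinto(view)


source = io.BytesIO(b"abcdefgh")
target = bytearray(b"........")
count = readinto_range(source, target, 2, 6)
```

Here count is 4 and target holds b"..abcd..": only the requested slice
of the caller's buffer is overwritten.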

A use case that I do think is important is reading encoded text data
asynchronously from a socket. This might mean that layers 2 and 3 may
have to be aware of the asynchronous (non-blocking or timeout-driven)
nature of the I/O; reading from layer 3 should give as many characters
as possible without blocking for I/O more than the specified timeout.
We should also decide how asynchronous I/O calls report "no more data"
-- exceptions are inefficient and cause clumsy code, but if we return
"", how can we tell that apart from EOF? Perhaps we can use None to
indicate "no more data available without blocking", and keep "" to
indicate EOF. (The other way around makes just as much sense but would
be a bigger break with Python's past than this particular issue is
worth to me.)
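
A small sketch of the None-versus-"" convention at the text layer.
FakeRawSocket and TextReader are hypothetical names invented for this
illustration; the fake source simply replays a scripted list of events
instead of doing real non-blocking I/O. The decoding uses the
incremental decoders from the codecs module, so a multi-byte character
split across two reads is held back rather than reported as text.

```python
import codecs


class FakeRawSocket:
    """Hypothetical stand-in for a non-blocking layer 1 object.

    Each scripted event is either a bytes chunk or None, where None
    means "no data available without blocking".
    """

    def __init__(self, events):
        self._events = list(events)

    def read(self, n):
        # n is ignored in this sketch; each chunk is returned whole.
        if not self._events:
            return b""  # script exhausted: end of file
        return self._events.pop(0)


class TextReader:
    """Hypothetical layer 3 reader using None for "would block"."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def read(self, n=4096):
        data = self._raw.read(n)
        if data is None:
            return None  # nothing available right now
        if data == b"":
            # EOF: flush the decoder. This returns "" on a clean end
            # and raises if a character was cut off.
            return self._decoder.decode(b"", final=True)
        text = self._decoder.decode(data)
        # Bytes arrived but no complete character yet: report "no text
        # yet" as None so it cannot be mistaken for EOF.
        return text if text else None


script = [None, b"caf", b"\xc3", None, b"\xa9"]
reader = TextReader(FakeRawSocket(script))
results = [reader.read() for _ in range(6)]
```

The partial-character case is the reason the sketch maps an empty
decode result to None: otherwise a read that delivered only half of a
character would look exactly like EOF.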

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)