[Python-3000] On PEP 3116: new I/O base classes

Fri Jun 22 02:39:34 CEST 2007

On 6/20/07, Bill Janssen <janssen at parc.com> wrote:
> > > TextIOBase: this seems an odd mix of high-level and low-level.  I'd
> > > remove "seek", "tell", "read", and "write".  Remember that in Python,
> > > mixins actually work, so that you can provide a file object that
> > > combines several different I/O classes.
> >
> > Huh? All those operations you want to remove are entirely necessary
> > for a number of applications. I'm not sure what you meant about mixins?
>
> I meant that TextIOBase should just provide the operations for text.
> The other operations would be supported, when appropriate, by mixing
> in an appropriate class that provides them.  Remember that this is
> a PEP about base classes.

Um, it's not meant to be just about base classes -- it's also meant to
be about the actual implementations -- both abstract and concrete
classes will be importable from the same module, 'io'. Have you
checked out io.py in the p3yk branch?

> > It doesn't work? Why not? Of course read() should take the number of
> > characters as a parameter, not number of bytes.
>
> Unfortunately, files contain encodings of characters, and those
> encodings may at times be mapped to multiple equivalent strings, at
> least with respect to Unicode, the target for Python-3000.  The
> standard Unicode support for Python-3000 seems to be settling on
> having code-point representations of those strings exposed to the
> application, which means that any specific automatic normalization is
> precluded.  So any particular "readchars(1)" operation may validly
> return different strings even if operating on the same underlying
> file, and may require a different number of read operations to read
> the same underlying bytes.  That is, I believe that the string and/or
> file operations are not well-specified enough to guarantee that this
> won't happen.  This is the same situation we have today, which means
> that the only real way to read Unicode strings from a file will be the
> same as today, that is, read raw bytes from a file, decode them and
> normalize them in some specific way, and then see what string you wind
> up with.  You could probably fix this in the PEP by specifying a
> specific Unicode normalization to use when returning strings.

I don't understand exactly what you're saying, but here's the semantic
model from which I've been operating.

A file contains a sequence of bytes. If you read it all in one fell
swoop, and then decoded it to Unicode (using a specific encoding),
you'd get a specific text string. This is a sequence of code units.
(Whether they are valid code points or characters I don't think we can
guarantee -- I use the GIGO principle.)
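
As an illustration of the model above and of the normalization concern
quoted earlier, here is a minimal sketch. It assumes io.TextIOWrapper
and io.BytesIO as they exist in released Python 3, which may differ
from io.py in the p3yk branch; the helper first_char() is hypothetical.
Two canonically equivalent byte sequences decode to different strings,
and read(1) simply returns the first code point of each, with no
normalization applied.

```python
import io
import unicodedata

composed = "\u00e9"                                   # e-acute as one code point
decomposed = unicodedata.normalize("NFD", composed)   # "e" + combining acute


def first_char(s):
    """Hypothetical helper: encode s, wrap it as a text file, read one unit."""
    f = io.TextIOWrapper(io.BytesIO(s.encode("utf-8")), encoding="utf-8")
    return f.read(1)


a = first_char(composed)     # the single precomposed code point
b = first_char(decomposed)   # only the base letter "e"

# The two results differ unless the application normalizes explicitly.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```

Under this reading, normalization is left to the application, which is
consistent with the GIGO principle mentioned above.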

*Conceptually*, read(n) simply returns the next n code units;
readline() is equivalent to read(n) for some n, whose value is
determined by looking ahead until the first \n is found.
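
A small sketch of that equivalence, assuming the io.TextIOWrapper and
io.BytesIO classes of released Python 3 (not necessarily the names in
io.py at the time); open_text() is a hypothetical helper. newline="\n"
is passed so that no newline translation interferes.

```python
import io

data = "héllo\nwörld\n".encode("utf-8")
whole = data.decode("utf-8")          # the conceptual "one fell swoop" string


def open_text():
    """Hypothetical helper: a fresh text view over the same bytes."""
    return io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="\n")


# read(n) returns the next n units of the decoded text, not n bytes.
first = open_text().read(2)

# readline() behaves like read(n), with n found by scanning for "\n".
line = open_text().readline()
n = whole.index("\n") + 1
same = open_text().read(n)
```

Here first covers three bytes of UTF-8 but only two characters, and
line and same are the same string.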

Universal newlines collapse \r\n into \n and turn lone \r into \n (or
whatever algorithm is deemed right; I'm not sure the lone-\r case is
still needed) *before* we reach the sequence of code units that read()
and readline() see.
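
A sketch of that translation step, assuming the newline=None mode of
io.TextIOWrapper in released Python 3 (the exact switch in io.py may
have differed at the time):

```python
import io

raw_bytes = b"one\r\ntwo\rthree\n"

# newline=None enables universal newlines on input: \r\n and lone \r
# are both translated to \n before read()/readline() see the text.
text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="ascii", newline=None)
content = text.read()

# Iterating by line sees only "\n" terminators.
lines = list(io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="ascii",
                              newline=None))
```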

Files are all about making this conceptual model efficient even if the
file doesn't fit in memory. We have incremental codecs which make this
possible. (We always assume the file doesn't change while we're
reading it; if it does, certain bets are off.)
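
A minimal sketch of what an incremental codec provides, using
codecs.getincrementaldecoder() from the standard library. The chunk
size of 3 bytes is an arbitrary assumption, chosen so that a chunk
boundary falls in the middle of a multi-byte character.

```python
import codecs

data = "naïve café".encode("utf-8")   # 12 bytes, two 2-byte characters

decoder = codecs.getincrementaldecoder("utf-8")()
pieces = []
for i in range(0, len(data), 3):
    chunk = data[i:i + 3]
    # The decoder buffers an incomplete trailing byte sequence and
    # emits it once the remaining bytes arrive in a later chunk.
    pieces.append(decoder.decode(chunk))
pieces.append(decoder.decode(b"", final=True))   # flush; errors if truncated

result = "".join(pieces)
```

The first chunk ends halfway through "ï", so only "na" comes out of it;
the joined result is the same string a one-shot decode would give.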

In my mind, tell() and seek() should work like fgetpos() and fsetpos()
in modern C stdio -- tell() returns a "cookie" whose only use is that
you can later pass it to seek(), which will reset the position in the
sequence of code units to where it was when tell() was called. For
many encodings, in practice, seek() and tell() can just use byte
positions since the boundaries between code points always fall on byte
boundaries (but not the other way around). For other encodings, the
implementation currently in io.py encodes the incremental codec state
in the (very) high bits of the cookie (this is convenient since we
have arbitrary precision integers).
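
A sketch of the cookie round trip, assuming io.TextIOWrapper from
released Python 3. The cookie is treated as opaque here: the code
never inspects its value, it only hands it back to seek().

```python
import io

raw = io.BytesIO("αβγδε".encode("utf-8"))   # each letter is 2 bytes
text = io.TextIOWrapper(raw, encoding="utf-8")

head = text.read(2)
cookie = text.tell()      # opaque position marker, not a character index
tail = text.read()

text.seek(cookie)         # return to where tell() was called
again = text.read()
```

After the seek, the second read returns the same text as the first one
did from that position.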

Relative seeks (except for a few special cases) are not supported for text files.
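
A sketch of how this looks in released Python 3, which is an assumption
about later behavior rather than a statement about io.py at the time:
a nonzero relative seek on a text file raises io.UnsupportedOperation,
while seeking to the very end or back to the start is allowed.

```python
import io

text = io.TextIOWrapper(io.BytesIO(b"hello world"), encoding="ascii")
text.read(5)

try:
    text.seek(3, io.SEEK_CUR)      # nonzero offset relative to current pos
    relative_ok = True
except io.UnsupportedOperation:
    relative_ok = False

text.seek(0, io.SEEK_END)          # zero offset from the end is accepted
at_end = text.read()               # nothing left to read

text.seek(0)                       # absolute seek to the start
from_start = text.read()
```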

> > > feel the need.  Stick to just "readline" and "writeline" for text I/O.
> >
> > Ah, not everyone dealing with text is dealing with line-delimited
> > text, you know...
>
> It's really the only difference between text and non-text.

Again, I don't quite follow this.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)