[Python-3000] Thoughts on new I/O library and bytecode

Wed Feb 21 06:52:08 CET 2007

"Guido van Rossum" <guido at python.org> wrote:
> [Note: changed subject]
> On 2/20/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > I'm not so sure.  The return type on socket.recv and os.read could be
> > changed to bytes (seemingly without much difficulty),
> 
> Yes, that's the plan anyway.

Better than returning unicode, but not as good as returning "binary".

> > and likely could
> > even be changed to *take* a bytes object as the destination buffer
> > (ditto for files opened as 'raw').
> 
> This already works -- bytes support the buffer API.

I was thinking of...

    buff = bytes(4096*[0])
    received = sock.recv(buff)

It's really only useful when you have a known protocol with fixed-size
blocks, but need it to run more or less forever.  By fixing the buffer
size, you can significantly reduce memory fragmentation.
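
As a rough illustration of the fixed-buffer idea (not the recv(buff)
spelling proposed above): Python 3's socket module has recv_into(),
which fills a caller-supplied mutable buffer.  The sketch below assumes
a bytearray as the buffer and a local socketpair() standing in for a
real connection; BLOCK_SIZE and read_block are hypothetical names.

```python
import socket

BLOCK_SIZE = 4096  # assumed fixed protocol block size, for illustration only


def read_block(sock, buff):
    """Fill buff from sock, reusing the same memory; return bytes received."""
    view = memoryview(buff)  # lets us write at an offset without copying
    received = 0
    while received < len(buff):
        n = sock.recv_into(view[received:])
        if n == 0:  # peer closed the connection before a full block arrived
            break
        received += n
    return received


# Demo: a connected pair of local sockets stands in for a real peer.
a, b = socket.socketpair()
buff = bytearray(BLOCK_SIZE)  # allocated once, reused for every block
a.sendall(b"x" * BLOCK_SIZE)
received = read_block(b, buff)
a.close()
b.close()
```

The same bytearray can be passed to read_block() for every subsequent
block, so a long-running server never allocates a new receive buffer.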

> > Then again, I've been "eh?" on the whole I/O library thing, and
> > generally annoyed at the "everything is unicode" idea.
> 
> Well, unless you remove the str type, how are you going to get rid of
> the endless problems with unicode where mixing unicode and str
> sometimes works and sometimes doesn't?

Ooh, one of my favorite games!

* Explicit <conversion to unicode> is better than implicit.
* In the face of ambiguity, refuse the temptation to guess <what codec
to use to decode the string>.
* Errors <when adding strings to unicode> should never pass silently.

There are at least two approaches to solving the problem:
1) make everything unicode
2) make all implicit conversions an error.

Adding strings to unicode should produce an exception.  The fact that it
doesn't right now is, I believe, a result of implementation details
getting in the way of what should happen.  Remove the ambiguity, codec
guessing, etc., raise a TypeError("cannot concatenate str and unicode
objects"), and move on.

Don't allow up-casting in u''.join() or ''.join() (or their equivalents
in py3k).
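
For comparison, Python 3 as eventually released behaves this way for
its bytes and str types: mixing them raises TypeError, and the
conversion has to be spelled out with a named codec.  A minimal sketch,
assuming Python 3 (the variable names are illustrative only):

```python
raw = b"Header: value\r\n"   # binary data, e.g. read from a socket
text = "X-Extra: 1\r\n"      # text

# Implicit mixing is refused rather than guessed at.
try:
    combined = raw + text
except TypeError as exc:
    concat_error = exc
else:
    concat_error = None

# join() does not up-cast either.
try:
    joined = b"".join([raw, text])
except TypeError as exc:
    join_error = exc
else:
    join_error = None

# The explicit version names the codec, so nothing is ambiguous.
combined = raw.decode("latin-1") + text
```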

> > Converting all
> > libraries that currently deal with IO is going to be a pain, especially
> > if it does any sort of parsing of mixed binary and non-unicode textual
> > data (like http headers combined with binary posted data or a utf-8
> > encoded stream).
> 
> Yeah, I'm not looking forward to that, but I expect it'll be
> relatively straightforward once we figure out the right patterns;
> there's just a lot of code to convert. But that's the whole Py3k plan.

No offense, but the plan to convert it all to use bytes stinks.
Starting with the API defined in PEP 358, I started converting smtpd (as
an example), and I found myself *wanting* to use unicode because the
whole numeric constants and/or bytes('unicode', 'latin-1') got really
old really fast.

> > As a heavy user of quite a few of the current standard library IO
> > modules (SocketServer, asyncore, urllib, socket, etc.) and as someone
> > who has the "opportunity" to write line-level protocols, I'd be quite
> > happy with the following...
> >
> > 1) add bytes (or add features to array)
> > 2) rename unicode to text (or str)
> > 3) renaming str to bin (or some other sufficiently clear name)
> 
> So you'd have THREE types (bytes, text, bin)? Or are you proposing bin
> instead of bytes, contrary to what you suggested above?

While I would have some personal uses for bytes, all of them could be
fulfilled with an expanded array type.  If I could have my way
<dreaming>I'd rename string and unicode, fold some of the features of
bytes into array, and make socket, etc., return the renamed string
type</dreaming>. In the standard library modules that deal with
sockets, the only change would generally be replacing 'const' with
b'const'.  That could *almost* be automatic, and would be significantly
faster (for a computer + human) than converting all of the .split(),
.find(), etc., uses in ftplib, *Server, smtplib, smtpd, etc. to
bytes equivalents (or converting to and from unicode).

It would take me perhaps 20 minutes to update asyncore, asynchat and
smtpd with the b'binary' semantics.  Based on the last list of methods I
saw for bytes in PEP 358, I would be, more or less, calling
bytes.decode('latin-1') instead of trying to deal with the *crippled*
interface that bytes offers.
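
A small sketch of that decode-then-parse workaround, assuming Python 3
semantics for bytes and str.  latin-1 is the usual choice because it
maps each of the 256 byte values to exactly one code point, so the
round trip loses nothing.  The SMTP-style line is an invented example.

```python
# An invented line, as it might arrive from a socket.
raw = b"MAIL FROM:<user@example.com>\r\n"

# Decode once, then use the ordinary text methods for parsing.
line = raw.decode("latin-1")
command, _, argument = line.rstrip("\r\n").partition(":")
command = command.upper()

# latin-1 round-trips every possible byte value unchanged.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1")
```

The cost is one extra copy in each direction, which is the overhead
being objected to here.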

Regardless, the performance of those modules would likely suffer when
confronted with bytes rather than a renamed str, as the current bytes
type lacks a large number of convenience methods.  I have complained
about their absence before (which is why I brought up the string
view and sample implementation in late August/early September 2006).

> > 4) making string literals 'hello' be unicode
> > 5) allow for b'constant' be the renamed str
> > 6) add a mandatory 3rd argument to file/open which is the codec to use
> > for reading
> 
> And how does that help users or compatibility?

Users who need binary literals (like every socket module in the standard
library, anyone who does processing of any non-unicode disk/socket/pipe
data, like marshal or pickle, etc.) wouldn't go insane and add bugs
trying to switch to the bytes type, or add performance overhead trying
to convert the received bytes to unicode to get a useful API.

> > 7) offer a new function for opening 'binary' files (which are opened as
> > 'rb' or 'wb' whenever 'r' or 'w' are passed, respectively), which will
> > remove confusion on Windows platforms
> 
> This is a red herring. Or I'm not sure I understand this part of your
> proposal. What's wrong with 'rb'?

Presumption:
    a = open(filename, 'r' or 'w' ['+'], codec)
will open a file as unicode in Py3k (if I am wrong, please correct me).

Proposal:
    b = somename(filename, 'r' or 'w' ['+'])
will be equivalent to:
    b = open(filename, 'rb' or 'wb' ['+'])
today.  This prevents the confusion over different argument values
resulting in different types being returned and accepted by certain
methods.
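
A minimal sketch of what such a function could look like, written
against today's open().  The name open_binary is hypothetical (the
proposal above deliberately leaves it as "somename"), and rejecting an
explicit 'b' or 't' is an assumption, not part of the proposal.

```python
import os
import tempfile


def open_binary(filename, mode="r"):
    """Hypothetical stand-in for 'somename': 'r'/'w' always mean binary."""
    if "b" in mode or "t" in mode:
        raise ValueError("pass 'r' or 'w', optionally followed by '+'")
    # 'r' -> 'rb', 'w' -> 'wb', 'r+' -> 'r+b', 'w+' -> 'w+b'
    return open(filename, mode + "b")


# Demo: raw bytes, including a Windows-style line ending, survive intact.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open_binary(path, "w") as f:
    f.write(b"line one\r\n\x00\xff")
with open_binary(path, "r") as f:
    data = f.read()
```

With this split, the type a file object returns depends on which
function opened it, not on a flag buried in the mode string.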

> > Indeed, it isn't as revolutionary as "everything is unicode", but it
> > would allow the standard library to be updated with a relative minimum
> > of fuss and muss, without needing to intermix...
> >     x = bytes.decode('latin-1').USEFUL_UNICODE_METHOD(...)
> > or
> >     sock.send(unicode.encode('latin-1'))
> 
> Actually, with the renamings and everything, it's just about as
> disruptive as the current proposal, so I'm unclear why you think this
> is so different.

    sock.send(b'Header: value\r\n')
              ^
The above change can be more or less automatic.  The below?

    sock.send(bytes('Header: value\r\n', 'latin-1'))

    sock.send('Header: value\r\n'.encode('latin-1'))

Either of the above is 17 characters of noise that really shouldn't need
to be there.
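
The count can be checked mechanically.  The sketch below assumes
Python 3, where all three spellings are valid, and confirms that they
produce the same bytes and that the two longer forms each add 17
characters.

```python
# The three spellings from above, evaluated.
literal = b'Header: value\r\n'
via_constructor = bytes('Header: value\r\n', 'latin-1')
via_encode = 'Header: value\r\n'.encode('latin-1')

# The same three calls as source text (raw strings keep the backslashes).
src_literal = r"sock.send(b'Header: value\r\n')"
src_constructor = r"sock.send(bytes('Header: value\r\n', 'latin-1'))"
src_encode = r"sock.send('Header: value\r\n'.encode('latin-1'))"

# Extra characters relative to the b'...' literal.
noise_constructor = len(src_constructor) - len(src_literal)
noise_encode = len(src_encode) - len(src_literal)
```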

 - Josiah