[Python-Dev] thoughts on the bytes/string discussion

Sat Jun 26 00:40:52 CEST 2010

Guido van Rossum <guido at python.org> wrote:

> On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz
> <glyph at twistedmatrix.com> wrote:
> >
> > On Jun 24, 2010, at 4:59 PM, Guido van Rossum wrote:
> >
> > Regarding the proposal of a String ABC, I hope this isn't going to
> > become a backdoor to reintroduce the Python 2 madness of allowing
> > equivalency between text and bytes for *some* strings of bytes and not
> > others.

I never actually replied to this...  Absolutely right, which is why you
might really want another kind of string, rather than a way to treat
some bytes values as strings in some places.  There are three types in
play (byte sequence, encoded string, Unicode string), and both Python 2
and Python 3 are missing one of them.  Python 1 and 2 didn't have
"bytes", and this caused problems because "str" was pressed into use to
hold arbitrary byte sequences.  (Python 2 "str" has other problems as
well, like losing track of the encoding.)  Python 3 doesn't have
Python 2's "str" (encoded string), and bytes are being pressed into use
for that.  Each of these uses is an ad hoc hijack of an inappropriate
type, and additional frameworks not directly supported by the Python
language are being jury-rigged to try to support the uses.

On the other hand, this is all in the eye of the beholder.  Both byte
sequences and strings are horrible formless things; they remind me of
BLISS.  You seldom really have a byte sequence; what you have is an XDR
float or an encoded string or an IP header or an email message.
Similarly for strings; they are really file names or city names or
English sentences or URIs or other things with significant semantic
constraints not captured by the typical type system.  So, yes, there
*is* an inescapable equivalency between text and bytes for *some*
sequences of bytes (those that represent encoded strings) and not others
(those that contain the XDR float, for instance).  Creating a separate
encoded string type would be one way to keep that straight.
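
To make the idea of a separate encoded string type concrete, here is a
minimal sketch.  The class name EncodedString and its attributes are
made up for illustration; nothing like this exists in the standard
library.  It holds bytes together with the encoding they are known to
be in, compares equal to text, and never compares equal to raw bytes.

```python
class EncodedString:
    """Hypothetical type: bytes known to be text in a named encoding."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding

    def __str__(self) -> str:
        # Decode on demand; the encoding travels with the bytes.
        return self.data.decode(self.encoding)

    def __eq__(self, other):
        if isinstance(other, (EncodedString, str)):
            return str(self) == str(other)
        # Raw bytes carry no encoding, so no equivalence is claimed.
        return NotImplemented

    def __hash__(self):
        return hash(str(self))


utf8 = EncodedString("café".encode("utf-8"), "utf-8")
latin1 = EncodedString("café".encode("latin-1"), "latin-1")
print(utf8 == "café", utf8 == latin1, utf8 == utf8.data)
```

The point of the design is that the equivalence with text applies only
to values that declare an encoding, and not to bytes in general.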

> > For my part, what I want out of a string ABC is simply the ability to do
> > application-specific optimizations.
> > There are many applications where all input and output is text, but _must_
> > be UTF-8.  Even GTK uses UTF-8 as its native text representation, so
> > "output" could just be display.
> > Right now, in Python 3, the only way to be "correct" about this is to copy
> > every byte of input into 4 bytes of output, then copy each code point *back*
> > into a single byte of output.  If all your application does is rewrite the
> > occasional XML attribute, for example, this cost can be significant, if not
> > overwhelming.
> > I'd like a version of 'decode' which would give me a type that was, in every
> > respect, unicode, and responded to all protocols exactly as other
> > unicode objects (or "str objects", if you prefer py3 nomenclature ;-)) do,
> > but wouldn't actually copy any of that memory unless it really needed to
> > (for example, to pass to a C API that expected native wide characters), and
> > that would hold on to the original bytes so that it could produce them on
> > demand if encoded to the same encoding again. So, as others in this thread
> > have mentioned, the 'ABC' really implies some stuff about C APIs as well.

Seems like it.
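
A rough pure-Python sketch of what Glyph describes, as I read it.  The
name LazyText is hypothetical, and a real version would need to
implement the whole str protocol (and probably C API support), which
this does not attempt.  It keeps the original bytes, decodes only when
the text is actually needed, and hands back the original bytes if asked
to encode to the same encoding again.

```python
import codecs


class LazyText:
    """Hypothetical wrapper: keeps encoded bytes, decodes only on demand."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self._raw = raw
        # Normalize the codec name so "UTF8" and "utf-8" compare equal.
        self._encoding = codecs.lookup(encoding).name
        self._text = None

    def __str__(self) -> str:
        if self._text is None:
            self._text = self._raw.decode(self._encoding)
        return self._text

    def encode(self, encoding: str = "utf-8") -> bytes:
        if codecs.lookup(encoding).name == self._encoding:
            return self._raw  # same encoding: no decode, no copy
        return str(self).encode(encoding)

    def __contains__(self, item) -> bool:
        return str(item) in str(self)


raw = "naïve".encode("utf-8")
text = LazyText(raw)
print(text.encode("UTF-8") is raw)  # round trip without decoding
print("ï" in text)                  # this is what forces a decode
```

Note that `"ï" in text` works because LazyText defines __contains__,
but `text in "some str"` raises TypeError, since str has no reflected
hook for "in".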

> > I'm not sure about the exact performance impact of such a class, which is
> > why I'd like the ability to implement it *outside* of the stdlib and see how
> > it works on a project, and return with a proposal along with some data.

Yes, exactly.

> >  There are also different ways to implement this, and other optimizations
> > (like ropes) which might be better.
> > You can almost do this today, but the lack of things like the hypothetical
> > "__rcontains__" does make it impossible to be totally transparent about it.
> 
> But you'd still have to validate it, right? You wouldn't want to go on
> using what you thought was wrapped UTF-8 if it wasn't actually valid
> UTF-8 (or you'd be worse off than in Python 2).

Yes, but there are different ways to validate it that have different
performance impacts.  Simply trusting the source of the string, for
example, would be appropriate in some cases.
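
As an illustration of the different costs, here is a sketch of three
validation policies.  The function name wrap_utf8 and its "trusted"
flag are invented for this example, and bytes.isascii() needs
Python 3.7 or later.

```python
def wrap_utf8(raw: bytes, trusted: bool = False) -> bytes:
    """Return raw if it is acceptable as UTF-8, else raise."""
    if trusted:
        return raw  # trust the source: no scan at all
    if raw.isascii():
        return raw  # cheap check: pure ASCII is always valid UTF-8
    # Full validation: decode and discard the result.
    # Raises UnicodeDecodeError on invalid input.
    raw.decode("utf-8")
    return raw


print(wrap_utf8(b"plain ascii"))
print(wrap_utf8("café".encode("utf-8")))
print(wrap_utf8(b"\xff\xfe", trusted=True))  # accepted unchecked
```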

> So you're really just worried about space consumption. I'd like to see
> a lot of hard memory profiling data before I got overly worried about
> that.

While I've seen some big Web pages, I think the email folks, who often
have to process messages with attachments measuring in the tens of
megabytes, have the bigger problem here, and I think speed may be
more important than memory.  I've built both a Web server and an IMAP
server in Python, and the IMAP server is where storage management
issues really dominate.  If you have to convert a 20 MB encoded
string into a Unicode string just to look at the headers as strings, you
have issues.  (The Python email package doesn't do that, by the way.)
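
One way to avoid the full conversion, sketched here under some
assumptions: the message is raw bytes with CRLF line endings, the
header block ends at the first blank line, and the helper name
header_block is made up.  This is not how the email package does it;
it only shows that the body never needs to be decoded.

```python
def header_block(message: bytes) -> str:
    """Decode only the header block of a raw message."""
    end = message.find(b"\r\n\r\n")  # blank line separating headers and body
    if end == -1:
        end = len(message)           # no body present
    # Only the header bytes are sliced and decoded; the body is untouched.
    return message[:end].decode("ascii", errors="replace")


msg = b"Subject: hi\r\nFrom: a@example.com\r\n\r\n" + b"x" * (20 * 1024 * 1024)
print(header_block(msg))
```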

Bill