[Python-Dev] PEP 393 Summer of Code Project

Stephen J. Turnbull stephen at xemacs.org
Wed Aug 31 08:03:52 CEST 2011


Guido van Rossum writes:
 > On Tue, Aug 30, 2011 at 7:55 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

 > > For starters, one that doesn't ever return lone surrogates, but rather
 > > interprets surrogate pairs as Unicode code points as in UTF-16.  (This
 > > is not a Unicode standard definition, it's intended to be suggestive
 > > of why many app writers will be distressed if they must use Python
 > > unicode/str in a narrow build without a fairly comprehensive library
 > > that wraps the arrays in operations that treat unicode/str as an array
 > > of code points.)
 > 
 > That sounds like a contradiction -- it wouldn't be a UTF-16 array if
 > you couldn't tell that it was using UTF-16.

Well, that's why I wrote "intended to be suggestive".  The Unicode
Standard does not specify at all what the internal representation of
characters may be; it only specifies what their external behavior must
be when two processes communicate.  (For "process" as used in the
standard, think "Python modules" here, since we are concerned with the
problems of folks who develop in Python.)  When observing the behavior
of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
even UTF-32 arrays; only arrays of characters.
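To make that concrete: on a pre-PEP-393 narrow build, a character
outside the BMP is stored as two code units, and any library that
presents an "array of characters" has to recombine them.  A minimal
sketch of that recombination (the function name is mine, purely
illustrative, not part of any proposed API):

```python
def combine_surrogates(hi, lo):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    assert 0xD800 <= hi <= 0xDBFF, "expected a high (lead) surrogate"
    assert 0xDC00 <= lo <= 0xDFFF, "expected a low (trail) surrogate"
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# U+1F600 is encoded in UTF-16 as the code-unit pair D83D DE00:
assert combine_surrogates(0xD83D, 0xDE00) == 0x1F600
```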

Thus, according to the rules of handling a UTF-16 stream, it is an
error to observe a lone surrogate or a surrogate pair that isn't a
high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and
C8-C10).  That's what I mean by "can't tell it's UTF-16".  And I
understand those requirements to mean that operations on UTF-16
streams should produce UTF-16 streams, or raise an error.  Without
that closure property for basic operations on str, I think it's a bad
idea to say that the representation of text in a str in a pre-PEP-393
"narrow" build is UTF-16.  For many users and app developers, it
creates expectations that are not fulfilled.
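For reference, those conformance requirements amount to a simple
well-formedness condition on the code-unit stream: every high surrogate
must be immediately followed by a low surrogate, and no surrogate may
appear on its own.  A sketch of such a check (again illustrative, not
any existing API):

```python
def is_well_formed_utf16(units):
    """Check a sequence of 16-bit code units for UTF-16 well-formedness."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:            # high (lead) surrogate
            if i + 1 == len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False                  # lone high surrogate
            i += 2                            # valid high-low pair
        elif 0xDC00 <= u <= 0xDFFF:           # low surrogate with no lead
            return False
        else:
            i += 1                            # ordinary BMP code unit
    return True

assert is_well_formed_utf16([0x0048, 0xD83D, 0xDE00])  # "H" + U+1F600
assert not is_well_formed_utf16([0xDE00, 0xD83D])      # reversed pair
assert not is_well_formed_utf16([0xD83D])              # lone surrogate
```

The closure property discussed above would mean that every basic string
operation either preserves this predicate or raises an error.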

It's true that, in common usage, an array of code units that usually
conforms to UTF-16 may be called "UTF-16" even without the closure
properties.  I just disagree with that usage, because there are two
camps that interpret "UTF-16" differently.  One side says, "we have an
array representation in UTF-16 that can handle all Unicode code points
efficiently, and if you think you need more, think again", while the
other says "it's too painful to have to check every result for valid
UTF-16, and we need a UTF-16 type that supports the usual array
operations on *characters* via the usual operators; if you think
otherwise, think again."

Note that despite the (presumed) resolution of the UTF-16 issue for
CPython by PEP 393, at some point a very similar discussion will take
place over "characters" anyway, because users and app developers are
going to want a type that handles composition sequences and/or
grapheme clusters for them, as well as comparison that respects
canonical equivalence, even if it is inefficient compared to str.
That's why I insisted on use of "array of code points" to describe the
PEP 393 str type, rather than "array of characters".
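The canonical-equivalence point is easy to demonstrate with the stdlib
unicodedata module: two strings a user would consider the same
character compare unequal as arrays of code points.

```python
import unicodedata

# Two canonically equivalent spellings of "é":
decomposed = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE

# str comparison is code-point comparison, so these differ...
assert decomposed != precomposed

# ...even though normalization shows they are canonically equivalent:
assert (unicodedata.normalize("NFC", decomposed)
        == unicodedata.normalize("NFC", precomposed))

# And len() counts code points, not grapheme clusters:
assert len(decomposed) == 2   # one user-perceived character
```

A "character" type of the kind described above would make the first
comparison succeed and the length come out as 1.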
