[Python-Dev] len(chr(i)) = 2?

Mon Nov 22 07:14:46 CET 2010

R. David Murray writes:

 > I'm sorry, but I have to disagree.  As a relative unicode ignoramus,
 > "UCS-2" and "UCS-4" convey almost no information to me, and the bits I
 > have heard about them on this list have only confused me.

OK, point taken.

 > On the other hand, I understand that 'narrow' means that fewer
 > bytes are used for each internal character, meaning that some
 > unicode characters need to be represented by more than one string
 > element, and thus that slicing strings containing such characters
 > on a narrow build causes problems.  Now, you could tell me the same
 > information using the terms 'UCS-2' and 'UCS-4' instead of 'narrow'
 > and 'wide', but to my ear 'narrow' and 'wide' convey a better gut
 > level feeling for what is going on than 'UCS-2' and 'UCS-4' do.

I think that is probably conditioned by your long experience with
Python's Unicode features, specifically the knowledge that Python's
Unicode strings are not arrays of characters, a fact that is often
mentioned on this list.

My guess is that very few newbies would know that, and it is not
implied by "narrow".  For example, both Emacs (for sure) and Perl
(IIUC) index strings of variable-width characters by character (at
great expense of performance in Emacs, at least), not by code unit.

 > And it avoids any question of whether or not Python's internal
 > representation actually conforms to whatever standard it is that
 > UCS refers to, a point on which there seems to be some dissension.

UCS-2 refers to ISO 10646, Annex 1 IIRC.[1]  Anyway, it's somewhere in
ISO 10646.  I don't think there's actually dissension on conformance
to UCS-2, as that's very easy to achieve.  Rather, Guido explicitly
pronounced that Python processes arrays of code units, not
characters.  My point is that if you pretend that Python is processing
*characters* according to UCS-2 rules for characters, you'll always
come to the same conclusion about what Python will do as if you use
the technically correct terminology of code units.  (At least for the
BMP and UTF-16 private areas.  There will necessarily be some
confusion about surrogates, since in UCS-2 they are characters while
in UTF-16 they're merely "code points", and the Unicode characters
they represent can't be represented at all in UCS-2.)
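
A minimal sketch of the code-unit view, assuming a current Python in
which str is indexed by code point (so a narrow build has to be
emulated by encoding to UTF-16).  The helper name utf16_units is
hypothetical, introduced only for illustration:

```python
import struct

def utf16_units(s):
    """Return the 16-bit code units a narrow build would store for s.

    Hypothetical helper: it emulates narrow-build storage by encoding
    to UTF-16 (little-endian, no BOM) and unpacking 16-bit integers.
    """
    data = s.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

bmp = "\u00e9"         # U+00E9, inside the BMP
astral = "\U00010000"  # U+10000, the first code point outside the BMP

# A BMP character occupies one code unit; a non-BMP character occupies
# two (a surrogate pair), which is where len(chr(i)) == 2 came from.
print(len(utf16_units(bmp)), len(utf16_units(astral)))
print([hex(u) for u in utf16_units(astral)])
```

Under UCS-2 rules the two units of the pair would simply be two
unknown characters; under UTF-16 rules they are one character.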

 > Indeed, reading that article with my limited unicode knowledge, if
 > I were told Python used UCS-2, I would assume that non-BMP
 > characters could not be processed by a Python narrow build.

Actually, I'm almost happy with that.

That is, the precise formulation is "could not be processed *safely
without extra care* by a Python narrow build."  Specifically, AFAIK if
you range check characters that have been indexed out of a string, or
are located at slice boundaries, or produced by chr() or a
surrogateescape input codec, you're safe.  But practically speaking
few apps will actually do those checks and therefore they are unsafe:
processing non-BMP characters can easily lead to show-stopping
Exceptions.  It's very analogous to the kind of show-stopping "bad
character in a header" exception that plagued Mailman for so long, and
had to be fixed on a case-by-case basis.  But the restriction to BMP
characters is much more reasonable (at least for now) than RFC 822's
restriction to ASCII!
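
A rough sketch of the kind of range check meant here, operating on a
list of 16-bit code units as a narrow build would store them.  The
helper names (is_high_surrogate, is_low_surrogate, safe_cut) are
hypothetical, and this covers only the slice-boundary case, not chr()
or surrogateescape:

```python
def is_high_surrogate(unit):
    # High (lead) surrogates occupy U+D800..U+DBFF.
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    # Low (trail) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= unit <= 0xDFFF

def safe_cut(units, i):
    """Return a slice boundary at or just before i that does not
    fall between the two halves of a surrogate pair."""
    if (0 < i < len(units)
            and is_high_surrogate(units[i - 1])
            and is_low_surrogate(units[i])):
        return i - 1
    return i

# 'a', the pair for U+10000, 'b'
units = [0x0061, 0xD800, 0xDC00, 0x0062]
cut = safe_cut(units, 2)   # index 2 would split the pair
print(units[:cut], units[cut:])
```

An app that slices at arbitrary indexes without a check like this is
the "unsafe" case described above.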

But evidently you take it much more stringently.  So the question is,
"what fraction of developers who think as you do would therefore be
put off from using Python to build their applications?"  If most would
say "OK, we'll stick with BMP for now and use UCS-4 or some hack to
deal with extended characters later -- it can't really be true that
it's absolutely impossible to use non-BMP characters," I don't mind
that misunderstanding.

OTOH, yes, it would be bad if the use of "UCS-2" were to imply to more
than a couple of developers that 16-bit builds of Python can't handle
UTF-16 *at all*.

Footnotes: 
[1]  It simply says "we have a subset of the Unicode character set all
of whose code points can be represented in 16 bits, excluding 0xFFFF."
It goes on to define a private area, reserved for use by applications
that will never be standardized, and it says that if you don't know
what a code point in the character area is, don't change it (you can
delete it, however).  ISTR that a later Amendment added 0xFFFE to the
short-list of non-characters.

The surrogate area was taken out of the private area, so a UCS-2
application will simply consider each surrogate to be an unknown
character and pass it through unchanged -- unless it deletes it, or
inserts other characters between the code points of a surrogate pair.
And that's why UCS-2 isn't UTF-16 conforming -- which is basically why
Python isn't either.
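
A small sketch of that last failure mode, assuming Python's strict
"utf-16-le" codec as a stand-in for a UTF-16-conforming consumer.
The helpers pack_units and is_valid_utf16 are hypothetical names:

```python
import struct

def pack_units(units):
    """Pack 16-bit code units into little-endian bytes (no BOM)."""
    return struct.pack("<%dH" % len(units), *units)

def is_valid_utf16(units):
    """True if the strict UTF-16 decoder accepts the code units."""
    try:
        pack_units(units).decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True

pair = [0xD800, 0xDC00]            # surrogate pair for U+10000
broken = [0xD800, 0x0078, 0xDC00]  # 'x' inserted between the halves

print(is_valid_utf16(pair))
print(is_valid_utf16(broken))
```

Each unit in "broken" is acceptable on its own to a UCS-2 application,
but the sequence as a whole is no longer well-formed UTF-16.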