[Python-Dev] len(chr(i)) = 2?

Sun Nov 21 18:38:25 CET 2010

On Sun, 21 Nov 2010 21:55:12 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> "Martin v. Löwis" writes:
>  > Am 20.11.2010 05:11, schrieb Stephen J. Turnbull:
>  > > "Martin v. Löwis" writes:
>  > >
>  > >  > The term "UCS-2" is a character set that can encode only 65536
>  > >  > characters; it thus refers to Unicode 1.1. According to the Unicode
>  > >  > Consortium's FAQ, the term UCS-2 should be avoided these days.
>  > >
>  > > So what do you propose we call the Python implementation?
>  >
>  > A technically correct description would be to say that Python uses either
>  > 16-bit code units or 32-bit code units; for brevity, these can be called
>  > narrow and wide code units.
> 
> I agree that's technically correct.  Unfortunately, it's also useless
> to anybody who doesn't already know more about Unicode than anybody
> should have to know.

[...]

> The point is that internal code is *not* UTF-16 (or -32), but it *is*
> isomorphic to UCS-2 (or -4).  *That is very useful information to
> users*, it's not a technical detail of interest only to Unicode geeks.
> It means that if you stick to defined characters in the BMP when
> giving Python input, then slicing and indexing unicode (Python 2) or
> str (Python 3) objects gives only valid output even in builds with
> 16-bit code units.  OTOH, invalid processing (involving functions like
> 'chr' or input using surrogateescape codecs) can lead to invalid
> output even in builds with 32-bit code units.
> 
> IMO, saying "UCS-2" or "UCS-4" tells ordinary developers most of what
> they need to know about the limitations of their Python vis-a-vis full
> conformance, at least with respect to the string manipulation functions.

I'm sorry, but I have to disagree.  As a relative unicode ignoramus,
"UCS-2" and "UCS-4" convey almost no information to me, and the bits I
have heard about them on this list have only confused me.  On the other
hand, I understand that 'narrow' means that fewer bytes are used for
each internal character, meaning that some unicode characters need to
be represented by more than one string element, and thus that slicing
strings containing such characters on a narrow build causes problems.
Now, you could tell me the same information using the terms 'UCS-2'
and 'UCS-4' instead of 'narrow' and 'wide', but to my ear 'narrow'
and 'wide' convey a better gut level feeling for what is going on than
'UCS-2' and 'UCS-4' do.  And it avoids any question of whether or not
Python's internal representation actually conforms to whatever standard
it is that UCS refers to, a point on which there seems to be some
dissension.
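
To make the slicing problem concrete, here is a minimal sketch.  It does
not inspect a real narrow build; it assumes that a narrow build stores
the same 16-bit code units that the 'utf-16-le' codec produces.  The
helper names (utf16_code_units, is_surrogate) are made up for
illustration.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units for text (assumed narrow-build layout)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_surrogate(unit):
    """True if unit is half of a surrogate pair, not a character by itself."""
    return 0xD800 <= unit <= 0xDFFF


bmp_units = utf16_code_units("\u00e9")         # a character inside the BMP
astral_units = utf16_code_units("\U00010400")  # a character outside the BMP

# One BMP character is one element, so slicing cannot split it.
print(len(bmp_units), [hex(u) for u in bmp_units])

# One non-BMP character is two elements (a surrogate pair).
print(len(astral_units), [hex(u) for u in astral_units])

# Taking only the first element leaves a lone surrogate: invalid output.
first = astral_units[:1]
print(is_surrogate(first[0]))
```

Under that assumption, the non-BMP character takes two elements, and
slicing off the first one leaves half of a pair rather than a character.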

Having written the above, I googled for UCS-2 and got the Wikipedia
article on UTF-16/UCS-2 [1].  Scanning that article, I do not see anything
that would clue me in to the problems of slicing strings in a Python
narrow build.  Indeed, reading that article with my limited unicode
knowledge, if I were told Python used UCS-2, I would assume that non-BMP
characters could not be processed by a Python narrow build.

--
R. David Murray                                      www.bitdance.com

[1] http://en.wikipedia.org/wiki/UTF-16/UCS-2