[Python-Dev] 2.2 Unicode questions

Andrew Kuchling akuchlin@mems-exchange.org
Thu, 19 Jul 2001 10:57:37 -0400


On Thu, Jul 19, 2001 at 10:15:49AM -0400, Simon Cozens wrote:
>If by UCS-2 you actually mean UTF-16, then using surrogates is the
>right approach. :)

<head explodes> If a narrow Python uses UTF-16 (and it does seem to,
according to PEP 100), then the configure script's
--enable-unicode=ucs2 option should be changed, because it's
misleading.

Here's another pass:

%======================================================================
\section{Unicode Changes}

Python's Unicode support has been enhanced a bit in 2.2.  Unicode
strings are usually stored as UTF-16, a sequence of 16-bit unsigned integers.
Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
integers, as its internal encoding by supplying
\longprogramopt{enable-unicode=ucs4} to the configure script.  When
built to use UCS-4 (a ``wide Python''), the interpreter can natively
handle Unicode characters from U+0000 to U+10FFFF.  The range of
legal values for the \function{unichr()} function has been expanded;
it used to only accept values up to 65535, but in 2.2 will accept
values from 0 to 0x10FFFF.  In a ``narrow Python'', an interpreter
compiled to use UTF-16, passing a value greater than 65535 makes
\function{unichr()} return a string of length 2:

\begin{verbatim}
>>> s = unichr(65536)
>>> s
u'\ud800\udc00'
>>> len(s)
2
\end{verbatim}

This possibly-confusing behaviour, breaking the intuitive invariant
that \function{chr()} and \function{unichr()} always return strings of
length 1, may be changed later in 2.2, depending on public reaction.
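
The two code units in the example come from the standard UTF-16
surrogate arithmetic, which can be sketched in a few lines.  This is
only an illustration: \function{surrogate_pair()} is a hypothetical
helper name, not something in the interpreter, and it deliberately
avoids \function{unichr()} so that it behaves the same on narrow and
wide builds.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 (high, low) code units.

    Hypothetical helper for illustration; not part of Python itself.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside U+10000..U+10FFFF")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


# 65536 is U+10000, the first code point that needs a pair.
print([hex(unit) for unit in surrogate_pair(65536)])
# prints ['0xd800', '0xdc00']
```

The result matches the \code{u'\textbackslash ud800\textbackslash udc00'}
shown above, and the largest legal code point, 0x10FFFF, should map to
the pair 0xDBFF, 0xDFFF.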

All this is the province of the still-unimplemented PEP 261, ``Support
for `wide' Unicode characters''; consult it for further details, and
please offer comments and suggestions on the proposal it describes.

--amk