[Python-Dev] New Py_UNICODE doc

Fri May 6 20:49:00 CEST 2005

On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:

> You've got that wrong: Python lets you choose UCS-4 -
> UCS-2 is the default.
>
> Note that Python's Unicode codecs UTF-8 and UTF-16
> are surrogate aware and thus support non-BMP code points
> regardless of the build type: A UCS2-build of Python will
> store a non-BMP code point as UTF-16 surrogate pair in the
> Py_UNICODE buffer while a UCS4 build will store it as a
> single value. Decoding is surrogate aware too, so a UTF-16
> surrogate pair in a UCS2 build will get treated as single
> Unicode code point.

If this is the case, then we're clearly misleading users. If the
configure script says UCS-2, then as a user I would assume that
surrogate pairs would *not* be encoded, because I chose UCS-2, and
UCS-2 doesn't support them. I would assume that any UTF-16 string I
read would be transcoded into the internal type (UCS-2), and that
information would be lost. If this is not the case, then what does the
configure option mean?
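
To make the behaviour described in the quoted text concrete, here is a rough sketch of what a 16-bit Py_UNICODE buffer would hold for a non-BMP code point. It is an illustration only, not the interpreter's code: it assumes a present-day Python 3 (which no longer has the UCS-2/UCS-4 build choice) and simulates the buffer by looking at UTF-16 code units. The helper names utf16_code_units, to_surrogate_pair and from_surrogate_pair are made up for this example.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units that UTF-16 uses for text.

    This stands in for the contents of a 16-bit Py_UNICODE buffer
    (an assumption made for illustration, not the real buffer).
    """
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def to_surrogate_pair(cp):
    """Split a non-BMP code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = 0x1D11E  # MUSICAL SYMBOL G CLEF, outside the BMP
units = utf16_code_units(chr(cp))
print([hex(u) for u in units])          # two 16-bit units, not one
print(hex(from_surrogate_pair(*units)))  # the pair maps back to cp
```

If the quoted description is right, nothing is lost in the round trip: the two 16-bit units recombine into the original code point. What changes is that the buffer holds two units for one code point, which is UTF-16 behaviour rather than what I would expect from the name UCS-2.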

--
Nick