[Python-Dev] New Py_UNICODE doc

Sat May 7 02:01:50 CEST 2005

On May 6, 2005, at 7:43 PM, Martin v. Löwis wrote:

> Nicholas Bastin wrote:
>> If this is the case, then we're clearly misleading users.  If the
>> configure script says UCS-2, then as a user I would assume that
>> surrogate pairs would *not* be encoded, because I chose UCS-2, and it
>> doesn't support that.
>
> What do you mean by that? That the interpreter crashes if you try
> to store a low surrogate into a Py_UNICODE?

What I mean is pretty clear.  UCS-2 does *NOT* support surrogate pairs.
If it did, it would be called UTF-16.  If Python really supported
UCS-2, then surrogate pairs from UTF-16 inputs would either get turned
into two garbage characters, or the "I couldn't transcode this" UCS-2
code point (I don't remember which one that is off the top of my head).
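
To make the distinction concrete, here is a minimal sketch in Python.
The helper names are hypothetical, and U+FFFD (REPLACEMENT CHARACTER)
is assumed as the substitute a strict UCS-2 reader would use; this
illustrates the argument and is not what the interpreter does.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def strict_ucs2_decode(units, replacement=0xFFFD):
    """Hypothetical strict UCS-2 reader.

    UCS-2 has no surrogate pairs, so every code unit in the surrogate
    range D800..DFFF is replaced (assumption: with U+FFFD) and the
    original supplementary character is lost.
    """
    return [replacement if 0xD800 <= u <= 0xDFFF else u for u in units]

# U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
units = utf16_code_units("A\U00010400")
print([hex(u) for u in units])
print([hex(u) for u in strict_ucs2_decode(units)])
```

A build that keeps both surrogate code units intact, as in the first
printed list, is storing UTF-16, whatever the configure option calls it.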

>> I would assume that any UTF-16 string I would read would be
>> transcoded into the internal type (UCS-2), and information would be
>> lost.  If this is not the case, then what does the configure option
>> mean?
>
> It tells you whether you have the two-octet form of the Universal
> Character Set, or the four-octet form.

It would, if that were the case, but it's not.  Setting UCS-2 in the 
configure script really means UTF-16, and as such, the documentation 
should reflect that.

--
Nick