[Python-Dev] New Py_UNICODE doc

Nicholas Bastin nbastin at opnet.com
Fri May 6 22:21:53 CEST 2005


On May 6, 2005, at 3:42 PM, James Y Knight wrote:

> On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote:
>> If this is the case, then we're clearly misleading users.  If the
>> configure script says UCS-2, then as a user I would assume that
>> surrogate pairs would *not* be encoded, because I chose UCS-2, and it
>> doesn't support that.  I would assume that any UTF-16 string I would
>> read would be transcoded into the internal type (UCS-2), and
>> information would be lost.  If this is not the case, then what does
>> the configure option mean?
>
> It means all the string operations treat strings as if they were 
> UCS-2, but that in actuality, they are UTF-16. Same as the case in the 
> Windows APIs and Java. That is, all string operations are essentially 
> broken, because they're operating on encoded bytes, not characters, 
> but claim to be operating on characters.

Well, this is a completely separate issue/problem. The internal 
representation is UTF-16, and should be stated as such.  If the 
built-in methods actually don't work with surrogate pairs, then that 
should be fixed.
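
To make the distinction concrete, here is a minimal sketch of how counting
UTF-16 code units differs from counting characters once a surrogate pair is
involved. It is written for a modern Python 3 interpreter, not the
interpreter under discussion in this thread, and it only simulates the
UTF-16 storage by encoding explicitly. The helper names utf16_code_units
and count_code_points are hypothetical, made up for this illustration.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text when stored as UTF-16."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def count_code_points(units):
    """Count characters, treating a high+low surrogate pair as one."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A well-formed pair consumes two code units; anything else one.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


# U+10000 is the first character outside the Basic Multilingual Plane,
# so UTF-16 has to encode it as a surrogate pair.
sample = "a\U00010000b"
units = utf16_code_units(sample)

print([hex(u) for u in units])   # the stored 16-bit values
print(len(units))                # what a UCS-2-style length would report
print(count_code_points(units))  # what a surrogate-aware length would report
```

A length or indexing operation that works on the raw code units sees four
items here, while a surrogate-aware one sees three characters. That gap is
the behaviour I am saying should be fixed in the built-in methods.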

--
Nick


