[Python-Dev] New Py_UNICODE doc

Nicholas Bastin nbastin at opnet.com
Sat May 7 06:04:47 CEST 2005


On May 6, 2005, at 8:25 PM, Martin v. Löwis wrote:

> Nicholas Bastin wrote:
>> Yes.  Not only in my mind, but in the Python source code.  If
>> Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4),
>> otherwise the encoding is UTF-16 (*not* UCS-2).
>
> I see. Some people equate "encoding" with "encoding scheme";
> neither UTF-32 nor UTF-16 is an encoding scheme. You were

That's not true.  UTF-16 and UTF-32 are each both a CES and a CEF
(although this is not true of UTF-16LE and UTF-16BE).  UTF-32 is a
fixed-width encoding form over the code space (0..10FFFF), and UTF-16
is a variable-width encoding form which uses one or two 16-bit code
units to cover that same code space.  However, you are perhaps right
to point out that people should be more explicit about which they are
referring to.  UCS-2, however, is only a CEF, and thus I thought it
was obvious that I was referring to UTF-16 as a CEF.  I would point
anyone who is confused on this point to Unicode Technical Report #17
on the Character Encoding Model, which is much clearer than trying to
piece the relevant parts together out of the entire standard.
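To make the CEF distinction concrete, here is a quick sketch (the
specific code point below is just an illustration, and it assumes a
Python 2.x interpreter with the standard 'utf-16-be' codec):

    # U+0041 is inside the BMP, U+10400 is outside it
    bmp    = u'A'
    astral = u'\U00010400'

    # UTF-16 CEF: one 16-bit code unit for a BMP character...
    print len(bmp.encode('utf-16-be')) // 2      # -> 1
    # ...but two code units (a surrogate pair) above U+FFFF
    print len(astral.encode('utf-16-be')) // 2   # -> 2
    # A UTF-32 CEF would use exactly one 32-bit code unit in both
    # cases, which is what makes it fixed-width.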

In any event, Python's use of the term UCS-2 is incorrect.  I quote 
from the TR:

"The UCS-2 encoding form, which is associated with ISO/IEC 10646 and 
can only express characters in the  BMP, is a fixed-width encoding 
form."

immediately followed by:

"In contrast, UTF-16 uses either one or two code  units and is able to 
cover the entire code space of Unicode."

If Python is capable of representing the entire code space of Unicode
when you choose --unicode=ucs2, then that is a bug: either it should
not be called UCS-2, or the interpreter should be bound by the
limitations of the UCS-2 CEF.
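Anyone who wants to check this against their own interpreter can use
something along these lines (a minimal sketch; the code point is again
just an example):

    import sys
    print hex(sys.maxunicode)   # 0xffff on --unicode=ucs2, 0x10ffff on ucs4

    s = u'\U00010400'
    print len(s)                # 2 on a narrow build, 1 on a wide build
    print [hex(ord(c)) for c in s]
    # narrow build: ['0xd801', '0xdc00'] -- a UTF-16 surrogate pair

If a narrow build really were UCS-2, the non-BMP literal simply could
not be represented at all; storing it as a surrogate pair is UTF-16
behaviour.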


>> What I mean by 'variable' is that you can't make any assumption as to
>> what the size will be in any given python when you're writing (and
>> building) an extension module.  This breaks binary compatibility of
>> extensions modules on the same platform and same version of python
>> across interpreters which may have been built with different configure
>> options.
>
> True. The breakage will be quite obvious, in most cases: the module
> fails to load because not only sizeof(Py_UNICODE) changes, but also
> the names of all symbols change.

Yes, but the important question here is why would we want that?  Why 
doesn't Python just have *one* internal representation of a Unicode 
character?  Having more than one possible definition just creates 
problems, and provides no value.
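For what it's worth, the width a given interpreter was built with (and
therefore which set of renamed symbols -- PyUnicodeUCS2_* or
PyUnicodeUCS4_* -- an extension module ends up referencing) can be seen
from Python itself; a rough sketch:

    import sys

    if sys.maxunicode == 0xFFFF:
        print "narrow build: sizeof(Py_UNICODE) == 2 (PyUnicodeUCS2_* symbols)"
    else:
        print "wide build: sizeof(Py_UNICODE) == 4 (PyUnicodeUCS4_* symbols)"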

--
Nick


