[Python-Dev] New Py_UNICODE doc
Nicholas Bastin
nbastin at opnet.com
Sat May 7 06:04:47 CEST 2005
On May 6, 2005, at 8:25 PM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> Yes. Not only in my mind, but in the Python source code. If
>> Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4),
>> otherwise the encoding is UTF-16 (*not* UCS-2).
>
> I see. Some people equate "encoding" with "encoding scheme";
> neither UTF-32 nor UTF-16 is an encoding scheme. You were
That's not true. UTF-16 and UTF-32 are each both a CES and a CEF
(although this is not true of UTF-16LE and UTF-16BE, which are only
encoding schemes). UTF-32 is a fixed-width encoding form within a code
space of (0..10FFFF), and UTF-16 is a variable-width encoding form which
uses a mix of one or two 16-bit code units, each in the range
(0..FFFF). However, you are perhaps right to point out that people
should be more explicit as to which they are referring to. UCS-2,
however, is only a CEF, and thus I thought it was obvious that I was
referring to UTF-16 as a CEF. I would refer anyone who is confused on
this point to Unicode Technical Report #17 on the Character Encoding
Model, which is much clearer than trying to piece together the relevant
parts out of the entire standard.
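To make the fixed-width versus variable-width distinction concrete,
here is a minimal Python sketch. The helper name code_units is made up
for illustration, and the little-endian codecs are used only so that no
byte order mark is counted:

```python
def code_units(text, encoding, unit_bytes):
    """Count the code units needed to encode text (hypothetical helper)."""
    return len(text.encode(encoding)) // unit_bytes


BMP_CHAR = "\u20ac"                 # EURO SIGN, inside the BMP
SUPPLEMENTARY_CHAR = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

# UTF-16: 16-bit code units, one for a BMP character, two otherwise.
utf16_counts = (
    code_units(BMP_CHAR, "utf-16-le", 2),
    code_units(SUPPLEMENTARY_CHAR, "utf-16-le", 2),
)

# UTF-32: 32-bit code units, always exactly one per character.
utf32_counts = (
    code_units(BMP_CHAR, "utf-32-le", 4),
    code_units(SUPPLEMENTARY_CHAR, "utf-32-le", 4),
)

print(utf16_counts)  # (1, 2)
print(utf32_counts)  # (1, 1)
```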
In any event, Python's use of the term UCS-2 is incorrect. I quote
from the TR:
"The UCS-2 encoding form, which is associated with ISO/IEC 10646 and
can only express characters in the BMP, is a fixed-width encoding
form."
immediately followed by:
"In contrast, UTF-16 uses either one or two code units and is able to
cover the entire code space of Unicode."
If Python is capable of representing the entire code space of Unicode
when you choose --enable-unicode=ucs2, then that is a bug. It either
should not be called UCS-2, or the interpreter should be bound by the
limitations of the UCS-2 CEF.
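As a rough illustration of what being "bound by the limitations" would
mean, the sketch below contrasts the two encoding forms. Both function
names are hypothetical, and this is not how the interpreter itself is
written; it only follows the standard surrogate-pair arithmetic:

```python
def utf16_code_units(code_point):
    """Encode one code point as a list of UTF-16 code units (sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        return [code_point]              # BMP: a single code unit
    offset = code_point - 0x10000        # 20 bits to split across two units
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


def ucs2_code_units(code_point):
    """Encode one code point as UCS-2, which stops at the BMP (sketch)."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("UCS-2 can only express characters in the BMP")
    return [code_point]


print([hex(u) for u in utf16_code_units(0x1D11E)])  # ['0xd834', '0xdd1e']
```

Under these assumptions, ucs2_code_units(0x1D11E) raises ValueError,
while utf16_code_units(0x1D11E) returns a surrogate pair.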
>> What I mean by 'variable' is that you can't make any assumption as to
>> what the size will be in any given Python when you're writing (and
>> building) an extension module. This breaks binary compatibility of
>> extension modules on the same platform and same version of Python
>> across interpreters which may have been built with different configure
>> options.
>
> True. The breakage will be quite obvious, in most cases: the module
> fails to load because not only sizeof(Py_UNICODE) changes, but also
> the names of all symbols change.
Yes, but the important question here is why we would want that. Why
doesn't Python just have *one* internal representation of a Unicode
character? Having more than one possible definition just creates
problems and provides no value.
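As things stand, the most code can do is detect which variant it is
running under. A minimal sketch, assuming an interpreter where
sys.maxunicode is 0xFFFF on a 2-byte build and 0x10FFFF otherwise (the
function name is made up, and treating the wide case as exactly 4 bytes
is an assumption):

```python
import sys


def unicode_build_width(maxunicode=sys.maxunicode):
    """Guess the Py_UNICODE width in bytes from sys.maxunicode (sketch)."""
    # Assumption: 0xFFFF signals a 2-byte build, anything else a 4-byte one.
    return 2 if maxunicode == 0xFFFF else 4


print(unicode_build_width())
```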
--
Nick