[Python-Dev] New Py_UNICODE doc
M.-A. Lemburg
mal at egenix.com
Sat May 7 20:41:31 CEST 2005
Shane Hathaway wrote:
> Martin v. Löwis wrote:
>
>>Shane Hathaway wrote:
>>
>>
>>>I agree that UCS4 is needed. There is a balancing act here; UTF-16 is
>>>widely used and takes less space, while UCS4 is easier to treat as an
>>>array of characters. Maybe we can have both: unicode objects start with
>>>an internal representation in UTF-16, but get promoted automatically to
>>>UCS4 when you index or slice them. The difference will not be visible
>>>to Python code. A compile-time switch will not be necessary. What do
>>>you think?
>>
>>
>>This breaks backwards compatibility with existing extension modules.
>>Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and
>>can use that to directly access the characters.
>
>
> Py_UNICODE would always be 32 bits wide. PyUnicode_AsUnicode would
> cause the unicode object to be promoted automatically. Extensions that
> break as a result are technically broken already, aren't they? They're
> not supposed to depend on the size of Py_UNICODE.
-1.
You are free to compile Python with --enable-unicode=ucs4
if you prefer this setting.
I don't see any reason why we should force users to invest 4 bytes
of storage for each Unicode code point - 2 bytes work just fine
and can represent all Unicode characters that are currently
defined (using surrogates if necessary). As more and more
Unicode objects are used in a process, choosing UCS2 vs. UCS4
does make a huge difference in terms of used memory.
All this talk about UTF-16 vs. UCS-2 is not very useful
and strikes me a purely academic.
The reference to possibly breakage by slicing a Unicode and
breaking a surrogate pair is valid, the idea of UCS-4 being
less prone to breakage is a myth:
Unicode has many code points that are meant only for composition
and don't have any standalone meaning, e.g. a combining acute
accent (U+0301), yet they are perfectly valid code points -
regardless of UCS-2 or UCS-4. It is easily possible to break
such a combining sequence using slicing, so the most
often presented argument for using UCS-4 instead of UCS-2
(+ surrogates) is rather weak if seen by daylight.
Some may now say that combining sequences are not used
all that often. However, they play a central role in Unicode
normalization (http://www.unicode.org/reports/tr15/),
which is needed whenever you want to semantically
compare Unicode objects and are
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, May 07 2005)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
More information about the Python-Dev
mailing list