[Python-Dev] UCS2/UCS4 default

M.-A. Lemburg mal at egenix.com
Thu Jul 3 21:07:04 CEST 2008


On 2008-07-03 19:21, Adam Olsen wrote:
> On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>> On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
>>> -On [20080703 15:00], M.-A. Lemburg (mal at egenix.com) wrote:
>>>> Unicode is full of combining code points - if you break such a sequence,
>>>> the output will be just as wrong, regardless of UCS2 vs. UCS4.
>>> In my opinion you are confusing two related, but very separate things
>>> here.
>>> Combining characters have nothing to do with breaking up the encoding of a
>>> single codepoint. Sure enough, if you arbitrarily slice up sequences that
>>> consist of combining characters then your result is indeed odd looking.
>>>
>>> I never said that nor is that the point I am making.
>> Please remember that lone surrogate code points are perfectly
>> valid Unicode code points, nevertheless. Just as a lone combining
>> code point is valid on its own.
> 
> That is a big part of these problems.  For all practical purposes, a
> surrogate is like a UTF-8 code unit, and must be handled the same way,
> so why the heck do they confuse everybody by saying "oh, it's a code
> point too!"?

You have to take that up with the Unicode consortium :-)

It would have been better not to add surrogates to the standard
at all. To be fair, I don't think that anybody seriously assumed
at the time that more than 16 bits would be needed.

In practice, you do need to be able to build Unicode strings
that contain half a surrogate pair (i.e. a single code point) or
a combining code point without its anchor code point, so trying
to be smart about detecting surrogates is going to cause more
confusion than it does good, e.g.

 >>> x1 = u'\udbc0'
 >>> x2 = u'\udc00'
 >>> x1
u'\udbc0'
 >>> x2
u'\udc00'
 >>> len(x1)
1
 >>> len(x2)
1

Having len(x1+x2) == 1 wouldn't be right and would break all sorts
of assumptions you normally make about string concatenation,
which is why len(x1+x2) gives 2 in both UCS2 and UCS4 builds.
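
To make the concatenation point concrete, here is a small sketch. It
assumes Python 3.3 or later (not the Python 2 interpreter used in the
session above), where there is only one kind of build; the variable
names lengths, joined and merged are just illustrative.

```python
# Sketch for Python 3.3+ (an assumption; the session above is Python 2).
x1 = '\udbc0'   # lone high surrogate
x2 = '\udc00'   # lone low surrogate
joined = x1 + x2

# Concatenation stays additive: two code points in, two code points out.
lengths = (len(x1), len(x2), len(joined))

# The two lone surrogates are not silently merged into U+100000.
merged = (joined == '\U00100000')

print(lengths, merged)   # (1, 1, 2) False
```

The string simply keeps both code points, so len(a + b) == len(a) + len(b)
continues to hold.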

The fact that u'\U00100000' can map to a length 1 Unicode string
in UCS4 builds and a length 2 string in UCS2 builds is merely
due to the fact that the unicode-escape codec (which converts
the escaped string literal to a Unicode object) does know about
surrogates and uses them to avoid exceptions.
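
As a rough illustration of what the codec has to do on a UCS2 build,
the sketch below splits a non-BMP code point into its surrogate pair
and counts 16-bit code units. It assumes Python 3.4 or later (for the
'surrogatepass' error handler on the UTF-16 codecs); the helper names
to_surrogate_pair and utf16_units are made up for this example and are
not part of the codec's API.

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def utf16_units(s):
    """Count 16-bit code units, which is what len() counted on UCS2 builds."""
    # 'surrogatepass' lets lone surrogates through the encoder unchanged.
    return len(s.encode('utf-16-le', 'surrogatepass')) // 2


ch = '\U00100000'
high, low = to_surrogate_pair(ord(ch))

print(hex(high), hex(low))        # 0xdbc0 0xdc00
print(len(ch), utf16_units(ch))   # 1 2
```

So the same literal is one code point, but two 16-bit units, which is
where the length difference between the two builds comes from.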

Programmers need to be aware of this fact, that's all...
just like they need to be aware of the differences between
integer and float division, the different behavior of classic
and new-style classes, etc.

-- 
Marc-Andre Lemburg
eGenix.com


