[Python-Dev] UCS2/UCS4 default

M.-A. Lemburg mal at egenix.com
Thu Jul 3 21:16:03 CEST 2008


On 2008-07-03 19:35, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 19:21], Adam Olsen (rhamph at gmail.com) wrote:
>> On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>>> Please remember that lone surrogate pair code points are perfectly
>>> valid Unicode code points, nevertheless. Just as a lone combining
>>> code point is valid on its own.
>> That is a big part of these problems.  For all practical purposes, a
>> surrogate is like a UTF-8 code unit, and must be handled the same way,
>> so why the heck do they confuse everybody by saying "oh, it's a code
>> point too!"?
> 
> Because surrogate code points are not Unicode scalar values, isolated UTF-16
> code units in the range 0xd800-0xdfff are ill-formed. (D91 from Unicode
> 5.0/5.1, section 3.9)

True. They are not valid UTF-16 code units, but a code unit is
just a storage byte representation of a Unicode transformation...

"""
Code Unit. The minimal bit combination that can represent a unit of encoded
text for processing or interchange. The Unicode Standard uses 8-bit code units
in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and
32-bit code units in the UTF-32 encoding form. (See definition D77 in
Section 3.9, Unicode Encoding Forms.)
"""

That's not the same thing as a code point which is an assignment
of a slot in the Unicode character set...

"""
Code Point. Any value in the Unicode codespace; that is, the range of integers
from 0 to 10FFFF (hexadecimal). (See definition D10 in Section 3.4, Characters
and Encoding.)
"""

Reference: http://www.unicode.org/glossary/
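
To make the distinction concrete, here is a rough sketch that counts code
units per encoding form for a single code point. It assumes Python 3.3 or
later, where len() of a str counts code points regardless of build; the
helper name code_units is made up for this illustration.

```python
# Assumption: Python 3.3+, where a str is a sequence of code points.
# code_units() is a hypothetical helper, not part of any library.

# Size in bytes of one code unit for each encoding form.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(text, encoding):
    """Return how many code units `text` occupies in the given encoding form."""
    return len(text.encode(encoding)) // UNIT_SIZE[encoding]


ch = "\U00010400"  # a single code point outside the BMP

code_point_count = len(ch)                 # number of code points
utf8_units = code_units(ch, "utf-8")       # 8-bit code units
utf16_units = code_units(ch, "utf-16-le")  # 16-bit code units (a surrogate pair)
utf32_units = code_units(ch, "utf-32-le")  # 32-bit code units
```

The same single code point takes a different number of code units in each
encoding form, which is why the two terms should not be used interchangeably.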

Also see Chapter 3.4 (http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G2212):

"""
Surrogate code points and noncharacters are considered assigned code points,
but not assigned characters.
"""
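
As an illustration, current Python 3 appears to reflect both views: a lone
surrogate can be held in a str as a code point, but the strict UTF-16 codec
refuses to encode it. This is a sketch assuming Python 3.4 or later (for the
"surrogatepass" error handler on UTF-16); the helper names is_surrogate and
encodes_strictly are invented for the example.

```python
import unicodedata

# Assumption: Python 3.4+. Both helpers below are hypothetical names
# introduced only for this illustration.


def is_surrogate(ch):
    """Return True if the single code point `ch` lies in D800..DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def encodes_strictly(text, encoding="utf-16-le"):
    """Return True if `text` encodes with the default (strict) error handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


lone = "\ud800"  # a lone high surrogate, stored as one code point

category = unicodedata.category(lone)        # general category of the code point
strict_ok = encodes_strictly(lone)           # strict UTF-16 rejects it
raw = lone.encode("utf-16-le", "surrogatepass")  # explicit opt-in passes it through
```

So the code point exists and has a general category, while the encoding form
treats the isolated code unit as ill-formed unless told otherwise.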

-- 
Marc-Andre Lemburg
eGenix.com


