[Python-Dev] UCS2/UCS4 default

Fri Jul 4 01:50:51 CEST 2008

On Thu, Jul 3, 2008 at 4:21 PM, Guido van Rossum <guido at python.org> wrote:
> On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen <rhamph at gmail.com> wrote:
>> On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <tjreedy at udel.edu> wrote:
>>>
>>> The premise is the OP's idea that Python should switch to all UCS4 to create
>>> a more pure ('ideal') situation or the idea that len(s) should count
>>> codepoints (correct term?) for all builds as a matter of purity even though
>>> it would be time-costly on 16-bit builds as a matter of practicality.
>>
>> Wrong term - code units and code points are equivalent in UTF-16 and
>> UTF-32.  What you're looking for is unicode scalar values.
>
> I don't think so. I have in my lap the Unicode 5.0 standard, which on
> page 102, under UTF-16, states (amongst others):
>
> """
> * In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
> represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02>
> corresponds to U+10302.

The literal interpretation is that the U+10302 code point should get
expanded into <D800 DF02>.  It doesn't say if <D800 DF02> is a pair of
code units or a pair of code points.
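
For concreteness, here is a rough Python sketch of that expansion.  The
name to_surrogate_pair is made up for this illustration and is not
taken from the standard or from any library; the arithmetic is the
usual UTF-16 rule of subtracting 0x10000 and splitting the remaining
20 bits in half.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into two 16-bit values.

    Hypothetical helper for illustration only; it accepts values in
    the range 0x10000..0x10FFFF and rejects everything else.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10302)
print("%04X %04X" % (high, low))
```

For U+10302 this yields the same <D800 DF02> as the standard's example.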

> * Because surrogate code points are not Unicode scalar values,
> isolated UTF-16 code units in the range D800[16]..DFFF[16] are
> ill-formed.
> """

So a lone surrogate code unit is not a valid scalar.  It also implies
surrogate code points exist, rather than ruling them out.
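
A small sketch of the same distinction, assuming a Python 3
interpreter where str is a sequence of code points (an assumption
about the reader's setup, not something the standard mandates).  The
variable names are only illustrative.

```python
# Assumes Python 3, where str holds code points rather than code units.
lone = "\ud800"          # a lone high-surrogate code point
pair = "\U00010302"      # the supplementary code point U+10302

# The str type stores the surrogate as an ordinary single code point...
lone_length = len(lone)
lone_value = ord(lone)

# ...but the strict UTF-16 codec refuses to encode it on its own.
try:
    lone.encode("utf-16-be")
    lone_is_encodable = True
except UnicodeEncodeError:
    lone_is_encodable = False

# U+10302, by contrast, encodes to the two code units D800 DF02.
pair_bytes = pair.encode("utf-16-be")
print(lone_length, hex(lone_value), lone_is_encodable, pair_bytes)
```

So the surrogate exists as a code point, yet it is ill-formed as an
isolated UTF-16 code unit.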

> From this I understand they distinguish carefully between code points
> and code units -- D800 is a code unit but not a code point, 10302 is a
> code point but not a (UTF-16) code unit.

I disagree.  They switch between code point and code unit arbitrarily,
never saying that surrogate code points don't exist.

> OTOH outside the context of UTF-8, the surrogates are also referred to
> as "reserved code points" (e.g. in Table 2-3 on page 27, "Types of
> Code Points").

You mean outside the context of UTF-16?  Regarding them as reserved
and lone surrogates as ill-formed code units would have been simpler,
but alas, that is not the case.

Regarding changes in 5.1
(http://www.unicode.org/versions/Unicode5.1.0/), I can find this bit
to give some context:

    Rendering Default Ignorable Code Points

    Update the last paragraph on p. 192 of The Unicode Standard,
    Version 5.0, in Section 5.20, Default Ignorable Code Points, to
    read as follows:

        Replacement Text
        An implementation should ignore all default ignorable code
        points in rendering whenever it does not support those code
        points, whether they are assigned or not.

        In previous versions of the Unicode Standard, surrogate code
        points, private use code points, and some control characters
        were also default ignorable code points. However, to avoid
        security problems, such characters always should be displayed
        with a missing glyph, so that there is a visible indication of
        their presence in the text. In Unicode 5.1 these code points
        are no longer default ignorable code points. For more
        information, see UTR #36, "Unicode Security Considerations."

Clearly they act as if surrogate code points exist.

Finally, we find this in the glossary:

    Unicode Scalar Value. Any Unicode code point except
    high-surrogate and low-surrogate code points. In other words, the
    ranges of integers 0 to D7FF[16] and E000[16] to 10FFFF[16]
    inclusive. (See definition D76 in Section 3.9, Unicode Encoding
    Forms.)

Clearly, each surrogate is a valid code point, regardless of encoding.
A surrogate pair simultaneously represents both one code point (the
scalar value) and two code points (the surrogate code points).  To be
unambiguous you must instead use either code units (always 2 per
surrogate pair in UTF-16) or scalar values (always 1 per surrogate
pair, in any encoding).

The OP wanted it to always be 1, so the correct unambiguous term is
scalar value.
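
To make the two unambiguous counts concrete, here is a minimal Python
sketch.  It assumes a Python 3 str that iterates by code point, and
the function names are invented for this example rather than being
part of any existing API.  The scalar value test follows the glossary
ranges quoted above.

```python
def is_scalar_value(code_point):
    """True for 0..D7FF and E000..10FFFF, i.e. anything but a surrogate."""
    return 0 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def count_scalar_values(text):
    """Count the items of text that are Unicode scalar values."""
    return sum(1 for ch in text if is_scalar_value(ord(ch)))


def count_utf16_code_units(text):
    """Count the 16-bit code units a UTF-16 encoding of text would need."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


# The sequence <004D, 0430, 4E8C, 10302> from the standard's example.
sample = "\u004d\u0430\u4e8c\U00010302"
print(count_scalar_values(sample), count_utf16_code_units(sample))
```

The sample has four scalar values but needs five UTF-16 code units,
which is the difference the OP cares about.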

> I think the best thing we can do is to use "code points" to refer to
> characters and "code units" to the individual 16-bit values in the
> UTF-16 encoding; this seems compatible with usage elsewhere in this
> thread by most folks.
>
> Also see http://unicode.org/glossary/:
>
> """
> Code Point. Any value in the Unicode codespace; that is, the range of
> integers from 0 to 10FFFF[16]. (See definition D10 in Section 3.4,
> Characters and Encoding.)
> .
> .
> .
> Code Unit. The minimal bit combination that can represent a unit of
> encoded text for processing or interchange. The Unicode Standard uses
> 8-bit code units in the UTF-8 encoding form, 16-bit code units in the
> UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding
> form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
> """
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>

-- 
Adam Olsen, aka Rhamphoryncus