[I18n-sig] Re: [Python-Dev] PEP 261, Rev 1.3 - Support for "wide" Unicode characters

Paul Prescod paulp@ActiveState.com
Mon, 02 Jul 2001 07:25:55 -0700


"M.-A. Lemburg" wrote:
> 
>...
> >     Character
> >
> >         Used by itself, means the addressable units of a Python
> >         Unicode string.
> 
> Please add: also known as "code unit".

I'm not entirely comfortable with that. As you yourself pointed out, the
same Python Unicode object can be interpreted as either a series of
single-width code points *or* as a UTF-16 string where the characters
are code units. You could also interpret it as a BASE64'd region or an
XML document... It all depends on how you look at it.
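
To illustrate the two readings, here is a minimal sketch. It assumes a
modern Python 3 interpreter, where a str is always a sequence of code
points, so the UTF-16 view has to be produced by encoding. The helper
name utf16_code_units is hypothetical and is not part of the PEP.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    # Encode without a byte order mark, then unpack 16-bit little-endian units.
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

s = "a\U00010000"

# Reading 1: a series of code points.
code_points = [ord(ch) for ch in s]

# Reading 2: a UTF-16 string whose elements are code units.
code_units = utf16_code_units(s)

print([hex(n) for n in code_points])
print([hex(n) for n in code_units])
```

The same text yields two code points under the first reading and three
code units under the second.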

> ....
> >     Surrogate pair
> >
> >         Two physical characters that represent a single logical
> 
> Eeek... two code units (or have you ever seen a physical character
> walking around ;-)

No, that's sort of my point. The user can decide to adopt the convention
of looking at the two characters as code units or they can ignore that
interpretation and look at them as two code points. It's all relative,
man. Dig it? That's why I use the word "convention" below:

> >         character. Part of a convention for representing 32-bit
> >         code points in terms of two 16-bit code points.

"Surrogates are all in your head. Python doesn't know or care about
them!"

I'll change this to:

    Surrogate pair

        Two Python Unicode characters that represent a single logical
        Unicode code point. Part of a convention for representing
        32-bit code points in terms of two 16-bit code points. Python
        has limited support for reading, writing and constructing
        strings that use this convention (described below). Otherwise
        Python ignores the convention.
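
The arithmetic behind the convention can be sketched as follows. This
assumes the standard UTF-16 scheme, which covers code points from
U+10000 through U+10FFFF only. The function names are hypothetical and
are not taken from the PEP.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not representable as a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```

Nothing in the two 16-bit values forces this reading. A program that
never calls from_surrogate_pair simply sees two separate values.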

> No need to pass this information to the codec: simply write
> a new one and give it a clear name, e.g. "ucs-2" will generate
> errors while "utf-16-le" converts them to surrogates.

That's a good point, but what if I want a UTF-8 codec that doesn't
generate surrogates? Or even a UCS4 one?
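
The "separate codec with a clear name" idea can be sketched like this.
It assumes a modern Python 3 interpreter, and encode_ucs2_strict is a
hypothetical function standing in for a strict "ucs-2" codec. It is not
a codec that ships with Python.

```python
def encode_ucs2_strict(text):
    """Encode text as 16-bit little-endian units, refusing surrogate pairs."""
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            # Hypothetical "ucs-2" behaviour: report an error, never emit a pair.
            raise UnicodeEncodeError(
                "ucs-2", text, index, index + 1,
                "code point does not fit in one 16-bit unit")
    return text.encode("utf-16-le")

wide = "\U00010000"

# The existing "utf-16-le" codec converts the code point to a surrogate pair.
print(wide.encode("utf-16-le"))

# The strict variant raises instead.
try:
    encode_ucs2_strict(wide)
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```

Whether the same switch is wanted for UTF-8 or UCS4 is the open question
above. This sketch covers only the 16-bit case.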

> Plus perhaps the Mark Davis paper at:
> 
> http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/

Okay.

> > Copyright
> >
> >     This document has been placed in the public domain.
> 
> Good work, Paul !

Thanks for your help. You did help me to clarify many things even though
I argued with you as I was doing it. 
-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook