[I18n-sig] Python Support for "Wide" Unicode characters

Paul Prescod paulp@ActiveState.com
Wed, 27 Jun 2001 17:20:39 -0700


Guido van Rossum wrote:
> 
>...
> 
> In the style of the PEP process, there should probably be some
> discussion of the alternatives that were proposed, considered and
> rejected, in particular (1) place the burden of surrogate handling on
> the application, possibly with limited library support, and (2) try to
> mend the unicode string object so that it is always indexed in
> characters, even if it contains surrogates.

Okay.

> 
> I think PEPs should get wider distribution than a SIG.  Maybe after
> the first round of comments on i18n-sig is over you can post it to
> c.l.py(.a) and python-dev?

I agree. That's what I intended. I thought it would be confusing if I
posted to the other areas before I had all of my facts right.

> I would express this as 17 * 2**16 - 1, to emphasize the fact that
> there are 17 planes of 2**16 characters each.

Done.
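
For anyone checking the arithmetic, here is a minimal sketch of that
expression. The names PLANES, PLANE_SIZE and TOPCHAR are just labels for
this illustration (TOPCHAR is the name used in the draft).

```python
# 17 planes of 2**16 code points each, numbered from zero,
# so the largest ordinal is one less than the total count.
PLANES = 17
PLANE_SIZE = 2**16

TOPCHAR = PLANES * PLANE_SIZE - 1

print(hex(TOPCHAR))
```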

> >     * BUT on narrow builds of Python, the string will actually be
> >       composed of two characters called a "surrogate pair".
> 
> Can't call these characters.  Maybe use "characters" in quotes, maybe
> use code points or items.

I think they ARE characters in the Python, not Unicode sense. So I said:

    * BUT on narrow builds of Python, the string will actually be
      composed of two characters (in the Python, not Unicode sense)
      called a "surrogate pair". These two Python characters are
      logically one Unicode character.
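
To make the "two Python characters, one Unicode character" point
concrete, here is a rough sketch of how a code point above 2**16-1 splits
into a surrogate pair under the standard UTF-16 rules. The helper name
to_surrogate_pair is hypothetical and is not part of the proposal.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

On a narrow build the string would hold those two 16-bit items in that
order.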

> >     * There is an integer value in the sys module that describes the
> >       largest ordinal for a Unicode character on the current
> >       interpreter. sys.maxunicode is 2**16-1 on narrow builds of
> >       Python. On wide builds it could be either TOPCHAR
> >       or 2**32-1. That's an open question.
> 
> Given its name I think it should be TOPCHAR, even if unichr() accepts
> larger values.

Maybe there is a virtue in having a way to both ask for the largest
*legal* Unicode character and the largest character that will fit into a
Python character on the platform. I mean in theory the maximum Unicode
character is constant but that doesn't mean I want to declare it in my
programs explicitly.

unicodedata.maxchar => always TOPCHAR
sys.maxunicode => some power of 2 - 1

I'm not entirely happy that we call a thing "sys.maxunicode" and then
tell people how to generate larger values. How about sys.maxcodeunit?
(Or we could remove the whole surrogate building stuff :) )

Do you want to rule on this or call it an open issue?
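
Whichever name wins, a program that cares could tell the cases apart
roughly like this. It is only a sketch: describe_build is a hypothetical
helper, and it assumes sys.maxunicode on a wide build ends up as either
TOPCHAR or 2**32-1, which is exactly the open question.

```python
import sys

TOPCHAR = 17 * 2**16 - 1   # largest legal Unicode ordinal


def describe_build(maxunicode=sys.maxunicode):
    """Classify an interpreter from its sys.maxunicode value.

    Hypothetical helper; the thresholds reflect the two candidate
    values discussed for wide builds.
    """
    if maxunicode == 2**16 - 1:
        return "narrow"
    if maxunicode >= TOPCHAR:
        return "wide"
    return "unknown"


print(describe_build())
```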

> >     * Note that ord() can in some cases return ordinals
> >       higher than sys.maxunicode because it accepts surrogate pairs
> >       on narrow Python builds.

And if sys.maxunicode is TOPCHAR then you can also get ords greater than
sys.maxunicode just by using unichr on values larger than
sys.maxunicode.
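
The surrogate-pair case works out as in the sketch below, which uses the
standard UTF-16 combining rule. The name wide_ord is hypothetical; it
stands in for what the extended ord() would compute on a narrow build.

```python
NARROW_MAXUNICODE = 2**16 - 1


def wide_ord(high, low):
    """Combine a (high, low) surrogate pair into one ordinal.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first item is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second item is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


ordinal = wide_ord(0xD834, 0xDD1E)
print(hex(ordinal), ordinal > NARROW_MAXUNICODE)
```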

> >     * codecs will be upgraded to support "wide characters". On narrow
> >       Python builds, the codecs will generate surrogate pairs, on
> >       wide Python builds they will generate a single character.
> 
> Maybe add a note that this is the main thing that hasn't been fully
> implemented yet; everything else except the extended ord() is
> implemented now, AFAIK.

Done.
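
As an illustration of the difference between the two storage forms, the
sketch below counts code units for one supplementary character using the
"utf-16-be" and "utf-32-be" codecs. It assumes a current Python 3
interpreter where both codecs exist; it says nothing about how the
codecs in the proposal will be implemented.

```python
ch = "\U0001D11E"   # one character outside the first plane

utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

units16 = len(utf16) // 2   # 16-bit code units, as on a narrow build
units32 = len(utf32) // 4   # 32-bit code units, as on a wide build

# The two 16-bit units are the surrogate pair.
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

print(units16, units32, hex(high), hex(low))
```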

> >     * new codecs will be written for 4-byte Unicode and older codecs
> >       will be updated to recognize surrogates and map them to wide
>                                      ^^^^^^^^^^
> Make that "surrogate pairs"

Done.

> >         USE_UCS4_STORAGE
> 
> USE_UCS4_STORAGE is no more.  Long live Py_UNICODE_SIZE (2 or 4).

Okay.

> >     There are new configure options:
> >
> >         --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
> >                         wchar_t if it fits
> >         --enable-unicode=ucs4 configures a wide Py_UNICODE likewise
> >         --enable-unicode      configures Py_UNICODE to wchar_t if available,
> >                               and to UCS-4 if not; this is the default
> 
> Not any more; the default is ucs2 now.

So there is no longer a way to get the heuristic of "wchar_t if
available, UCS-4 if not". I'm not complaining, just checking. The list
of options is then just two, with ucs2 as the default.

>... Or did you mean this to be a summary of all open
> issues?  Then there are several more.

What are the open issues in your mind? I'm not clear on which things
you've only expressed an opinion on and which you've ruled on.

> Nit: there's no copyright clause.  All PEPs should have one.

Okay.

When I hear from you I'll update it.

-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook