[I18n-sig] Python Support for "Wide" Unicode characters
Paul Prescod
paulp@ActiveState.com
Wed, 27 Jun 2001 17:20:39 -0700
Guido van Rossum wrote:
>
>...
>
> In the style of the PEP process, there should probably be some
> discussion of the alternatives that were proposed, considered and
> rejected, in particular (1) place the burden of surrogate handling on
> the application, possibly with limited library support, and (2) try to
> mend the unicode string object so that it is always indexed in
> characters, even if it contains surrogates.
Okay.
>
> I think PEPs should get wider distribution than a SIG. Maybe after
> the first round of comments on i18n-sig is over you can post it to
> c.l.py(.a) and python-dev?
I agree. That's what I intended. I thought it would be confusing if I
posted to the other areas before I had all of my facts right.
> I would express this as 17 * 2**16 - 1, to emphasize the fact that
> there are 17 planes of 2**16 characters each.
Done.
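
As a quick check of that arithmetic, here is a small sketch. TOPCHAR is the name used in the draft; PLANES and PLANE_SIZE are names made up for this illustration.

```python
# 17 planes of 2**16 code points each, numbered from zero.
PLANES = 17          # illustrative name, not from the PEP
PLANE_SIZE = 2**16   # illustrative name, not from the PEP

TOPCHAR = PLANES * PLANE_SIZE - 1

print(hex(TOPCHAR))  # 0x10ffff
```
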
> > * BUT on narrow builds of Python, the string will actually be
> > composed of two characters called a "surrogate pair".
>
> Can't call these characters. Maybe use "characters" in quotes, maybe
> use code points or items.
I think they ARE characters in the Python, not Unicode sense. So I said:
* BUT on narrow builds of Python, the string will actually be
composed of two characters (in the Python, not Unicode sense)
called a "surrogate pair". These two Python characters are
logically one Unicode character.
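
To make the "two Python characters, one Unicode character" point concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The helper name to_surrogate_pair is hypothetical and is not part of the proposal or of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) surrogate values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00
```

On a narrow build the string would hold those two 16-bit values as separate items, so len() would report 2 for what is logically one character.
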
> > * There is an integer value in the sys module that describes the
> > largest ordinal for a Unicode character on the current
> > interpreter. sys.maxunicode is 2**16-1 on narrow builds of
> > Python. On wide builds it could be either TOPCHAR or 2**32-1.
> > That's an open question.
>
> Given its name I think it should be TOPCHAR, even if unichr() accepts
> larger values.
Maybe there is virtue in having a way to ask for both the largest
*legal* Unicode character and the largest value that will fit into a
Python character on the platform. In theory the maximum Unicode
character is a constant, but that doesn't mean I want to declare it
explicitly in my programs.
unicodedata.maxchar => always TOPCHAR
sys.maxunicode => some power of 2 - 1
I'm not entirely happy that we call a thing "sys.maxunicode" and then
tell people how to generate larger values. How about sys.maxcodeunit?
(or we could remove the whole surrogate building stuff :) )
Do you want to rule on this or call it an open issue?
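
The two quantities being distinguished here can be sketched as follows. MAX_LEGAL_CODE_POINT stands in for the proposed unicodedata.maxchar, which is only a suggestion in this thread and is not an existing attribute; max_code_unit is likewise a hypothetical helper.

```python
import sys

# Stand-in for the proposed unicodedata.maxchar (hypothetical; the
# attribute is only a suggestion and does not exist in the library).
MAX_LEGAL_CODE_POINT = 17 * 2**16 - 1

def max_code_unit(unit_size_bytes):
    """Largest value that fits in one storage unit of the given width."""
    return 2 ** (8 * unit_size_bytes) - 1

print(max_code_unit(2))  # 2-byte units: 65535
print(max_code_unit(4))  # 4-byte units: 4294967295
print(sys.maxunicode)    # whatever the running interpreter reports
```

The first value never changes between platforms, while the second depends only on the storage width, which is the argument for giving them separate names.
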
> > * Note that ord() can in some cases return ordinals
> > higher than sys.maxunicode because it accepts surrogate pairs
> > on narrow Python builds.
And if sys.maxunicode is TOPCHAR then you can also get ords greater than
sys.maxunicode just by using unichr on values larger than
sys.maxunicode.
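
A rough sketch of the extended ord() behaviour, working on plain integers so it does not depend on how the interpreter was built. The name wide_ord and its list-of-ints interface are assumptions made for this illustration, not the real implementation.

```python
def wide_ord(units):
    """Return the ordinal for one code unit or one surrogate pair.

    `units` is a sequence of 16-bit integers (hypothetical interface).
    """
    if len(units) == 1:
        return units[0]
    if len(units) == 2:
        high, low = units
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            # Undo the surrogate encoding: 10 bits from each half.
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    raise TypeError("expected one code unit or a valid surrogate pair")

print(hex(wide_ord([0xD800, 0xDC00])))  # 0x10000
```

The result, 0x10000, is already larger than the 2**16-1 that a narrow build reports as sys.maxunicode.
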
> > * codecs will be upgraded to support "wide characters". On narrow
> > Python builds, the codecs will generate surrogate pairs, on
> > wide Python builds they will generate a single character.
>
> Maybe add a note that this is the main thing that hasn't been fully
> implemented yet; everything else except the extended ord() is
> implemented now, AFAIK.
Done.
> > * new codecs will be written for 4-byte Unicode and older codecs
> > will be updated to recognize surrogates and map them to wide
> ^^^^^^^^^^
> Make that "surrogate pairs"
Done.
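
The difference between the two codec outputs can be seen with the UTF-16 and UTF-32 codecs. This sketch assumes a present-day Python 3 interpreter, not the 2001 interpreter under discussion, and it shows the encoded bytes rather than the internal string storage.

```python
import struct

text = "\U00010000"  # first code point beyond the 16-bit range

# UTF-16 (big-endian, no BOM) needs two 16-bit code units: a surrogate pair.
utf16 = text.encode("utf-16-be")
units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)
print([hex(u) for u in units])  # ['0xd800', '0xdc00']

# UTF-32 (big-endian, no BOM) needs a single 32-bit unit.
utf32 = text.encode("utf-32-be")
print(len(utf32) // 4)          # 1
```
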
> > USE_UCS4_STORAGE
>
> USE_UCS4_STORAGE is no more. Long live Py_UNICODE_SIZE (2 or 4).
Okay.
> > There is a new configure option:
> >
> > --enable-unicode=ucs2  configures a narrow Py_UNICODE, and uses
> >                        wchar_t if it fits
> > --enable-unicode=ucs4  configures a wide Py_UNICODE likewise
> > --enable-unicode       configures Py_UNICODE to wchar_t if
> >                        available, and to UCS-4 if not; this is
> >                        the default
>
> Not any more; the default is ucs2 now.
So there is no way to get the heuristic of "wchar_t if available, UCS-4
if not"? I'm not complaining, just checking. The list of options is now
just two, with ucs2 as the default.
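
Whichever option was chosen at configure time, a program can tell the two builds apart at run time from sys.maxunicode, since the draft says it is 2**16-1 on narrow builds. The helper build_kind is a hypothetical name used only for this sketch.

```python
import sys

def build_kind(maxunicode=sys.maxunicode):
    """Classify a build from its sys.maxunicode value (hypothetical helper)."""
    return "narrow" if maxunicode == 2**16 - 1 else "wide"

print(build_kind())  # depends on the interpreter running this
```
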
>... Or did you mean this to be a summary of all open
> issues? Then there are several more.
What are the open issues in your mind? I'm not clear on which things
you've expressed an opinion on and which you've ruled on.
> Nit: there's no copyright clause. All PEPs should have one.
Okay.
When I hear from you I'll update it.
--
Take a recipe. Leave a recipe.
Python Cookbook! http://www.ActiveState.com/pythoncookbook