[I18n-sig] Python Support for "Wide" Unicode characters

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 19:19:49 -0400


Nice job, Paul!  I especially like the notion of narrow and wide
Pythons. :-)

In the style of the PEP process, there should probably be some
discussion of the alternatives that were proposed, considered and
rejected, in particular (1) place the burden of surrogate handling on
the application, possibly with limited library support, and (2) try to
mend the unicode string object so that it is always indexed in
characters, even if it contains surrogates.

> PEP: 261
> Title: Python Support for "Wide" Unicode characters
> Version: 1.0
> Author: paulp@activestate.com (Paul Prescod)
> Status: Draft
> Type: Standards Track
> Python-Version: 2.2
> Created: 27-Jun-2001
> Post-History: 27-Jun-2001

I think PEPs should get wider distribution than a SIG.  Maybe after
the first round of comments on i18n-sig is over you can post it to
c.l.py(.a) and python-dev?

> Abstract
> 
>     Python 2.1 unicode characters can have ordinals only up to 65535.
>     These characters are known as Basic Multilingual Plane characters.
>     There are now characters in Unicode that live on other "planes".
>     The largest addressable character in Unicode has the ordinal
>     2**20 + 2**16 - 1. For readability, we will call this TOPCHAR.

I would express this as 17 * 2**16 - 1, to emphasize the fact that
there are 17 planes of 2**16 characters each.  (Both expressions
equal 1114111, i.e. 0x10FFFF.)

> Proposed Solution
> 
>     One solution would be to merely increase the maximum ordinal to a
>     larger value. Unfortunately the only straightforward implementation
>     of this idea is to increase the character code unit to 4 bytes. This
>     has the effect of doubling the size of most Unicode strings. In
>     order to avoid imposing this cost on every user, Python 2.2 will
>     allow 4-byte Unicode characters as a build-time option.
> 
> 
>     The 4-byte option is called "wide Py_UNICODE". The 2-byte option
>     is called "narrow Py_UNICODE".
> 
>     Most things will behave identically in the wide and narrow worlds.
> 
>     * the \u  and \U literal syntaxes will always generate the same
>       data that the unichr function would. They are just different
>       syntaxes for the same thing.
> 
>     * unichr(i) for 0 <= i < 2**16 always returns a size-one string.
> 
>     * unichr(i) for 2**16 <= i <= TOPCHAR will always
>       return a string representing the character. 
> 
>     * BUT on narrow builds of Python, the string will actually be
>       composed of two characters called a "surrogate pair".

Can't call these characters.  Maybe use "characters" in quotes, maybe
use code points or items.

>     * ord() will now accept surrogate pairs and return the ordinal of
>       the "wide" character. Open question: should it accept surrogate
>       pairs on wide Python builds?

After thinking about it, I think it should.  Apps that are written
specifically to handle surrogates (e.g. a conversion tool to remove
surrogates!) should work on wide interpreters, and ord() is the only
way to get the character value from a surrogate pair (short of
implementing the shifts and masks yourself, which is doable but a
pain).
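
For reference, roughly the arithmetic that ord() would spare
applications (a sketch, not the actual implementation; the helper
name is made up):

    def combine_surrogates(pair):
        # Combine a high/low surrogate pair into a single code point.
        hi, lo = ord(pair[0]), ord(pair[1])
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)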

>     * There is an integer value in the sys module that describes the
>       largest ordinal for a Unicode character on the current
>       interpreter. sys.maxunicode is 2**16-1 on narrow builds of
>       Python. On wide builds it could be either TOPCHAR
>       or 2**32-1. That's an open question.

Given its name I think it should be TOPCHAR, even if unichr() accepts
larger values.

>     * Note that ord() can in some cases return ordinals
>       higher than sys.maxunicode because it accepts surrogate pairs
>       on narrow Python builds.
> 
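Concretely, on a narrow build (a sketch, assuming the extended ord()
discussed above):

    >>> import sys
    >>> sys.maxunicode
    65535
    >>> s = unichr(0x10000)   # stored as a surrogate pair
    >>> len(s)
    2
    >>> ord(s)                # may exceed sys.maxunicode
    65536
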
>     * codecs will be upgraded to support "wide characters". On narrow
>       Python builds, the codecs will generate surrogate pairs, on 
>       wide Python builds they will generate a single character.

Maybe add a note that this is the main thing that hasn't been fully
implemented yet; everything else except the extended ord() is
implemented now, AFAIK.
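
For example, the intended behavior would be something like this
(a sketch):

    s = '\xf0\x90\x80\x80'.decode('utf-8')   # UTF-8 for U+10000
    # narrow build: len(s) == 2, a surrogate pair
    # wide build:   len(s) == 1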

>     * new codecs will be written for 4-byte Unicode and older codecs
>       will be updated to recognize surrogates and map them to wide
                                     ^^^^^^^^^^
Make that "surrogate pairs"

>       characters on wide Pythons.
> 
>     * there are no restrictions on constructing strings that use 
>       code points "reserved for surrogates" improperly. These are
>       called "lone surrogates". The codecs should disallow reading
>       these but you could construct them using string literals or
>       unichr().
> 
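For clarity, both of these construct a "lone surrogate" without
complaint (a sketch):

    lone = u'\ud800'        # via a string literal (high surrogate)
    lone = unichr(0xDC00)   # via unichr() (low surrogate)
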
> Implementation
> 
>     There is a new (experimental) define in Include/unicodeobject.h:
> 
>         #undef USE_UCS4_STORAGE
> 
>     If defined, Py_UNICODE is set to the same thing as Py_UCS4.

USE_UCS4_STORAGE is no more.  Long live Py_UNICODE_SIZE (2 or 4).
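
From Python code, by the way, the build width can be detected via
sys.maxunicode (a sketch; the name WIDE_BUILD is made up):

    import sys
    WIDE_BUILD = sys.maxunicode > 0xFFFF   # True iff Py_UNICODE_SIZE == 4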

>     There are new configure options:
> 
>         --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
>                         wchar_t if it fits
>         --enable-unicode=ucs4 configures a wide Py_UNICODE likewise
>         --enable-unicode      configures Py_UNICODE to wchar_t if
>                               available, and to UCS-4 if not; this
>                               is the default

Not any more; the default is ucs2 now.

>     The intention is that --disable-unicode, or --enable-unicode=no
>     removes the Unicode type altogether; this is not yet implemented.
> 
> Notes
> 
>     Note that len(unichr(i))==2 for i>=0x10000 on narrow machines.
> 
>     This means (for example) that the following code is not portable:
> 
>     x = 0x10000
>     if unichr(x) in somestring:
>         ...
> 
>     In general, you should be careful using "in" if the character
>     that is searched for could have been generated from unichr()
>     applied to a number greater than or equal to 0x10000, or from
>     a \U literal denoting such a code point.

I suppose we *could* fix the __contains__ implementation for Unicode
objects, but I'm -0 on that.
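
In the meantime, a portable spelling is possible because .find()
accepts substrings of any length, unlike "in" (a sketch; the helper
name is made up):

    def has_char(s, i):
        # True iff the character with ordinal i occurs in s,
        # on both narrow and wide builds.
        return s.find(unichr(i)) >= 0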

>     This PEP does NOT imply that people using Unicode need to use a
>     4-byte encoding. It only allows them to do so. For example, ASCII
>     is still a legitimate (7-bit) Unicode-encoding.
> 
> Open Questions
> 
>     "Code points" above TOPCHAR cannot be expressed in two 16-bit
>     characters. These are not assigned to Unicode characters and 
>     supposedly will never be. Should we allow them to be passed as 
>     arguments to unichr() anyhow? We could allow knowledgeable
>     programmers to use these "unused" characters for whatever
>     they want, though Unicode does not address them.
> 
>     "Lone surrogates" "should not" occur on wide platforms. Should
>     ord() still accept them?

Unclear what you tried to say here.  You already explained that there
are no restrictions on the use of lone surrogates, so ord() has no
choice (It would be pretty bad if you could construct a 1-code-point
string but ord() couldn't tell you what that code point was).  Or did
you mean "should ord() accept surrogate pairs?"  That question was
already asked above.  Or did you mean this to be a summary of all open
issues?  Then there are several more.

Nit: there's no copyright clause.  All PEPs should have one.

Again, thanks!!!

--Guido van Rossum (home page: http://www.python.org/~guido/)