[Python-Dev] Support for "wide" Unicode characters

M.-A. Lemburg mal@lemburg.com
Mon, 02 Jul 2001 12:08:53 +0200


Paul Prescod wrote:
> 
> "M.-A. Lemburg" wrote:
> >
> >...
> >
> > The term "character" in Python should really only be used for
> > the 8-bit strings.
> 
> Are we going to change chr() and unichr() to one_element_string() and
> unicode_one_element_string()

No. I am just suggesting that we make use of the crystal-clear
definitions which the Unicode Consortium has developed for us.
 
> u[i] is a character. If u is Unicode, then u[i] is a Python Unicode
> character. No Python user will find that confusing no matter how Unicode
> knuckle-dragging, mouth-breathing, wife-by-hair-dragging they are.

Except that u[i] maps to a code unit, which may or may not be
a code point. Whether a code point matches a grapheme (which
is what users tend to regard as a character) is yet another
story, due to combining code points.
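(A quick illustration in today's Python, where str indexing yields
code points rather than code units, of the code point vs. grapheme
distinction:)

```python
import unicodedata

# One grapheme ("e" with acute accent) built from two code points:
# the base letter followed by a combining acute accent.
decomposed = "e\u0301"
print(len(decomposed))  # 2 -- two code points, but one "character" to the user

# NFC normalization collapses the pair into the single precomposed
# code point U+00E9; the user-visible "character" is unchanged.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed), composed == "\u00e9")  # 1 True
```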

> > In Unicode a "character" can mean any of:
> 
> Mark Davis said that "people" can use the word to mean any of those
> things. He did not say that it was imprecisely defined in Unicode.
> Nevertheless I'm not using the Unicode definition anymore than our
> standard library uses an ancient Greek definition of integer. Python has
> a concept of integer and a concept of character.

Ok, I'll stop whining. Just as a final remark, let me say that
our little discussion is a perfect example of how people can
misunderstand each other by using the same terms in different ways
(Kant tried to solve this for Philosophy and did not succeed;
so I guess the Unicode Consortium doesn't stand a chance 
either ;-)
 
> > >     It has been proposed that there should be a module for working
> > >     with UTF-16 strings in narrow Python builds through some sort of
> > >     abstraction that handles surrogates for you. If someone wants
> > >     to implement that, it will be another PEP.
> >
> > Uhm, narrow builds don't support UTF-16... it's UCS-2 which
> > is supported (basically: store everything in range(0x10000));
> > the codecs can map code points to surrogates, but it is solely
> > their responsibility and the responsibility of the application
> > using them to take care of dealing with surrogates.
> 
> The user can view the data as UCS-2, UTF-16, Base64, ROT-13, XML, ....
> Just as we have a base64 module, we could have a UTF-16 module that
> interprets the data in the string as UTF-16 and does surrogate
> manipulation for you.
> 
> Anyhow, if any of those is the "real" encoding of the data, it is
> UTF-16. After all, if the codec reads in four non-BMP characters in,
> let's say, UTF-8, we represent them as 8 narrow-build Python characters.
> That's the definition of UTF-16! But it's easy enough for me to take
> that word out so I will.

u[i] gives you a code unit, and whether this maps to a code point
depends on the implementation, which in turn depends on the
narrow/wide build choice.
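(On a modern wide build -- the only kind Python ships today -- the
non-BMP case looks as follows; the narrow-build len == 2 behaviour can
only be demonstrated indirectly, via an explicit UTF-16 encode:)

```python
import sys

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# Wide build: one code point per indexable element.
print(len(clef), hex(sys.maxunicode))  # 1 0x10ffff

# What a narrow (UCS-2) build would have stored for the same
# character: two 16-bit code units, i.e. the surrogate pair.
units = clef.encode("utf-16-be")
print(len(units) // 2)  # 2
```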

In UCS-2, I believe, surrogates are regarded as two separate
code points; in UTF-16 they always have to come in pairs. There's
a semantic difference here which the codecs and these additional
tools have to be aware of -- not the Unicode type implementation.
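(The pairing rule the codecs have to implement is simple bit
arithmetic; a sketch -- the function name is mine, not from any
proposed API:)

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point into its UTF-16 surrogate pair."""
    assert 0xFFFF < cp <= 0x10FFFF
    cp -= 0x10000                   # 20 bits remain
    high = 0xD800 | (cp >> 10)      # high (lead) surrogate: top 10 bits
    low = 0xDC00 | (cp & 0x3FF)     # low (trail) surrogate: bottom 10 bits
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) splits into D834 DD1E.
print([hex(u) for u in to_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
```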

> >...
> > Also, the module will be useful for both narrow and wide builds,
> > since the notion of an encoded character can involve multiple code
> > points. In that sense Unicode is always a variable length
> > encoding for characters and that's the application field of
> > this module.
> 
> I wouldn't advise that you do all different types of normalization in a
> single module but I'll wait for your PEP.

I'll see if I find some time at the Bordeaux Python Meeting
next week.
 
> > Here's the adjusted text:
> >
> >      It has been proposed that there should be a module for working
> >      with Unicode objects using character-, word- and line- based
> >      indexing. The details of the implementation are left to
> >      another PEP.
> 
>      It has been proposed that there should be a module that handles
>      surrogates in narrow Python builds for programmers. If someone
>      wants to implement that, it will be another PEP. It might also be
>      combined with features that allow other kinds of character-,
>      word- and line- based indexing.

Hmm, I liked my version better, but what the heck ;-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/