[Python-Dev] Support for "wide" Unicode characters

Sun, 01 Jul 2001 11:08:10 -0700

Neil Hodgson wrote:
> 
> Paul Prescod:
> <PEP: 261>
> 
>    The problem I have with this PEP is that it is a compile time option
> which makes it hard to work with both 32 bit and 16 bit strings in one
> program. Can not the 32 bit string type be introduced as an additional type?

The two solutions are not mutually exclusive. If you (or someone)
supplies a 32-bit type and Guido accepts it, then the compile option
might fall into disuse. But this solution was chosen because it is much
less work. Really though, I think that having 16-bit and 32-bit types is
extra confusion for very little gain. I would much rather have a single
space-efficient type that hid the details of its implementation. But
nobody has volunteered to code it and Guido might not accept it even if
someone did.

>...
>    This wasn't usefully true in the past for DBCS strings and is not the
> right way to think of either narrow or wide strings now. The idea that
> strings are arrays of characters gets in the way of dealing with many
> encodings and is the primary difficulty in localising software for Japanese.

The whole benfit of moving to 32-bit character strings is to allow
people to think of strings as arrays of characters. Forcing them to
consider variable-length encodings is precisely what we are trying to
avoid.

> Iteration through the code units in a string is a problem waiting to bite
> you and string APIs should encourage behaviour which is correct when faced
> with variable width characters, both DBCS and UTF style. Iteration over
> variable width characters should be performed in a way that preserves the
> integrity of the characters. 

On wide Python builds there is no such thing as variable width Unicode
characters. It doesn't make sense to combine two 32-bit characters to
get a 64-bit one. On narrow Python builds you might want to treat a
surrogate pair as a single character but I would strongly advise against
it. If you want wide characters, move to a wide build. Even if a narrow
build is more space efficient, you'll lose a ton of performance
emulating wide characters in Python code.

> ... M.-A. Lemburg's proposed set of iterators could
> be extended to indicate encoding "for c in s.asCharacters('utf-8')" and to
> provide for the various intended string uses such as "for c in
> s.inVisualOrder()" reversing the receipt of right-to-left substrings.

A floor wax and a desert topping. <0.5 wink>

I don't think that the average Python programmer would want
s.asCharacters('utf-8') when they already have s.decode('utf-8'). We
decided a long time ago that the model for standard users would be
fixed-length (1!), abstract characters. That's the way Python's Unicode
subsystem has always worked.
-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook