[Python-Dev] Support for "wide" Unicode characters
Paul Prescod
paulp@ActiveState.com
Sun, 01 Jul 2001 11:08:10 -0700
Neil Hodgson wrote:
>
> Paul Prescod:
> <PEP: 261>
>
> The problem I have with this PEP is that it is a compile time option
> which makes it hard to work with both 32 bit and 16 bit strings in one
> program. Can not the 32 bit string type be introduced as an additional type?
The two solutions are not mutually exclusive. If you (or someone else)
supply a 32-bit type and Guido accepts it, then the compile option
might fall into disuse. But this solution was chosen because it is much
less work. Really though, I think that having 16-bit and 32-bit types is
extra confusion for very little gain. I would much rather have a single
space-efficient type that hid the details of its implementation. But
nobody has volunteered to code it and Guido might not accept it even if
someone did.
>...
> This wasn't usefully true in the past for DBCS strings and is not the
> right way to think of either narrow or wide strings now. The idea that
> strings are arrays of characters gets in the way of dealing with many
> encodings and is the primary difficulty in localising software for Japanese.
The whole benefit of moving to 32-bit character strings is to allow
people to think of strings as arrays of characters. Forcing them to
consider variable-length encodings is precisely what we are trying to
avoid.
> Iteration through the code units in a string is a problem waiting to bite
> you and string APIs should encourage behaviour which is correct when faced
> with variable width characters, both DBCS and UTF style. Iteration over
> variable width characters should be performed in a way that preserves the
> integrity of the characters.
On wide Python builds there is no such thing as variable width Unicode
characters. It doesn't make sense to combine two 32-bit characters to
get a 64-bit one. On narrow Python builds you might want to treat a
surrogate pair as a single character but I would strongly advise against
it. If you want wide characters, move to a wide build. Even if a narrow
build is more space efficient, you'll lose a ton of performance
emulating wide characters in Python code.
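To make the narrow-build caveat concrete, here is a minimal sketch. It is
not taken from PEP 261. It assumes a modern Python 3 interpreter, where a
string is always indexed by code point, so the narrow-build view is
emulated by encoding to UTF-16. The helper names utf16_code_units and
combine_surrogates are hypothetical.

```python
def utf16_code_units(text):
    """Return the 16-bit code units a narrow build would store for text."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# One character outside the 16-bit range, between two ASCII letters.
s = "a\U00010400b"

# Wide view: one element per character.
wide_length = len(s)

# Emulated narrow view: the middle character occupies two code units.
units = utf16_code_units(s)
narrow_length = len(units)

# Reassembling the pair by hand is the emulation work a wide build avoids.
recombined = combine_surrogates(units[1], units[2])
```

The wide view has three elements and the emulated narrow view has four,
which is the mismatch that code iterating over code units has to handle.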
> ... M.-A. Lemburg's proposed set of iterators could
> be extended to indicate encoding "for c in s.asCharacters('utf-8')" and to
> provide for the various intended string uses such as "for c in
> s.inVisualOrder()" reversing the receipt of right-to-left substrings.
A floor wax and a dessert topping. <0.5 wink>
I don't think that the average Python programmer would want
s.asCharacters('utf-8') when they already have s.decode('utf-8'). We
decided a long time ago that the model for standard users would be
fixed-length (1!), abstract characters. That's the way Python's Unicode
subsystem has always worked.
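The fixed-length model can be sketched as follows. This assumes a modern
Python 3 interpreter, where the encoded form is a bytes object and the
decoded form is a str; the variable names are illustrative only.

```python
# Encoded data: a variable-length byte sequence (three bytes per character here).
raw = "\u65e5\u672c\u8a9e".encode("utf-8")

# Decode once at the boundary; afterwards each element is one abstract character.
text = raw.decode("utf-8")

byte_count = len(raw)    # length of the encoded form, in bytes
char_count = len(text)   # length of the decoded form, in characters

# Plain iteration and indexing both work per character, not per byte.
code_points = [ord(c) for c in text]
second_char = text[1]
```

Once the data is decoded, ordinary iteration and indexing already operate
on whole characters, so no encoding-aware iterator is needed.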
--
Take a recipe. Leave a recipe.
Python Cookbook! http://www.ActiveState.com/pythoncookbook