[Python-Dev] Support for "wide" Unicode characters

Just van Rossum just@letterror.com
Sun, 1 Jul 2001 16:43:08 +0200


Guido van Rossum wrote:

> > <PEP: 261>
> > 
> >    The problem I have with this PEP is that it is a compile time option
> > which makes it hard to work with both 32 bit and 16 bit strings in one
> > program. Can not the 32 bit string type be introduced as an additional type?
> 
> Not without an outrageous amount of additional coding (every place in
> the code that currently uses PyUnicode_Check() would have to be
> bifurcated in a 16-bit and a 32-bit variant).

Alternatively, a Unicode object could *internally* be either 8, 16, or 32 bits
wide (to be clear: not per character, but per string). Also a lot of work, but
it would be a lot less wasteful.
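
To make the idea concrete, here is a rough sketch in Python of how such an
object could pick its internal width; the function name is made up for
illustration, and the real decision would of course live in the C
implementation of PyUnicodeObject:

    def narrowest_width(codepoints):
        # Smallest per-character storage width (in bytes) that fits
        # every character of this *one* string.
        if not codepoints:
            return 1
        widest = max(codepoints)
        if widest < 0x100:
            return 1     # Latin-1 range: 8 bits per character is enough
        if widest < 0x10000:
            return 2     # BMP only: 16 bits per character
        return 4         # beyond the BMP: 32 bits per character

    narrowest_width([0x41, 0x7A])      # -> 1, plain ASCII
    narrowest_width([0x41, 0x20AC])    # -> 2, the euro sign is still in the BMP
    narrowest_width([0x41, 0x10300])   # -> 4, Old Italic lives outside the BMP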

> I doubt that the desire to work with both 16- and 32-bit characters in
> one program is typical for folks using Unicode -- that's mostly
> limited to folks writing conversion tools.  Python will offer the
> necessary codecs so you shouldn't have this need very often.

Not a lot of people will want to work with 16- or 32-bit chars directly, but I
think people *will* want a less wasteful solution to the surrogate pair
problem. Why use 32 bits for every string in a program when only a tiny
percentage actually *needs* more than 16? (Or even 8...)
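
To put a rough number on that, here is a back-of-the-envelope comparison; the
codecs stand in for what a fixed-width internal representation would occupy:

    text = u"nothing but plain ASCII here " * 1000   # 29000 characters

    len(text.encode("latin-1"))    # 29000 bytes:  1 byte per character is enough
    len(text.encode("utf-16-le"))  # 58000 bytes:  2 bytes per character
    4 * len(text)                  # 116000 bytes: what a fixed 32-bit string costs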

> > Iteration through the code units in a string is a problem waiting to bite
> > you and string APIs should encourage behaviour which is correct when faced
> > with variable width characters, both DBCS and UTF style.
> 
> But this is not the Unicode philosophy.  All the variable-length
> character manipulation is supposed to be taken care of by the codecs,
> and then the application can deal in arrays of characters.

Right: this is the way it should be.
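
With the codec doing its job, the application never has to see the multi-byte
form at all; something like this, with UTF-8 input picked just as an example:

    data = b"\xe2\x82\xac and more"   # UTF-8 bytes coming in from outside
    u = data.decode("utf-8")          # the codec absorbs the variable-length encoding
    u[0]                              # u'\u20ac', just "the first character"
    len(u)                            # 10, counted in characters, not in bytes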

My difficulty with PEP 261 is that I'm afraid few people will actually enable
32-bit support (*what*?! all Unicode strings become 32 bits wide? no way!),
thereby making programs non-portable in very subtle ways.
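
A small sketch of the kind of subtle difference I mean; the comments show what
I'd expect from a 16-bit build versus a 32-bit build of the same interpreter:

    s = u"\U00010000"     # one character beyond the BMP

    # On a 16-bit ("narrow") build:
    len(s)                # 2: stored as a surrogate pair
    s[0]                  # u'\ud800': half a character

    # On a 32-bit ("wide") build:
    len(s)                # 1
    s[0]                  # the character itself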

Just