[Python-Dev] Support for "wide" Unicode characters

Guido van Rossum guido@digicool.com
Sun, 01 Jul 2001 14:37:48 -0400


> Alternatively, a Unicode object could *internally* be either 8, 16
> or 32 bits wide (to be clear: not per character, but per
> string). Also a lot of work, but it'll be a lot less wasteful.

Depending on what you prefer to waste: developers' time or computer
resources.  I bet that if you try to measure the wasted space you'll
find that it wastes very little compared to all the other overheads
in a typical Python program: CPU time compared to writing your code in
C, memory overhead for integers, etc.
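(As a hedged aside: the per-string idea quoted above did eventually land
in CPython as PEP 393 in Python 3.3, which makes the "measure the
wasted space" argument easy to try today.  A minimal sketch, assuming a
modern CPython where string storage is 1, 2, or 4 bytes per code point
depending on the widest character in the string:)

```python
import sys

# Three 100-character strings whose widest characters differ:
ascii_s = "a" * 100           # ASCII: 1 byte per code point
bmp_s = "\u4e2d" * 100        # CJK (inside the BMP): 2 bytes per code point
astral_s = "\U0001F600" * 100 # outside the BMP: 4 bytes per code point

# Storage grows with the widest character, not with a global setting:
print(sys.getsizeof(ascii_s))
print(sys.getsizeof(bmp_s))
print(sys.getsizeof(astral_s))
```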

It so happened that the Unicode support was written to make it very
easy to change the compile-time code unit size; but making this a
per-string (or even global) run-time variable is much harder without
touching almost every place that uses Unicode (not to mention slowing
down the common case).

Nobody was enthusiastic about fixing this, so our choice was really
between staying with 16 bits or making 32 bits an option for those who
need it.

> Not a lot of people will want to work with 16 or 32 bit chars
> directly,

How do you know?  There are more Chinese than Americans and Europeans
together, and they will soon all have computers. :-)

> but I think a less wasteful solution to the surrogate pair
> problem *will* be desired by people. Why use 32 bits for all strings
> in a program when only a tiny percentage actually *needs* more than
> 16? (Or even 8...)

So work in UTF-8 -- a lot of work can be done in UTF-8.
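(A minimal sketch of what "work in UTF-8" can look like, in modern
Python syntax: byte-oriented operations such as splitting on an ASCII
delimiter are safe on UTF-8 data, because the bytes of a multi-byte
sequence never collide with ASCII byte values; you decode back to a
string only where you need character-level access.)

```python
text = "na\u00efve caf\u00e9"        # Unicode string with non-ASCII characters
data = text.encode("utf-8")          # compact variable-length byte form

# Splitting on an ASCII space works directly on the UTF-8 bytes:
parts = data.split(b" ")

# Decode only at the boundary where characters matter again:
words = [p.decode("utf-8") for p in parts]
print(words)                         # two words, accents intact

# UTF-8 spends extra bytes only on the non-ASCII characters:
print(len(text), len(data))
```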

> > But this is not the Unicode philosophy.  All the variable-length
> > character manipulation is supposed to be taken care of by the codecs,
> > and then the application can deal in arrays of characters.
> 
> Right: this is the way it should be.
> 
> My difficulty with PEP 261 is that I'm afraid few people will
> actually enable 32-bit support (*what*?! all unicode strings become
> 32 bits wide? no way!), therefore making programs non-portable in
> very subtle ways.

My hope and expectation is that those folks who need 32-bit support
will enable it.  If this solution is not sufficient, we may have to
provide something else in the future, but given that the
implementation effort for PEP 261 was very minimal (certainly less
than the time expended in discussing it) I am very happy with it.

It will take quite a while before lots of folks need the 32-bit
support (there aren't that many characters defined outside the basic
plane yet).  In the meantime, those who need 32-bit support
should be happy that we allow them to rebuild Python with 32-bit
support.  In the next 5-10 years, the 32-bit support requirement will
become more common -- as will the memory upgrades to make it
painless.

It's not like Python is making this decision in a vacuum either: Linux
already has 32-bit wchar_t.  32-bit characters will eventually be
common (even in Windows, which probably has the largest investment in
16-bit Unicode at the moment of any system).  Like IPv6, we're trying
to enable uncommon uses of Python without breaking things for the
not-so-early adopters.

Again, don't see PEP 261 as the ultimate answer to all your 32-bit
Unicode questions.  Just consider that realistically we have two
choices: stick with 16-bit support only or make 32-bit support an
option.  Other approaches (more surrogate support, run-time choices,
transparent variable-length encodings) simply aren't realistic --
no-one has the time to code them.

It should be easy to write portable Python programs that work
correctly with 16-bit Unicode characters on a "narrow" interpreter and
also work correctly with 21-bit Unicode on a "wide" interpreter:
just avoid using surrogates.  If you *need* to work with surrogates,
try to limit yourself to very simple operations like concatenations of
valid strings, and splitting strings at known delimiters only.
There's a lot you can do with this.
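(A hedged sketch of the portability advice above: on a "narrow" build,
sys.maxunicode is 0xFFFF and non-BMP characters appear as surrogate
pairs, while on a "wide" build it is 0x10FFFF.  The helper below counts
code points either way; note that PEP 393 later unified the builds, so
on Python 3.3+ the narrow branch is never taken.)

```python
import sys

# True only on a historical "narrow" (16-bit code unit) build:
NARROW = sys.maxunicode == 0xFFFF

def code_point_length(s):
    """Count code points, treating a surrogate pair as one character."""
    if not NARROW:
        return len(s)          # wide build: one item per code point
    count = 0
    skip = False
    for ch in s:
        if skip:               # second half of a pair already counted
            skip = False
            continue
        if "\ud800" <= ch <= "\udbff":  # high surrogate starts a pair
            skip = True
        count += 1
    return count
```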

--Guido van Rossum (home page: http://www.python.org/~guido/)