[Python-Dev] Support for "wide" Unicode characters

Neil Hodgson nhodgson@bigpond.net.au
Mon, 2 Jul 2001 12:52:45 +1000


Guido van Rossum:

> >    This wasn't usefully true in the past for DBCS strings and is
> > not the right way to think of either narrow or wide strings
> > now. The idea that strings are arrays of characters gets in
> > the way of dealing with many encodings and is the primary
> > difficulty in localising software for Japanese.
>
> Can you explain the kind of problems encountered in some more detail?

   Programmers used to treating a character as an indexable code unit will
often split double-width characters when performing an action. For example,
searching for a particular double-byte character "bc" may incorrectly match
inside "abcd", where "ab" and "cd" are the characters. DBCS is not normally
self-synchronising, although UTF-8 is. Another common problem is counting
characters, for example when filling a line: hitting the line width and
forcing half a character onto the next line.
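
   A rough sketch of both problems, using Shift-JIS as a stand-in for "a
DBCS". The helper names (is_lead_byte, char_starts, find_char_aligned,
split_at_width) are hypothetical, and the lead-byte ranges are an assumption
that only approximates real Shift-JIS variants. The point is that a safe
search or line break has to scan from a known character boundary, because the
bytes alone do not say where a character starts.

```python
# Sketch only: Shift-JIS stands in for "a DBCS"; all helper names are made up.

def is_lead_byte(b):
    """True if b starts a two-byte character (assumed Shift-JIS lead ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def char_starts(data):
    """Byte offsets where a character begins, found by scanning from offset 0."""
    starts, i = [], 0
    while i < len(data):
        starts.append(i)
        i += 2 if is_lead_byte(data[i]) else 1
    return starts

def find_char_aligned(data, needle):
    """Like bytes.find, but only accepts matches on a character boundary."""
    for i in char_starts(data):
        if data.startswith(needle, i):
            return i
    return -1

def split_at_width(data, width):
    """Break at or before `width` bytes without cutting a character in half."""
    boundaries = char_starts(data) + [len(data)]
    cut = max((i for i in boundaries if i <= width), default=0)
    return data[:cut], data[cut:]

text = "\u30e1\u30e2".encode("shift_jis")  # two katakana, four bytes: "ab" "cd"
needle = text[1:3]                         # "bc": trail of 1st + lead of 2nd

naive_hit = text.find(needle)              # byte search matches mid-character
aligned_hit = find_char_aligned(text, needle)  # boundary-aware search does not
head, tail = split_at_width(text, 3)       # a cut at byte 3 would split "cd"

print(naive_hit, aligned_hit, len(head), len(tail))
```

   The naive byte search reports a hit at offset 1, inside the first
character, while the boundary-aware versions have to pay for a scan from the
start of the buffer.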

> I think it's a good idea to provide a set of higher-level tools as
> well.  However nobody seems to know what these higher-level tools
> should do yet.  PEP 261 is specifically focused on getting the
> lower-level foundations right (i.e. the objects that represent arrays
> of code units), so that the authors of higher level tools will have a
> solid base.  If you want to help author a PEP for such higher-level
> tools, you're welcome!

   It's more likely I'll publish some of the low-level pieces of
Scintilla/SinkWorld as a Python extension providing some of these facilities
in an editable-text class. Then we can see if anyone else finds the code
worthwhile.

   Neil