[I18n-sig] How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 26 Jun 2001 01:15:58 +0200


> With the release of the Plane 2 ideographic extensions in Unicode 3.1
> there are two options available: include surrogate support via UTF-16,
> which means dealing with multibyte (really multi"word") characters, or
> switching to UTF-32, allowing characters outside Plane 0 to be
> accessed uniformly.
> 
> Note that this is a real issue: the Hong Kong Supplementary Character
> Set includes characters contained in Plane 2 when mapped to Unicode
> 3.1.

The most likely solution, of course, for the time being, is: Ignore
characters outside the BMP. IMO, Tim Peters' view is right: If the
internal representation uses surrogates, indexing should hide this
and count a surrogate pair as one character. This is not going to
happen unless somebody comes up with an efficient implementation.
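
To illustrate why efficiency is the sticking point, here is a minimal
sketch of character-based indexing over UTF-16 code units. It is written
in present-day Python 3 and is not the interpreter's implementation; the
helper names (to_utf16_units, code_point_at) are hypothetical. Because
a character may occupy one or two units, finding the n-th character
requires a linear scan from the start unless some extra index is kept.

```python
def to_utf16_units(text):
    """Split a string into 16-bit UTF-16 code units (illustrative helper)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def code_point_at(units, index):
    """Return the index-th character, counting a surrogate pair as one.

    This walks the units from the beginning, so it is O(n) per lookup,
    unlike plain code-unit indexing, which is O(1).
    """
    pos = 0      # position in code units
    count = 0    # position in characters
    while pos < len(units):
        unit = units[pos]
        if (is_high_surrogate(unit) and pos + 1 < len(units)
                and is_low_surrogate(units[pos + 1])):
            # Combine the pair into a code point outside the BMP.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[pos + 1] - 0xDC00)
            width = 2
        else:
            cp = unit
            width = 1
        if count == index:
            return cp
        count += 1
        pos += width
    raise IndexError("character index out of range")


units = to_utf16_units("a\U00020000b")   # U+20000 is a Plane 2 ideograph
second = code_point_at(units, 1)         # the whole Plane 2 character
third = code_point_at(units, 2)          # "b", although it sits at unit 3
```

With plain code-unit indexing, units[1] would be only the high half of
the pair, and "b" would be found at index 3 rather than 2.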

The obvious alternative solution is to use a 32-bit Py_UNICODE, which,
given Guido's comment, is also not going to happen.
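
As a rough illustration of the trade-off between the two
representations, the following sketch (present-day Python 3; the helper
name storage_bytes is hypothetical) compares the bytes needed for the
same text in 16-bit and 32-bit units. It makes no claim about which
consideration actually decided the matter.

```python
def storage_bytes(text):
    """Bytes needed to store text as UTF-16 and as UTF-32 (no BOM)."""
    return {
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


bmp_only = storage_bytes("abc")          # three BMP characters
plane_2 = storage_bytes("\U00020000")    # one Plane 2 character
```

For BMP-only text the 32-bit form takes twice the space; for a
character outside the BMP both forms take four bytes, but only the
32-bit form keeps one unit per character.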

So nothing will happen until enough Chinese users complain. I don't
know whether you count as Chinese for these purposes :-)

Regards,
Martin

P.S. The real issue IMO is display: Once there are fonts supporting
these characters, people will want to write programs that make use of
the fonts. As long as nobody can actually display such text, nobody will
request that indexing work reasonably.

P.P.S. Of course, if we wait until users actually use surrogates, it
will be too late to change the indexing - that would likely break
people's code.