[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 10:22:40 -0400


> Guido van Rossum writes:
> [snip]
> > Agreed.  But be prepared that at some point in the future the Unicode
> > world might end up agreeing on 4 bytes too...
> 
> With the release of the Plane 2 ideographic extensions in Unicode 3.1
> there are two options available: include surrogate support via UTF-16,
> which means dealing with multibyte (really multi"word") characters, or
> switching to UTF-32, allowing characters outside Plane 0 to be
> accessed uniformly.
> 
> Note that this is a real issue: the Hong Kong Supplementary Character
> Set includes characters contained in Plane 2 when mapped to Unicode
> 3.1.
> 
> > If ISO 10646 becomes important to our users, we'll have to support
> > it, if only by providing a codec.
> 
> This is beyond ISO 10646 --- Unicode 3.1 support brings the issue to
> the fore.
> 
>     -tree

I don't think switching to a 32-bit character type is the right thing
to do for us (although I think it should be easier than it currently
is -- redefining Py_UNICODE as a 32-bit unsigned int should be all that
it takes, but at the moment it is not).

I'm all for taking the lazy approach and letting applications that
need surrogate support do it themselves, at the application level.
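
As a rough illustration of what application-level surrogate handling
could look like, here is a minimal sketch. It assumes the application
holds its text as a sequence of 16-bit code units (plain integers), and
the helper names to_surrogates, from_surrogates and code_points are
hypothetical, not part of any Python API. The arithmetic is the
standard UTF-16 surrogate mapping.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def code_points(units):
    """Yield code points from 16-bit code units, pairing up surrogates.

    Unpaired surrogates are passed through unchanged; a stricter
    application might prefer to raise an error instead.
    """
    units = list(units)
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield from_surrogates(unit, units[i + 1])
            i += 2
        else:
            yield unit
            i += 1
```

For example, U+20000 (the first code point of Plane 2) splits into the
pair 0xD840, 0xDC00, and code_points([0x41, 0xD840, 0xDC00, 0x42])
yields three code points rather than four code units.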

--Guido van Rossum (home page: http://www.python.org/~guido/)