[I18n-sig] How does Python Unicode treat surrogates?

Tom Emerson tree@basistech.com
Mon, 25 Jun 2001 09:10:15 -0400


Guido van Rossum writes:
[snip]
> Agreed.  But be prepared that at some point in the future the Unicode
> world might end up agreeing on 4 bytes too...

With the release of the Plane 2 ideographic extensions in Unicode 3.1
there are two options available: include surrogate support via UTF-16,
which means dealing with multibyte (really "multi-word") characters, or
switch to UTF-32, allowing characters outside Plane 0 to be
accessed uniformly.
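
To make the trade-off concrete, here is a rough sketch of what the two
encodings do with a single Plane 2 character. It assumes a modern
Python 3 interpreter (not the Python of this thread), and the helper
name to_surrogates is made up for illustration; the arithmetic is the
standard UTF-16 surrogate calculation.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair.

    Hypothetical helper, named here only for illustration.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000                 # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low


ch = "\U00020000"                # first ideograph in Plane 2 (CJK Extension B)

utf16 = ch.encode("utf-16-be")   # two 16-bit code units (a surrogate pair)
utf32 = ch.encode("utf-32-be")   # one 32-bit code unit

print([hex(u) for u in to_surrogates(ord(ch))])  # ['0xd840', '0xdc00']
print(len(utf16) // 2, "UTF-16 code units")      # 2 UTF-16 code units
print(len(utf32) // 4, "UTF-32 code unit")       # 1 UTF-32 code unit
```

With UTF-16 any code that indexes or slices by code unit has to know
that U+20000 occupies two units; with UTF-32 every character is one
unit, at the cost of four bytes per character.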

Note that this is a real issue: the Hong Kong Supplementary Character
Set includes characters contained in Plane 2 when mapped to Unicode
3.1.

> If ISO 10646 becomes important to our users, we'll have to support
> it, if only by providing a codec.

This is beyond ISO 10646 --- Unicode 3.1 support brings the issue to
the fore.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"