[I18n-sig] How does Python Unicode treat surrogates?
Tom Emerson
tree@basistech.com
Mon, 25 Jun 2001 09:10:15 -0400
Guido van Rossum writes:
[snip]
> Agreed. But be prepared that at some point in the future the Unicode
> world might end up agreeing on 4 bytes too...
With the release of the Plane 2 ideographic extensions in Unicode 3.1
there are two options: support surrogates via UTF-16, which means
dealing with multibyte (really multi-"word") characters, or switch to
UTF-32, which allows characters outside Plane 0 to be accessed
uniformly.
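
To make the trade-off concrete, here is a minimal sketch of the UTF-16
surrogate arithmetic for a Plane 2 code point. It assumes a current
Python 3 interpreter, where str is indexed by code point; the helper
names to_surrogate_pair and from_surrogate_pair are hypothetical and
exist only for this illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside Plane 0 into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside Plane 0")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = 0x20000                               # first code point of Plane 2
high, low = to_surrogate_pair(cp)
print(hex(high), hex(low))                 # 0xd840 0xdc00

# Code units needed per character in each encoding form.
utf16_units = len(chr(cp).encode("utf-16-be")) // 2   # 2 (a surrogate pair)
utf32_units = len(chr(cp).encode("utf-32-be")) // 4   # 1
print(utf16_units, utf32_units)
```

With UTF-16 the character occupies two 16-bit code units, so indexing
by code unit can land in the middle of a pair; with UTF-32 every
character is exactly one 32-bit unit.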
Note that this is a real issue: the Hong Kong Supplementary Character
Set includes characters that map to Plane 2 in Unicode 3.1.
> If ISO 10646 becomes important to our users, we'll have to support
> it, if only by providing a codec.
This goes beyond ISO 10646: Unicode 3.1 support brings the issue to
the fore.
-tree
--
Tom Emerson Basis Technology Corp.
Sr. Sinostringologist http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"