[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 10:43:02 -0400


> Guido van Rossum writes:
> [...]
> > I'm all for taking the lazy approach and letting applications that
> > need surrogate support do it themselves, at the application level.
> 
> Meaning what? Leaving it up to the application to be entirely
> responsible for handling surrogates is a mistake. As was stated
> earlier in the thread (apologies, I don't have the message around to
> make the appropriate attribution), surrogates are an implementation
> detail: to the user/application developer the presence of the
> surrogate pair needs to be transparent.
> 
> As long as the Unicode support functionality groks surrogates
> correctly (fully implements UTF-16) then the issue becomes a small one
> for the end user. The scanner would need to be modified to support
> Unicode escapes for values up to 0x10FFFF. Internally these are
> represented as surrogates.
> 
> Put the burden of these multibyte representations on the library
> implementor, not the end-user.
> 
>     -tree
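
[The mapping the scanner would apply is fixed by the UTF-16 encoding
rules: a code point above U+FFFF is split into a high and a low
surrogate. A minimal sketch in modern Python (the function name is
illustrative, not from any proposed patch):]

```python
def to_surrogates(cp):
    """Map a code point in U+10000..U+10FFFF to its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # high (lead) surrogate, D800..DBFF
    low = 0xDC00 + (v & 0x3FF)   # low (trail) surrogate, DC00..DFFF
    return high, low

# U+10400 (DESERET CAPITAL LETTER LONG I) becomes the pair D801 DC00:
assert to_surrogates(0x10400) == (0xD801, 0xDC00)
```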

Depends on what you call transparent.  I'm all for smart codecs
between UTF-16 and UTF-8, but if you have a surrogate in a Unicode
string, the application will have to know not to split it in the
middle, and it must realize that len(u) is not necessarily the number
of characters -- it's the number of 16-bit units in the UTF-16
encoding.
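
[A sketch of the point, in modern Python 3 where str holds code points;
a narrow build of the era stored the 16-bit units directly, so len(u)
below would have reported 4, not 3:]

```python
import struct

u = "a\U00010400b"  # one BMP char, one astral char, one BMP char
data = u.encode("utf-16-be")
# View the string as 16-bit units, as a narrow build stored it:
units = struct.unpack(">%dH" % (len(data) // 2), data)
# 3 characters, but 4 units: the astral char occupies a surrogate pair.
assert len(u) == 3 and len(units) == 4

# Splitting between the two halves of the pair leaves a lone surrogate,
# which is not valid UTF-16:
bad = data[:4]  # cuts between the high and low surrogate of U+10400
try:
    bad.decode("utf-16-be")
except UnicodeDecodeError:
    pass  # a lone high surrogate cannot be decoded
```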

Does that make sense?

I know I am hindered by a lack of understanding of Unicode
hairsplitting, angels-on-a-pin-dancing details; if I'm missing
something, it's likely that many other people don't know the details
either, so an explanation would be much appreciated!

--Guido van Rossum (home page: http://www.python.org/~guido/)