[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 11:44:32 -0400


> Guido van Rossum writes:
> > Depends on what you call transparent.  I'm all for smart codecs
> > between UTF-16 and UTF-8, but if you have a surrogate in a Unicode
> > string, the application will have to know not to split it in the
> > middle, and it must realize that len(u) is not necessarily the number
> > of characters -- it's the number of 16-bit units in the UTF-16
> > encoding.
> 
> Surrogates were created as a way to allow characters outside Plane 0
> (the BMP) to be accessed within a sixteen-bit codespace. When using
> UTF-16 a character consists of either two octets or four octets. A
> character that cannot be represented within the 16-bit code space is
> encoded using a surrogate pair, but it is the same character
> regardless.
> 
> So, for example, the ideograph at U+20000 is the same character
> whether it is encoded as <20000> (UCS-4, UTF-32), <D840 DC00>
> (UTF-16), or <F0 A0 80 80> (UTF-8). It doesn't matter what
> transformation format you use: it's the *same* character.
> 
> Hence, when I have a Unicode string, I'm thinking of each character as a
> Unicode character, not as a sequence of UTF-16 or UCS-2 two-octet
> words.
> 
> Hence my belief that Unicode strings should not be synonymous with the
> underlying physical character representation that is used.
> 
> Clear as mud? :-)
> 
>     -tree

Very clear.

But, just as a Python 8-bit string object containing the UTF-8 encoded
character U+20000 contains 4 bytes, with s[0] being '\xF0' etc., a
Python "unicode" string containing that character as a surrogate will
have length 2, with u[0] being u'\uD840' and u[1] being u'\uDC00'.
You can think of it as containing a single character, but the
interface gives you the individual items of the UTF-16 encoding.
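
To make the byte and code-unit counts concrete, here is a minimal sketch.
It assumes a modern Python 3 interpreter, where a str is indexed by code
point, so the 16-bit view described above has to be recovered by encoding
to UTF-16 explicitly.  The variable names are illustrative only.

```python
# Assumption: modern Python 3, where str is indexed by code point.
ch = "\U00020000"  # the ideograph at U+20000

utf8 = ch.encode("utf-8")        # four bytes
utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark

# Split the UTF-16 bytes into 16-bit code units.
units = [int.from_bytes(utf16[i:i + 2], "big")
         for i in range(0, len(utf16), 2)]

print(len(utf8), [hex(b) for b in utf8])     # 4 bytes: F0 A0 80 80
print(len(units), [hex(u) for u in units])   # 2 units: D840 DC00
print(len(ch))                               # 1 code point
```

The two entries of units correspond to u[0] and u[1] in the 16-bit
representation discussed here.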

You can believe what *should* happen all you want, but we're not going
to change this soon.  Indexing u[i] has to take constant time,
independent of the length of u and the value of i.

It may change *eventually* -- when we switch to UCS-4 for the internal
representation.  Until then, the API will deal in 16-bit values that
may or may not be "characters".
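
As a rough illustration of what this leaves to the application, the sketch
below works on a plain list of 16-bit values.  The helper names
(count_code_points, safe_split_index) are hypothetical and not part of any
Python API; they only show one way to count characters and to avoid
splitting a surrogate pair in the middle.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def count_code_points(units):
    """Count characters in a sequence of UTF-16 code units.

    This needs a linear scan, unlike len(units).
    """
    count = 0
    i = 0
    while i < len(units):
        if (is_high_surrogate(units[i]) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            i += 2  # a well-formed surrogate pair is one character
        else:
            i += 1  # a BMP character (or an unpaired surrogate)
        count += 1
    return count

def safe_split_index(units, i):
    """Move a split index back by one if it falls inside a surrogate pair."""
    if (0 < i < len(units) and is_high_surrogate(units[i - 1])
            and is_low_surrogate(units[i])):
        return i - 1
    return i

# 'A', U+20000 as a surrogate pair, 'B'
units = [0x0041, 0xD840, 0xDC00, 0x0042]
print(len(units))                  # 4 code units
print(count_code_points(units))    # 3 characters
print(safe_split_index(units, 2))  # 1, so the pair stays together
```

Counting characters this way is linear in the length of the sequence,
while indexing a single 16-bit item stays constant time.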

I'd say that ideally the choice to have a 2 or 4 byte internal
representation (or no Unicode support at all, for some platforms like
PalmOS!) should be a configuration choice.  Right now the
implementation doesn't allow that choice at all, which should be
remedied -- maybe you can help by submitting patches?

--Guido van Rossum (home page: http://www.python.org/~guido/)