[I18n-sig] Re: Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Tue, 26 Jun 2001 19:34:16 -0400


> 1) Sort order.  Unicode strings should sort in Unicode lexicographical
>    order.  With UCS-4 this is easy; just compare the Py_UNICODE values
>    one by one like C does with strcmp().  With UTF-16 this is more
>    complicated when surrogates get involved.  Basically, you go
>    through the strings being compared until you find the first
>    difference.  If both characters at this point are in the BMP or
>    both are high surrogates, just compare them as usual.  However, if
>    one is in the BMP and the other is a surrogate, you need to make
>    sure that the string with the surrogate in it sorts after the one
>    with the BMP character.  Straight comparison won't work since there
>    are characters in the BMP with numerical values greater than those
>    of surrogates.
> 
>    I believe that this is the right thing to do when Py_UNICODE is
>    UCS-2 since the added complexity is only O(1) per string comparison
>    and is very easy to implement.  This will ensure that
>    cmp(unichr(0xFFFD), unichr(0x10ABCD)) will work consistently and
>    correctly for both UCS-2 and UCS-4.

I'm neutral on this one; on the one hand I think we should minimize
the surrogate support outside the codecs, on the other hand this makes
some sense.
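
For illustration only, here is a minimal sketch of the comparison described
in point 1, written in present-day Python rather than against the Py_UNICODE
C code. The names utf16_units, fixup and cmp_utf16 are hypothetical, and the
remapping trick (shift surrogate code units above the rest of the BMP before
comparing the first differing pair) is one common way to do it, not
necessarily what the proposal had in mind.

```python
def utf16_units(text):
    """Return the UTF-16 code units of a str as a list of ints (helper for the demo)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit):
    """Remap one code unit so surrogates sort after every other BMP value."""
    if unit >= 0xE000:
        return unit - 0x800    # U+E000..U+FFFF slide down to 0xD800..0xF7FF
    if unit >= 0xD800:
        return unit + 0x2000   # surrogates slide up to 0xF800..0xFFFF
    return unit


def cmp_utf16(a, b):
    """Compare two code-unit sequences in code point order; returns -1, 0 or 1."""
    for x, y in zip(a, b):
        if x != y:
            # Only the first differing pair needs the fixup: O(1) extra work.
            fx, fy = fixup(x), fixup(y)
            return (fx > fy) - (fx < fy)
    # One sequence is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


bmp = utf16_units(chr(0xFFFD))       # a single code unit, 0xFFFD
astral = utf16_units(chr(0x10ABCD))  # a surrogate pair, 0xDBEA 0xDFCD
naive_says_bmp_is_greater = bmp > astral   # plain unit-by-unit ordering
fixed_result = cmp_utf16(bmp, astral)      # code point ordering
```

With the example from the quoted text, the plain unit-by-unit comparison puts
U+FFFD after U+10ABCD, while cmp_utf16 puts it before, which matches what a
UCS-4 build would give.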

> 2) There is an incompatibility between the two approaches since
>    unichr(high surrogate) + unichr(low surrogate) will magically be
>    the same as unichr(the appropriate astral codepoint) when UCS-2 is
>    used.  With UCS-4 they will not; it will result in a string with
>    two values that have no well-defined meaning.
> 
>    I don't think this is a show-stopper, but people will need to be
>    made aware.

Agreed.
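
To make the incompatibility in point 2 concrete, here is a small sketch. It
assumes a current Python 3, where strings behave like the UCS-4 case; the
helper name combine_surrogates is hypothetical. The arithmetic is the
standard UTF-16 rule for turning a high/low surrogate pair into a code point.

```python
def combine_surrogates(high, low):
    """Return the astral code point encoded by a UTF-16 surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Two separate surrogate values concatenated into one string.
pair = chr(0xDBEA) + chr(0xDFCD)

# The single astral character those two values would encode in UTF-16.
astral = chr(combine_surrogates(0xDBEA, 0xDFCD))

# With UCS-4 style storage the two strings are different objects of
# different lengths; on a UCS-2 build they would have been the same string.
same_string = (pair == astral)
lengths = (len(pair), len(astral))
```

Here same_string is False and lengths is (2, 1), which is the "two values
that have no well-defined meaning" situation; code that builds astral
characters by concatenating surrogates only works on the UCS-2 side.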

--Guido van Rossum (home page: http://www.python.org/~guido/)