[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 19:38:04 -0400


> Guido said:
>    But on a narrow interpreter, that's a valid surrogate pair, so it's a
>    single character, so ord() *should* return 0x10000 for this example.
> 
> IMO, once you say that a "valid surrogate pair" is a "single
> character" in a narrow implementation, people will want to do
> the indexing/slicing/dicing thing as well. ord() is just the
> thin end of the wedge.
> 
> "No" should mean "no".
> 
> unichr() and ord() should be inverses *only*
> in respect of scalar values up to sys.maxunicode.

Your position is weakened by inconsistency.  If you really wanted to
be consistent, you should argue against \U and unichr() with ordinals
>= 0x10000 on narrow Pythons. :-)

IMO ord() and unichr() are so closely tied that either both of them
should support surrogate pairs, or neither.  You know my position.
Surrogate support in ord() is not usable as a wedge to get the
indexing/slicing/dicing, because that implementation would be too
complicated, and we have the wide Python as a mighty weapon.
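
To make the unichr() half of that pairing concrete, here is a minimal
sketch of the arithmetic a narrow build would need in order to turn an
ordinal >= 0x10000 into a surrogate pair.  It follows the standard
UTF-16 rules and works on plain integers; the function name
to_surrogate_pair is hypothetical and is not part of any Python API.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration; it operates on integers only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000        # 20-bit value to distribute
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low
```

For example, to_surrogate_pair(0x10000) gives (0xD800, 0xDC00), the
pair for which ord() is expected to return 0x10000 in the case quoted
at the top of this message.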

BTW, I quoted Paul:

> >     * ord() will now accept surrogate pairs and return the ordinal of
> >       the "wide" character. Open question: should it accept surrogate
> >       pairs on wide Python builds?

and replied:

> After thinking about it, I think it should.  Apps that are written
> specifically to handle surrogates (e.g. a conversion tool to remove
> surrogates!) should work on wide interpreters, and ord() is the only
> way to get the character value from a surrogate pair (short of
> implementing the shifts and masks yourself, which is doable but a
> pain).

I take that back.  On wide Pythons, unichr() doesn't return surrogates
either.  Once the whole world uses UCS-4 (around the time Python 3000
is released :-), surrogates can be deprecated anyway.
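
For reference, the "shifts and masks" mentioned in the quoted reply
look roughly like this.  It is a sketch only, assuming the two UTF-16
code units are already available as integers; the name
from_surrogate_pair is hypothetical and is not an existing function.

```python
def from_surrogate_pair(high, low):
    """Combine a UTF-16 surrogate pair (as integers) into one code point.

    Hypothetical helper for illustration; it is not a built-in.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the 20-bit offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

from_surrogate_pair(0xD800, 0xDC00) returns 0x10000, and the largest
pair, (0xDBFF, 0xDFFF), returns 0x10FFFF.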

--Guido van Rossum (home page: http://www.python.org/~guido/)