[I18n-sig] Unicode surrogates: just say no!
Guido van Rossum
guido@digicool.com
Wed, 27 Jun 2001 19:38:04 -0400
> Guido said:
> > But on a narrow interpreter, that's a valid surrogate pair, so it's a
> > single character, so ord() *should* return 0x10000 for this example.
>
> IMO, once you say that a "valid surrogate pair" is a "single
> character" in a narrow implementation, people will want to do
> the indexing / slicing /dicing thing as well. ord() is just the
> thin end of the wedge.
>
> "No" should mean "no".
>
> unichr() and ord() should be inverses *only*
> in respect of scalar values up to sys.maxunicode.
Your position is weakened by inconsistency. If you really wanted to
be consistent, you should argue against \U and unichr() with
ordinals >= 0x10000 on narrow Pythons. :-)
IMO ord() and unichr() are so closely tied that either both of them
should support surrogate pairs, or neither. You know my position.
Surrogate support in ord() is not usable as a wedge to get the
indexing/slicing/dicing, because the implementation of those would be
too complicated, and we have the wide Python as a mighty weapon.
BTW, I quoted Paul:
> > * ord() will now accept surrogate pairs and return the ordinal of
> > the "wide" character. Open question: should it accept surrogate
> > pairs on wide Python builds?
and replied:
> After thinking about it, I think it should. Apps that are written
> specifically to handle surrogates (e.g. a conversion tool to remove
> surrogates!) should work on wide interpreters, and ord() is the only
> way to get the character value from a surrogate pair (short of
> implementing the shifts and masks yourself, which is doable but a
> pain).
I take that back. On wide Pythons, unichr() doesn't return surrogates
either. Once the whole world uses UCS-4 (around the time Python 3000
is released :-), surrogates can be deprecated anyway.
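
For reference, the "shifts and masks" mentioned above can be sketched
as follows. This is only an illustration of the standard UTF-16
arithmetic, not code from any Python build; the function names
combine_surrogates() and split_code_point() are made up for this
example, and both work on plain integers rather than Unicode strings.

```python
# Hypothetical helpers illustrating the UTF-16 surrogate arithmetic.
# They are not part of Python; the names are invented for this sketch.

def combine_surrogates(high, low):
    """Return the code point encoded by a (high, low) surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Return the (high, low) surrogate pair for a code point >= 0x10000."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the surrogate range: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

With these, combine_surrogates(0xD800, 0xDC00) gives 0x10000, the
value discussed at the top of this message, and split_code_point()
goes the other way, which is the sense in which ord() and unichr()
would be inverses on a narrow build.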
--Guido van Rossum (home page: http://www.python.org/~guido/)