[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Wed, 27 Jun 2001 18:58:36 -0400


> The "nice pair of invariants" for unichr() and ord() seem to involve
> what I call "all that variable-length mucking about" and Tim more
> robustly called "crap".
> 
> IMO, there should be a very short list of places where a narrow 
> Unicode implementation will need to know anything at all about
> surrogates. This short list will include codecs, the 
> \Uxxxxxxxx notation for literals, and unichr() --- the users can 
> ship it into the warehouse and ship it out again, but it won't be
> processed as other than 16-bit values.  Attempts to place other
> items on the list should be rigorously justified.

Thanks, that's about what I wanted to say!  But I assume you meant to
include ord() in that list, as it is unichr()'s inverse.  We should
have one place that implements the surrogate creation magic (unichr)
and one place that implements the surrogate unpacking magic (ord).
(Plus \U, which should act like unichr(), and the codecs.)
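
To make the two pieces of "magic" concrete, here is a minimal sketch of the
UTF-16 surrogate arithmetic in present-day Python. The helper names
to_surrogates() and from_surrogates() are hypothetical, chosen for
illustration; this is not the interpreter's own code, only the standard
arithmetic that a narrow unichr() and ord() would each have to apply.

```python
# Hypothetical helpers (names are illustrative, not from the Python source).

def to_surrogates(code_point):
    """Creation side: split a supplementary code point into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # fits in 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Unpacking side: combine a (high, low) pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

The two functions are inverses of each other over the range 0x10000 to
0x10FFFF, which is the invariant being asked of unichr() and ord().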

> Guido asked:
>    What should ord(u'\uD800\uDC00') mean on a wide interpreter? 
> 
> IMO, this should mean an exception on *both* narrow and wide
> interpreters, just as ord("xy") does. ord() should expect one
> and only one *character*

But on a narrow interpreter, that's a valid surrogate pair, so it's a
single character, so ord() *should* return 0x10000 for this example.
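
A rough sketch of that behaviour, assuming a narrow string is modelled as a
list of 16-bit code units. narrow_ord() is a hypothetical name used only for
this illustration, and the error message is made up.

```python
def narrow_ord(units):
    """ord()-like lookup over a list of 16-bit code units (hypothetical)."""
    # One code unit is one character: return it unchanged.
    if len(units) == 1:
        return units[0]
    # A high surrogate followed by a low surrogate also counts as one character.
    if (len(units) == 2
            and 0xD800 <= units[0] <= 0xDBFF
            and 0xDC00 <= units[1] <= 0xDFFF):
        return 0x10000 + ((units[0] - 0xD800) << 10) + (units[1] - 0xDC00)
    # Anything else is rejected, just as ord("xy") is.
    raise TypeError("expected a single character")
```

Here 0xD800 and 0xDC00 are the first high and the first low surrogate, so both
offsets are zero and narrow_ord([0xD800, 0xDC00]) gives 0x10000, while
narrow_ord([0x78, 0x79]) (the units of "xy") raises TypeError.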

> Let's just keep on saying no!

Yes!

--Guido van Rossum (home page: http://www.python.org/~guido/)