[I18n-sig] Unicode surrogates: just say no!

Machin, John JMachin@Colonial.com.au
Thu, 28 Jun 2001 09:14:17 +1000


Guido said:
   But on a narrow interpreter, that's a valid surrogate pair, so it's a
   single character, so ord() *should* return 0x10000 for this example.

IMO, once you say that a "valid surrogate pair" is a "single
character" in a narrow implementation, people will want to do
the indexing / slicing / dicing thing as well. ord() is just the
thin end of the wedge.

"No" should mean "no".

unichr() and ord() should be inverses *only*
in respect of scalar values up to sys.maxunicode.
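
A minimal sketch of that position, written for a modern Python 3 where
chr() plays the role of unichr(). The names strict_unichr, strict_ord and
NARROW_MAX are hypothetical; NARROW_MAX only models a narrow build whose
sys.maxunicode is 0xFFFF, and is not what sys.maxunicode reports today.

```python
NARROW_MAX = 0xFFFF  # assumption: sys.maxunicode on a narrow build


def strict_unichr(code, maxunicode=NARROW_MAX):
    """Return a single code unit; refuse anything above maxunicode."""
    if not 0 <= code <= maxunicode:
        raise ValueError("unichr() arg not in range(0x%X)" % (maxunicode + 1))
    return chr(code)


def strict_ord(s):
    """Accept exactly one code unit; a surrogate pair is two, so it fails."""
    if len(s) != 1:
        raise TypeError(
            "ord() expected a character, but string of length %d found" % len(s)
        )
    return ord(s)
```

Under these assumptions strict_ord(strict_unichr(n)) == n for every n up to
NARROW_MAX, and a two-unit string such as '\ud800\udc00' is rejected in the
same way that "xy" is.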

-----Original Message-----
From: Guido van Rossum [mailto:guido@digicool.com]
Sent: Thursday, 28 June 2001 8:59
To: Machin, John
Cc: Paul Prescod; i18n-sig@python.org
Subject: Re: [I18n-sig] Unicode surrogates: just say no!


> The "nice pair of invariants" for unichr() and ord() seem to involve
> what I call "all that variable-length mucking about" and Tim more
> robustly called "crap".
> 
> IMO, there should be a very short list of places where a narrow 
> Unicode implementation will need to know anything at all about
> surrogates. This short list will include codecs, the 
> \Uxxxxxxxx notation for literals, and unichr() --- the users can 
> ship it into the warehouse and ship it out again, but it won't be
> processed as other than 16-bit values.  Attempts to place other
> items on the list should be rigorously justified.

Thanks, that's about what I wanted to say!  But I assume you meant to
include ord() in that list, as it is unichr()'s inverse.  We should
have one place that implements the surrogate creation magic (unichr)
and one place that implements the surrogate unpacking magic (ord).
(Plus \U, which is to act like unichr(), and codecs.)
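
For reference, the "magic" in question is the standard UTF-16 arithmetic.
What follows is only a sketch of it: to_surrogates and from_surrogates are
hypothetical helper names, not functions in Python, and they work on plain
integers rather than on strings.

```python
def to_surrogates(code):
    """Split a scalar value in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("scalar value outside the supplementary range")
    offset = code - 0x10000              # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a valid (high, low) surrogate pair into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

With these definitions to_surrogates(0x10000) gives (0xD800, 0xDC00), and
from_surrogates(0xD800, 0xDC00) gives 0x10000 again.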

> Guido asked:
>    What should ord(u'\uD800\uDC00') mean on a wide interpreter? 
> 
> IMO, this should mean an exception on *both* narrow and wide
> interpreters, just as ord("xy") does. ord() should expect one
> and only one *character*.

But on a narrow interpreter, that's a valid surrogate pair, so it's a
single character, so ord() *should* return 0x10000 for this example.

> Let's just keep on saying no!

Yes!

--Guido van Rossum (home page: http://www.python.org/~guido/)

