break unichr instead of fix ord?

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Wed Aug 26 22:52:41 EDT 2009


On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:

> But regardless, the significant question is, what is the reason for
> having ord() (and unichr) not work for surrogate pairs and thus not
> usable with a large number of unicode characters that Python otherwise
> supports?


I'm no expert on Unicode, but my guess is that the reason is a desire 
for simplicity: unichr() should always return a single char, not a pair 
of chars, and similarly ord() should take a single char as input, not 
two, and return a single number.

Otherwise it would be ambiguous whether ord(surrogate_pair) should return 
a pair of ints representing the codes for each item in the pair, or a 
single int representing the code point for the whole pair.

E.g. given your earlier example:

>>> a = u'\U00010040'
>>> len(a)
2
>>> a[0]
u'\ud800'
>>> a[1]
u'\udc40'

would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040? If the 
latter, what about ord(u'ab')?

Remember that a unicode string can contain code points that aren't valid 
characters:

>>> ord(u'\ud800')  # reserved for surrogates, not a character
55296

so if ord() sees two items that look like a surrogate pair, it can't 
assume they are meant to be treated as a single character rather than as 
two separate code points that just happen to sit next to each other.

None of this means you can't deal with surrogate pairs; it just means you 
can't deal with them using ord() and unichr().
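
As a rough sketch of doing it by hand, the arithmetic for converting 
between a surrogate pair and a single code point looks something like the 
following. The helper names are made up for illustration and are not part 
of the standard library; the functions work on plain ints, so they should 
behave the same on narrow and wide builds.

```python
# Hypothetical helpers, not Python builtins. They use only integer
# arithmetic, following the UTF-16 surrogate encoding rules.

def is_high_surrogate(n):
    """True if n is in the high (leading) surrogate range."""
    return 0xD800 <= n <= 0xDBFF

def is_low_surrogate(n):
    """True if n is in the low (trailing) surrogate range."""
    return 0xDC00 <= n <= 0xDFFF

def pair_to_code_point(hi, lo):
    """Combine a high/low surrogate pair into one code point."""
    if not (is_high_surrogate(hi) and is_low_surrogate(lo)):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

def code_point_to_pair(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

# The pair from the example above maps back to the original code point.
print(hex(pair_to_code_point(0xD800, 0xDC40)))  # 0x10040
```

Because the caller has to pass the two halves explicitly, there is no 
guessing about whether a pair was intended, which is the ambiguity ord() 
would otherwise face.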

The above is just my guess; I'd be interested to hear what others say.


-- 
Steven


