break unichr instead of fix ord?

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Aug 29 14:06:34 EDT 2009


On Sat, 29 Aug 2009 07:38:51 -0700, rurpy wrote:

>  > Then, the next question is "why is it implemented that way", to which
>  > the answer is "because the PEP says so".
> 
> Not at all a satisfying answer unless one believes in PEPal
> infallibility. :-)

Not at all. You don't have to believe that PEPs are infallible to accept 
the answer; you just have to understand that major changes to Python 
aren't made arbitrarily: they have to go through a PEP first. Even Guido 
himself has to write a PEP before making any major changes to the 
language. But PEPs aren't infallible; they can be challenged, rejected, 
withdrawn, or made obsolete by new PEPs.


> The reasons for the current behavior so far:
> 
> 1.
>> What you propose would break the property "unichr(i) always returns a
>> string of length one, if it returns anything at all".
> 
> Yes.  And I don't see the problem with that.  Why is that property more
> desirable than the non-existent property that a Unicode literal always
> produces one Python character?

What do you mean? Unicode literals don't always produce one character, 
e.g. u'abcd' is a Unicode literal with four characters.

I think it's fairly self-evident that a function called uniCHR [emphasis 
added] should return a single character (technically a single code 
point). But even if you can come up with a reason for unichr() to return 
two or more characters, this would break code that relies on the 
documented promise that the length of the output of unichr() is always 
one.
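
For the record, here is what a narrow build (sys.maxunicode == 65535) 
actually does today: unichr() simply refuses code points above U+FFFF 
rather than returning a surrogate pair.

>>> import sys
>>> sys.maxunicode
65535
>>> unichr(0x10040)
Traceback (most recent call last):
  ...
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>>> len(unichr(0xFFFF))
1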

> It would only occur on a narrow build
> with a unicode character outside of the bmp, exactly the condition a
> unicode literal can "behave differently" by producing two python
> characters.


> 2.
>> >  But there is no reason given [in the PEP] for that behavior.
>> Sure there is, right above the list:
>> "Most things will behave identically in the wide and narrow worlds."
>> That's the reason: scripts should work the same as much as possible in
>> wide and narrow builds.
> 
> So what else would work "differently"?  

unichr(n) sometimes would return one character and sometimes two; ord(c) 
would sometimes accept two characters and sometimes raise an exception. 
That's a fairly major difference.
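
To make that concrete, here is the current behaviour on a narrow build 
(a wide build gives len(s) == 1 and ord(s) == 65600):

>>> s = u'\U00010040'  # narrow build stores this as a surrogate pair
>>> len(s)
2
>>> ord(s)
Traceback (most recent call last):
  ...
TypeError: ord() expected a character, but string of length 2 found

Under the proposed extension, ord(s) would return 65600 on both builds, 
but only by special-casing strings of length two on narrow builds.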


> My point was that extending
> unichr/ord to work with all unicode characters reduces differences far
> more often than it increase them.

I don't see that at all. What differences do you think it would reduce?


> 3.
>>>       * There is a convention in the Unicode world for
>>>         encoding a 32-bit code point in terms of two 16-bit code
>>>         points. These are known as "surrogate pairs". Python's codecs
>>>         will adopt this convention.
>>>
>>>  Is a distinction made between Python and Python codecs with only the
>>>  latter having any knowledge of surrogate pairs?
>>
>> No. In the end, the Unicode type represents code units, not code
>> points, i.e. half surrogates are individually addressable. Codecs need
>> to adjust to that; in particular the UTF-8 and the UTF-32 codec in
>> narrow builds, and the UTF-16 codec in wide builds (which didn't exist
>> when the PEP was written).
> 
> OK, so that is not a reason either.

I think it is a very important reason. Python lets you address surrogate 
code points individually, so it can't tell whether the pair 
u'\ud800\udc40' represents the single character \U00010040 or two lone 
surrogates, \ud800 followed by \udc40.
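
The arithmetic behind the pair is just the standard UTF-16 convention, 
nothing Python-specific:

>>> n = 0x10040 - 0x10000
>>> hex(0xD800 + (n >> 10))  # high (lead) surrogate
'0xd800'
>>> hex(0xDC00 + (n & 0x3FF))  # low (trail) surrogate
'0xdc40'

Both interpretations produce exactly the same two code units, which is 
why the ambiguity can't be resolved after the fact.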


> 4.
> I'll speculate a little.
> If surrogate handling was added to ord/unichr, it would be the top of a
> slippery slope leading to demands that other string functions also
> handle surrogates.
> 
> But this is not true -- there is a strong distinction between ord/unichr
> and other string methods.  The latter deal with strings of multiple
> characters.  But the former deals only with single characters (taking a
> surrogate pair as a single unicode character.)

Strictly speaking, unichr() deals with code points, not characters, 
although the distinction is very fine.

>>> c = unichr(56384)
>>> len(c)
1
>>> import unicodedata
>>> unicodedata.category(c)
'Cs'

Cs is the general category for "Other, Surrogate", so \udc40 (56384 in 
decimal, the low surrogate from the example above) is not strictly 
speaking a character. Nevertheless, Python treats it as one.


> To reiterate, I am not advocating for any change.  I simply want to
> understand if there is a good reason for limiting the use of unichr/ord
> on narrow builds to a subset of the unicode characters that Python
> otherwise supports.  So far, it seems not and that unichr/ord is a
> poster child for "purity beats practicality".

On the contrary, it seems pretty impractical to me for ord() to sometimes 
accept strings of length two and sometimes raise an exception. I would 
much rather see a pair of new functions, wideord() and widechr(), used 
for converting between surrogate pairs and numbers.
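
Something along these lines, perhaps. This is only a sketch of what I 
mean, not a worked-out proposal; the names, signatures and error 
messages are all invented:

def widechr(i):
    # Like unichr(), but return a surrogate pair for code points
    # above U+FFFF (which unichr() refuses on a narrow build).
    if not 0 <= i <= 0x10FFFF:
        raise ValueError("widechr() arg not in range(0x110000)")
    if i <= 0xFFFF:
        return unichr(i)
    i -= 0x10000
    return unichr(0xD800 + (i >> 10)) + unichr(0xDC00 + (i & 0x3FF))

def wideord(s):
    # Like ord(), but also accept a two-character surrogate pair
    # and return the code point it encodes.
    if len(s) == 1:
        return ord(s)
    if (len(s) == 2 and 0xD800 <= ord(s[0]) <= 0xDBFF
            and 0xDC00 <= ord(s[1]) <= 0xDFFF):
        return (0x10000 + ((ord(s[0]) - 0xD800) << 10)
                + (ord(s[1]) - 0xDC00))
    raise TypeError("wideord() expected a character or surrogate pair")

With those, wideord(u'\ud800\udc40') returns 65600 (0x10040) on either 
build, and the regular ord() keeps its simple promise.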



-- 
Steven


