break unichr instead of fix ord?

Sat Aug 29 20:12:24 EDT 2009

On 08/29/2009 12:06 PM, Steven D'Aprano wrote:
[...]
>>  The reasons for the current behavior so far:
>>
>>  1.
>>>  What you propose would break the property "unichr(i) always returns a
>>>  string of length one, if it returns anything at all".
>>
>>  Yes.  And I don't see the problem with that.  Why is that property more
>>  desirable than the non-existent property that a Unicode literal always
>>  produces one python character?
>
> What do you mean? Unicode literals don't always produce one character,
> e.g. u'abcd' is a Unicode literal with four characters.

I'm sorry, I should have been clearer.  I meant the literal
representation of a *single* unicode character: u'\u4000',
which results in a string of length 1, vs u'\U00010040', which
results in a string of length 2 on a narrow build.  In both
cases the literal represents a single unicode code point.
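
To make the length difference concrete without depending on
how the interpreter was built, here is a minimal sketch that
counts UTF-16 code units directly, which is what len() reports
on a narrow build.  The helper name utf16_units is made up for
this illustration.

```python
def utf16_units(s):
    """Return the number of 16-bit code units needed to store s as UTF-16."""
    # Each UTF-16 code unit is two bytes; 'utf-16-le' adds no byte-order mark.
    return len(s.encode('utf-16-le')) // 2

bmp = u'\u4000'          # one code point inside the BMP
astral = u'\U00010040'   # one code point outside the BMP

print(utf16_units(bmp))     # 1
print(utf16_units(astral))  # 2
```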

> I think it's fairly self-evident that a function called uniCHR [emphasis
> added] should return a single character (technically a single code
> point).

There are two concepts of characters here: the 16-bit things
that encode a character in a Python unicode string (in a
narrow build Python), and a character in the sense of one
of the ~2**20 unicode characters.  Python has chosen to
represent the latter (when outside the BMP) as a pair of
surrogate characters from the former.  I don't see why one
would assume that CHR would mean the python 16-bit
character concept rather than the full unicode character
concept.  In fact, rather the opposite.
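
As an illustration of how one full unicode character maps onto
two of the 16-bit things, here is a minimal sketch of the
standard UTF-16 surrogate arithmetic.  The function name
to_surrogates is hypothetical, not something Python provides.

```python
def to_surrogates(cp):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000                      # 20 significant bits remain
    high = 0xD800 + (v >> 10)             # top 10 bits
    low = 0xDC00 + (v & 0x3FF)            # bottom 10 bits
    return high, low

high, low = to_surrogates(0x10040)
print("%04X %04X" % (high, low))          # D800 DC40
```

The largest code point, 0x10FFFF, comes out as (0xDBFF, 0xDFFF),
so a single code point never needs more than two code units.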

> But even if you can come up with a reason for unichr() to return
> two or more characters,

I've given a number of reasons why it should return a two
character representation of a non-BMP character, one of
which is that that is how Python has chosen to represent
such characters internally.  I won't repeat the other
reasons again.

I'm not sure why you think more than two characters
would ever be possible.

> this would break code that relies on the
> documented promise that the length of the output of unichr() is always
> one.

Ah, OK.  This is the good reason I was looking for.
I did not realize (until prompted by your remark
to go back and look at the early docs) that unichr
had been documented to return a single character
since 2.0 and that wide character support was added
in 2.2.  Martin v. Loewis also implied that, I now
see, although the implication was too deep for me
to pick up.

So although it leads to a suboptimal situation, I
agree that maintaining the documented behavior was
necessary.

[...]
> I would much rather see a pair of new functions, wideord() and
> widechr() used for converting between surrogate pairs and numbers.

I guess if it were still 2001 and Python 2.2 was
coming out I would be in favor of this too. :-)
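
For what it's worth, a rough sketch of what such a pair of
functions might look like follows.  Only the names wideord and
widechr come from the suggestion above; the signatures and the
error handling are assumptions.  This version always returns an
explicit surrogate pair for non-BMP code points, so on a build
whose strings are not UTF-16 the result is two lone surrogate
characters rather than the single character.

```python
try:
    unichr                 # Python 2
except NameError:
    unichr = chr           # Python 3 has no unichr; chr plays that role

def widechr(cp):
    """Return cp as one character (BMP) or as a surrogate pair (non-BMP)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x10000:
        return unichr(cp)
    v = cp - 0x10000
    return unichr(0xD800 + (v >> 10)) + unichr(0xDC00 + (v & 0x3FF))

def wideord(s):
    """Inverse of widechr: accept one character or one surrogate pair."""
    if len(s) == 1:
        return ord(s)
    if len(s) == 2:
        high, low = ord(s[0]), ord(s[1])
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    raise TypeError("expected one character or one surrogate pair")

print(len(widechr(0x10040)))            # 2
print(hex(wideord(widechr(0x10040))))   # 0x10040
```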