break unichr instead of fix ord?

"Martin v. Löwis" martin at v.loewis.de
Fri Aug 28 04:12:34 EDT 2009


> The PEP says:
>      * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
>        length-one string.
> 
>      * unichr(i) for 2**16 <= i <= TOPCHAR will return a
>        length-one string on wide Python builds. On narrow
>        builds it will raise ValueError.
> and
>      * ord() is always the inverse of unichr()
> 
> which of course we know; that is the current behavior.  But
> there is no reason given for that behavior.

Sure there is, right above the list:

"Most things will behave identically in the wide and narrow worlds."

That's the reason: scripts should work the same as much as possible
in wide and narrow builds.

What you propose would break the property "unichr(i) always returns
a string of length one, if it returns anything at all".

>    1) Should surrogate pairs be disallowed on narrow builds?
> That appears to have been answered in the negative and is
> not relevant to my question.

It is, as it does lead to inconsistencies between wide and narrow
builds. OTOH, it also allows the same source code to work on both
versions, so it also preserves the uniformity in a different way.

>      * every Python Unicode character represents exactly
>        one Unicode code point (i.e. Python Unicode
>        Character = Abstract Unicode character)
> 
> I'm not sure what this means (what's an abstract unicode
> character?).

I don't think this is actually the case, but I may be confusing
Unicode terminology here - "abstract character" is a term from
the Unicode standard.

> Finally we read:
> 
>      * There is a convention in the Unicode world for
>        encoding a 32-bit code point in terms of two
>        16-bit code points. These are known as
>        "surrogate pairs". Python's codecs will adopt
>        this convention.
> 
> Is a distinction made between Python and Python
> codecs with only the latter having any knowledge of
> surrogate pairs?

No. In the end, the Unicode type represents code units,
not code points, i.e. half surrogates are individually
addressable. Codecs need to adjust to that; in particular
the UTF-8 and the UTF-32 codec in narrow builds, and the
UTF-16 codec in wide builds (which didn't exist when the
PEP was written).

> Nothing else in the PEP seems remotely relevant.

Except for the motivation, of course :-)

In addition: your original question was "why has this
been changed", to which the answer is "it hasn't".
Then, the next question is "why is it implemented that
way", to which the answer is "because the PEP says so".
Only *then* the question is "what is the rationale for
the PEP specifying things the way it does". The PEP is
relevant so that we can both agree that Python behaves
correctly (in the sense of behaving as specified).

Regards,
Martin



More information about the Python-list mailing list