[Python-Dev] PyUnicode_GetMax() and PyUnicode_FromOrdinal() Was: Breaking undocumented API

Tue Nov 16 20:52:07 CET 2010

On Tue, Nov 16, 2010 at 1:57 PM, M.-A. Lemburg <mal at egenix.com> wrote:
> Alexander Belopolsky wrote:
>> On Tue, Nov 16, 2010 at 1:06 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>> ..
>>> Now, we can't use a macro for [PyUnicode_GetMax()], since the information has
>>> to be available as a callable in order for applications or extensions
>>> to use it (without a recompile).
>>>
>>
>> .. but it *is* a macro resolving to either PyUnicodeUCS2_GetMax or
>> PyUnicodeUCS4_GetMax.
>
> That doesn't count :-) It's only a trick to prevent external code
> from using the wrong Unicode APIs.
>
> There still is a real function behind the renaming.
>
>> What is the scenario in which we may want to change
>> what PyUnicodeUCS?_GetMax returns and have extensions pick up the
>> change without a recompile?
>
> If an extension uses the stable ABI, it will want to know
> whether the interpreter was built for UCS2 or UCS4 (even if
> it doesn't use the Unicode APIs directly).
>
>> The UCS2 case will certainly never change
>> since it is already 0xFFFF.  Is it possible that UCS4 will be expanded
>> beyond 0x10FFFF?
>
> Well, the Unicode Consortium decided to not go beyond 0x10FFFF,
> but then you never know... when they started out on the quest,
> 16 bits appeared more than enough, but they found out relatively
> quickly that the Asian scripts had enough code points to easily
> fill that space.
>
> Once space is available, it tends to get used sooner or later :-)
>
>> Note that we can have both a macro and a function
>> version.  This is fairly standard practice in Python C-API.
>
> Sure, but what for ?

Note that PyUnicode_FromOrdinal()  is documented (in unicodeobject.h)
as follows without a reference to PyUnicode_GetMax():

"""
   Create a Unicode Object from the given Unicode code point ordinal.

   The ordinal must be in range(0x10000) on narrow Python builds
   (UCS2), and range(0x110000) on wide builds (UCS4). A ValueError is
   raised in case it is not.
"""

The implementation, however, checks only the UCS4 range, regardless of
the build:

    if (ordinal < 0 || ordinal > 0x10ffff) {
        PyErr_SetString(PyExc_ValueError,
                        "chr() arg not in range(0x110000)");
        return NULL;
    }

This actually looks like a bug:

>>> len(chr(0x10FFFF))
2

(on a UCS2 build.)
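
The length of 2 is presumably the result of the ordinal being stored as
a UTF-16 surrogate pair. A minimal standalone sketch of that encoding
follows; encode_utf16() is a hypothetical helper written for
illustration, not a function from the CPython source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of CPython: encode one code point as
   UTF-16 and return the number of 16-bit units written (1 or 2). */
static size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);          /* precondition: valid Unicode range */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;       /* fits in a single 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                   /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For 0x10FFFF this yields the two units 0xDBFF and 0xDFFF, which matches
the len() of 2 seen above if the narrow build counts 16-bit units.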

Also, I think PyUnicode_FromOrdinal() should take a Py_UNICODE argument
rather than an int.
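
To make the mismatch with the header comment concrete, here is a rough
sketch of a range check that follows the documented contract. The names
NARROW_MAX, WIDE_MAX and ordinal_in_documented_range() are made up for
this example; the real code would presumably get the limit from the
build configuration (or from PyUnicode_GetMax()) rather than from a
parameter:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two build flavours; not CPython names. */
enum { NARROW_MAX = 0xFFFF, WIDE_MAX = 0x10FFFF };

/* Return true when `ordinal` is within the range the header comment
   documents for a build whose largest ordinal is `build_max`. */
static bool ordinal_in_documented_range(int ordinal, int build_max)
{
    assert(build_max == NARROW_MAX || build_max == WIDE_MAX);
    return ordinal >= 0 && ordinal <= build_max;
}
```

With this check, 0x10FFFF would be rejected on a narrow build and
accepted on a wide one, which is what the comment in unicodeobject.h
promises.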