[issue45025] Reliance on C bit fields in C API is undefined behavior

STINNER Victor report at bugs.python.org
Mon Aug 30 11:37:53 EDT 2021


STINNER Victor <vstinner at python.org> added the comment:

> PyUnicode_KIND does *not* expose the implementation details to the programmer.

PyUnicode_KIND() is very specific to the exact PEP 393 implementation. Documentation of this field:
---
/* Character size:

   - PyUnicode_WCHAR_KIND (0):

     * character type = wchar_t (16 or 32 bits, depending on the
       platform)

   - PyUnicode_1BYTE_KIND (1):

     * character type = Py_UCS1 (8 bits, unsigned)
     * all characters are in the range U+0000-U+00FF (latin1)
     * if ascii is set, all characters are in the range U+0000-U+007F
       (ASCII), otherwise at least one character is in the range
       U+0080-U+00FF

   - PyUnicode_2BYTE_KIND (2):

     * character type = Py_UCS2 (16 bits, unsigned)
     * all characters are in the range U+0000-U+FFFF (BMP)
     * at least one character is in the range U+0100-U+FFFF

   - PyUnicode_4BYTE_KIND (4):

     * character type = Py_UCS4 (32 bits, unsigned)
     * all characters are in the range U+0000-U+10FFFF
     * at least one character is in the range U+10000-U+10FFFF
 */
unsigned int kind:3;
---

I don't think that PyUnicode_KIND() makes sense if CPython uses UTF-8 tomorrow.


> If the internal representation of strings is switched to use masks and shifts instead of bitfields, PyUnicode_KIND (and others) can be adapted to the new details without breaking API compatibility.
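
For reference, a rough sketch of what "masks and shifts instead of bitfields" could look like. This is not the CPython layout: the struct names, the bit positions and the helper functions are all made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout A: the kind lives in a 3-bit C bit field. */
typedef struct {
    unsigned int interned : 2;
    unsigned int kind : 3;
} state_bitfield;

/* Hypothetical layout B: the same information packed by hand.
   Assumed bit positions: bits 0-1 = interned, bits 2-4 = kind. */
typedef struct {
    uint32_t flags;
} state_masked;

#define KIND_SHIFT 2u
#define KIND_MASK  (0x7u << KIND_SHIFT)

/* Reads the kind through the bit field. */
static inline unsigned int kind_from_bitfield(const state_bitfield *s)
{
    return s->kind;
}

/* Reads the kind with an explicit mask and shift. */
static inline unsigned int kind_from_mask(const state_masked *s)
{
    return (s->flags & KIND_MASK) >> KIND_SHIFT;
}

/* Stores a new kind without touching the other bits. */
static inline void set_kind_masked(state_masked *s, unsigned int kind)
{
    assert(kind <= 7u);
    s->flags = (s->flags & ~KIND_MASK) | ((uint32_t)kind << KIND_SHIFT);
}
```

Both accessors return the same value for the same logical state, so a caller that only goes through an accessor cannot tell which layout is used.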

PyUnicode_KIND() was exposed in the *public* C API because unicodeobject.h provides functions as macros for best performance, and these macros use PyUnicode_KIND() internally.

Macros like PyUnicode_READ(kind, data, index) are also designed for best performance with the exact PEP 393 implementation.
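
To make the coupling concrete, here is a simplified model of a kind-based read. It is only a sketch: toy_kind and toy_read() are hypothetical names, not the real macros, and only the three fixed-width kinds from the comment quoted above are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PEP 393 kinds. The values equal the
   character width in bytes, as in the field documentation. */
enum toy_kind { TOY_1BYTE = 1, TOY_2BYTE = 2, TOY_4BYTE = 4 };

/* The caller has to pass the kind and the raw buffer, so both
   representation details become part of the calling convention. */
static inline uint32_t toy_read(enum toy_kind kind, const void *data,
                                size_t index)
{
    switch (kind) {
    case TOY_1BYTE: return ((const uint8_t *)data)[index];
    case TOY_2BYTE: return ((const uint16_t *)data)[index];
    case TOY_4BYTE: return ((const uint32_t *)data)[index];
    }
    assert(!"unknown kind");
    return 0;
}
```

The read itself is a single array access, which is why it is fast, but every caller now depends on the string being stored as a fixed-width array.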

The public C API should only contain PyUnicode_READ_CHAR(unicode, index): this macro doesn't use "kind" or "data", which are (again) specific to PEP 393.

In the CPython implementation, we should use the most efficient code; it's fine to use macros that access structures directly.

But for the public C API, I would recommend providing only abstractions, even if they are a little bit slower.
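
As an illustration of the trade-off, here is a sketch of an accessor that takes only (string, index) and hides a UTF-8 representation behind it. Everything here is hypothetical: toy_str and toy_str_read_char() are invented names, the input is assumed to be well-formed UTF-8, and the linear scan is the naive approach, not a proposal for how CPython would do it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string type; callers are expected to treat it as opaque. */
typedef struct {
    const unsigned char *utf8;  /* assumed to be well-formed UTF-8 */
    size_t nbytes;
} toy_str;

/* Returns the code point at character position `index`, or UINT32_MAX
   if `index` is past the end. Linear scan: slower than an array read,
   but the caller never sees "kind" or "data". */
static uint32_t toy_str_read_char(const toy_str *s, size_t index)
{
    size_t pos = 0;
    while (pos < s->nbytes) {
        unsigned char b = s->utf8[pos];
        /* Sequence length from the lead byte. */
        size_t len = (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
        if (index == 0) {
            assert(pos + len <= s->nbytes);
            if (len == 1)
                return b;
            /* Keep the payload bits of the lead byte, then append
               6 bits from each continuation byte. */
            uint32_t cp = b & (0xFFu >> (len + 1));
            for (size_t i = 1; i < len; i++)
                cp = (cp << 6) | (s->utf8[pos + i] & 0x3Fu);
            return cp;
        }
        index--;
        pos += len;
    }
    return UINT32_MAX;
}
```

The same signature could sit on top of a fixed-width array instead; code written against it would not need to change, at the cost of a function call and, in this naive version, an O(n) lookup.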

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45025>
_______________________________________

