[Python-Dev] PEP 393 review

Thu Aug 25 00:29:19 CEST 2011

> With this PEP, the unicode object overhead grows to 10 pointer-sized
> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
> Does it have any adverse effects?

For pure ASCII, it might be possible to use a shorter struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    /* no more utf8_length, utf8, str */
    /* followed by ascii data */
} _PyASCIIObject;
(-2 pointers, -1 Py_ssize_t: 56 bytes)

=> "a" is 58 bytes (with utf8 for free, without wchar_t)
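
To make the "followed by ascii data" idea concrete, here is a minimal sketch of a single-allocation layout. It does not use the real CPython headers: ascii_sketch, ascii_sketch_new and ascii_sketch_utf8 are hypothetical names, ptrdiff_t stands in for Py_ssize_t and Py_hash_t, and two plain fields stand in for PyObject_HEAD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the proposed ASCII-only struct. */
typedef struct {
    ptrdiff_t refcnt;       /* stand-in for PyObject_HEAD (assumed layout) */
    void *type;
    ptrdiff_t length;
    ptrdiff_t hash;
    int state;
    ptrdiff_t wstr_length;
    wchar_t *wstr;
    /* the ASCII bytes and a trailing NUL follow the struct */
} ascii_sketch;

/* Total bytes for a string of `length` ASCII characters: header + data + NUL. */
static size_t ascii_sketch_size(size_t length)
{
    return sizeof(ascii_sketch) + length + 1;
}

/* Allocate the header and the character data in one malloc. */
static ascii_sketch *ascii_sketch_new(const char *text)
{
    size_t length = strlen(text);
    ascii_sketch *obj = malloc(ascii_sketch_size(length));
    if (obj == NULL)
        return NULL;
    memset(obj, 0, sizeof(*obj));
    obj->refcnt = 1;
    obj->length = (ptrdiff_t)length;
    obj->hash = -1;                               /* hash not computed yet */
    memcpy((char *)(obj + 1), text, length + 1);  /* data sits right after the header */
    return obj;
}

/* ASCII is already valid UTF-8, so the inline data doubles as the utf8 buffer. */
static const char *ascii_sketch_utf8(const ascii_sketch *obj)
{
    assert(obj != NULL);
    return (const char *)(obj + 1);
}
```

With 8-byte pointers and a 4-byte int, ascii_sketch_size(1) should come out at 56 + 2 bytes, which is where the 58 above comes from.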

For objects allocated with the new API, we can use a shorter struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    /* no more str pointer */
    /* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)

=> "é" is 74 bytes (without utf8 / wchar_t)

For the legacy API:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)

=> "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)
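
The three sizes can be checked with sizeof. The following is only a sketch under assumptions: the *_layout names and the SKETCH_* macros are hypothetical, ptrdiff_t stands in for Py_ssize_t and Py_hash_t, and PyObject_HEAD is approximated by a reference count plus a type pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for PyObject_HEAD (assumed: refcount + type pointer). */
#define SKETCH_HEAD ptrdiff_t refcnt; void *type;

#define SKETCH_COMMON \
    SKETCH_HEAD \
    ptrdiff_t length; \
    ptrdiff_t hash; \
    int state; \
    ptrdiff_t wstr_length; \
    wchar_t *wstr;

/* ASCII-only variant: no utf8_length, utf8 or str. */
typedef struct { SKETCH_COMMON } ascii_layout;

/* New-API variant: no str pointer, data follows the struct. */
typedef struct { SKETCH_COMMON ptrdiff_t utf8_length; char *utf8; } new_layout;

/* Legacy variant: keeps every field. */
typedef struct { SKETCH_COMMON ptrdiff_t utf8_length; char *utf8; void *str; } legacy_layout;

/* Check the savings relative to the legacy struct. */
static void check_layout_savings(void)
{
    /* Dropping str saves one pointer. */
    assert(sizeof(legacy_layout) - sizeof(new_layout) == sizeof(void *));
    /* Dropping utf8_length and utf8 as well saves one ssize_t and one more pointer. */
    assert(sizeof(new_layout) - sizeof(ascii_layout)
           == sizeof(ptrdiff_t) + sizeof(char *));
}

static void print_layout_sizes(void)
{
    printf("ascii:  %zu bytes\n", sizeof(ascii_layout));
    printf("new:    %zu bytes\n", sizeof(new_layout));
    printf("legacy: %zu bytes\n", sizeof(legacy_layout));
}
```

On a typical 64-bit build (8-byte pointers, 4-byte int padded to 8) print_layout_sizes() should report 56, 72 and 80; other ABIs may differ, which is why the checks are written as differences.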

The current struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_UNICODE *str;
    Py_hash_t hash;
    int state;
    PyObject *defenc;
} PyUnicodeObject;

=> "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is 
wchar_t)

... but the code (maybe only the macros?) and debugging will be more complex.

> Will the format codes returning a Py_UNICODE pointer with
> PyArg_ParseTuple be deprecated?

Because Python 2.x is still dominant and it's already hard enough to port C 
modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).

> Do you think the wstr representation could be removed in some future
> version of Python?

Conversion to wchar_t* is common, especially on Windows. But I don't know if 
we *have to* cache the result. Is it cached, by the way? Or is wstr only used 
when a string is created from Py_UNICODE?
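
To illustrate what caching the result would mean, here is a hypothetical sketch, not what the PEP implementation does: the wchar_t buffer is built on the first request and reused afterwards. cached_sketch and cached_sketch_as_wchar are made-up names, and the string is assumed to hold Latin-1 data, where each byte equals its code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical object: canonical Latin-1 data plus an optional wchar_t cache. */
typedef struct {
    size_t length;
    const unsigned char *latin1;   /* one byte per character */
    wchar_t *wstr;                 /* NULL until a wchar_t* is first requested */
} cached_sketch;

/* Return the wchar_t form, converting only on the first call. */
static const wchar_t *cached_sketch_as_wchar(cached_sketch *s)
{
    assert(s != NULL);
    if (s->wstr == NULL) {
        wchar_t *buf = malloc((s->length + 1) * sizeof(wchar_t));
        if (buf == NULL)
            return NULL;
        for (size_t i = 0; i < s->length; i++)
            buf[i] = (wchar_t)s->latin1[i];   /* Latin-1 byte == code point */
        buf[s->length] = L'\0';
        s->wstr = buf;                        /* keep it for later calls */
    }
    return s->wstr;
}
```

The trade-off is the one asked about above: the cache saves repeated conversions but keeps a second copy of the string alive for as long as the object lives. Without it, each caller would get (and free) its own buffer.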

Victor