[Python-Dev] PEP 393 review
Victor Stinner
victor.stinner at haypocalc.com
Thu Aug 25 00:29:19 CEST 2011
> With this PEP, the unicode object overhead grows to 10 pointer-sized
> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
> Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    /* no more utf8_length, utf8, str */
    /* followed by ascii data */
} _PyASCIIObject;
(-2 pointers, -1 Py_ssize_t: 56 bytes)
=> "a" is 58 bytes (with utf8 for free, without wchar_t)
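The 56 and 58 byte figures can be checked with a small stand-alone sketch. It does not include Python.h: head_sketch is an assumed stand-in for PyObject_HEAD (a refcount plus a type pointer), ptrdiff_t stands in for Py_ssize_t and Py_hash_t, and the names ascii_sketch and ascii_alloc_size are made up for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed stand-in for PyObject_HEAD: refcount + type pointer. */
typedef struct {
    ptrdiff_t ob_refcnt;
    void *ob_type;
} head_sketch;

/* Hypothetical mirror of the ASCII-only header proposed above. */
typedef struct {
    head_sketch head;
    ptrdiff_t length;       /* stands in for Py_ssize_t */
    ptrdiff_t hash;         /* stands in for Py_hash_t */
    int state;              /* padded to pointer size on 64-bit ABIs */
    ptrdiff_t wstr_length;
    wchar_t *wstr;
} ascii_sketch;

/* Bytes needed for a single allocation: header, characters, trailing NUL. */
static size_t ascii_alloc_size(const char *s)
{
    assert(s != NULL);
    return sizeof(ascii_sketch) + strlen(s) + 1;
}
```

On a typical 64-bit ABI, sizeof(ascii_sketch) should come out as 7 pointer-sized words, so ascii_alloc_size("a") gives the header plus 2 bytes, in one malloc.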
For objects allocated with the new API, we can use a shorter struct:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    /* no more str pointer */
    /* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)
=> "é" is 74 bytes (without utf8 / wchar_t)
For the legacy API:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)
=> "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)
The current struct:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_UNICODE *str;
    Py_hash_t hash;
    int state;
    PyObject *defenc;
} PyUnicodeObject;
=> "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is
wchar_t)
... but the code (maybe only the macros?) and debugging will be more complex.
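To give an idea of the extra complexity, here is a rough sketch of what a "get the character data" macro might look like with three layouts. Everything in it is hypothetical: the *_sketch types, the LAYOUT_* values, SKETCH_DATA and the idea of storing the layout kind directly in the state field are assumptions for illustration, not what the PEP specifies. head_sketch again stands in for PyObject_HEAD.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for PyObject_HEAD. */
typedef struct { ptrdiff_t ob_refcnt; void *ob_type; } head_sketch;

/* Hypothetical layout tags, stored in 'state' for this sketch only. */
enum layout_sketch { LAYOUT_ASCII, LAYOUT_NEW, LAYOUT_LEGACY };

typedef struct {
    head_sketch head;
    ptrdiff_t length;
    ptrdiff_t hash;
    int state;              /* holds one layout_sketch value here */
    ptrdiff_t wstr_length;
    wchar_t *wstr;
} ascii_sketch;             /* ASCII data follows the struct */

typedef struct {
    ascii_sketch base;
    ptrdiff_t utf8_length;
    char *utf8;
} new_sketch;               /* latin1/ucs2/ucs4 data follows the struct */

typedef struct {
    new_sketch base;
    void *str;              /* data lives in a second allocation */
} legacy_sketch;

/* The data pointer now depends on which of the three layouts is in use. */
#define SKETCH_DATA(op) \
    (((ascii_sketch *)(op))->state == LAYOUT_ASCII \
         ? (void *)((ascii_sketch *)(op) + 1) \
     : ((ascii_sketch *)(op))->state == LAYOUT_NEW \
         ? (void *)((new_sketch *)(op) + 1) \
         : ((legacy_sketch *)(op))->str)

/* Sanity check: each layout only adds its own extra fields. */
static void sketch_check_sizes(void)
{
    assert(sizeof(new_sketch) ==
           sizeof(ascii_sketch) + sizeof(ptrdiff_t) + sizeof(char *));
    assert(sizeof(legacy_sketch) == sizeof(new_sketch) + sizeof(void *));
}
```

With the current struct, the equivalent is just a read of the str field. Here every access has to test the layout first, and a debugger has to be told which struct to cast the object to.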
> Will the format codes returning a Py_UNICODE pointer with
> PyArg_ParseTuple be deprecated?
Because Python 2.x is still dominant and it's already hard enough to port C
modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).
> Do you think the wstr representation could be removed in some future
> version of Python?
Conversion to wchar_t* is common, especially on Windows. But I don't know if
we *have to* cache the result. Is it cached by the way? Or is wstr only used
when a string is created from Py_UNICODE?
Victor