[Python-Dev] The future of the wchar_t cache

Sat Oct 20 09:01:52 EDT 2018

Serhiy Storchaka wrote on 20.10.2018 at 13:06:
> Currently the PyUnicode object contains two caches: for UTF-8
> representation and for wchar_t representation. They are needed not for
> optimization but for supporting C API which returns borrowed references for
> such representations.
> 
> The UTF-8 cache has always been part of unicode objects (although in
> Python 2 it was not a UTF-8 cache but an 8-bit representation cache).
> Initially it was needed for compatibility with 8-bit str, for implementing
> the "s" and "z" format units in PyArg_Parse(). Now it is also used for
> PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize().
> 
> The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the
> former Py_UNICODE representation. Py_UNICODE is now defined as an alias of
> wchar_t, and the C API that used to return a pointer to the Py_UNICODE
> content now returns a pointer to the cached wchar_t representation. This
> affects the "u" and "Z" format units in PyArg_Parse(),
> PyUnicode_AsUnicode(), PyUnicode_AsUnicodeAndSize(), PyUnicode_GET_SIZE(),
> PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE(), PyUnicode_AS_DATA().
> 
> All this increases the size of the unicode object. It includes the constant
> overhead of the additional pointer and size fields, and the overhead of the
> cached representation, which is proportional to the string length. The
> following table shows the number of bytes per character for the different
> kinds, with and without the specified caches filled.
> 
>        raw  +utf8     +wchar_t       +utf8+wchar_t
>                    Windows  Linux   Windows  Linux
> ASCII   1     1       3       5        3       5
> UCS1    1    2-3      3       5       4-5     6-7
> UCS2    2    3-5      2       6       3-5     7-9
> UCS4    4    5-8     6-8      4       7-12    5-8
> 
> There is also a new C API, added in 3.3, for getting the wchar_t
> representation without relying on the cache: PyUnicode_AsWideChar() and
> PyUnicode_AsWideCharString(). Currently their implementation uses the
> cache, which has both benefits and drawbacks.
> 
> The old Py_UNICODE-based API is deprecated and will be removed eventually.
> I want to ask about the future of the wchar_t cache. Is the benefit of
> caching the wchar_t representation larger than the disadvantage of spending
> more memory? The wchar_t representation is as natural for the Windows API
> as the UTF-8 representation is for the POSIX API. But in all other cases it
> is just a waste of memory. Are there reasons for keeping the wchar_t cache
> after removing the deprecated API?
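
For reference, the per-character figures in the quoted table can be
reproduced with a few lines of Python. This is only a sketch of the
arithmetic as I read it, not anything taken from CPython: the names KINDS,
wchar_extra() and bytes_per_char() are made up for illustration, and it
assumes a cache costs nothing when it can share the raw buffer (ASCII for
UTF-8, or a kind whose width equals sizeof(wchar_t)).

```python
# Illustrative arithmetic only; none of these names are CPython APIs.

# kind name -> (raw bytes per char, (min, max) UTF-8 bytes per char)
KINDS = {
    "ASCII": (1, (1, 1)),
    "UCS1": (1, (1, 2)),
    "UCS2": (2, (1, 3)),
    "UCS4": (4, (1, 4)),
}


def utf8_extra(kind):
    """Extra bytes per char for a filled UTF-8 cache, as (min, max)."""
    if kind == "ASCII":
        return (0, 0)  # assumption: ASCII data doubles as its own UTF-8 form
    return KINDS[kind][1]


def wchar_extra(kind, wchar_size):
    """Extra bytes per char for a filled wchar_t cache, as (min, max)."""
    raw = KINDS[kind][0]
    if raw == wchar_size:
        return (0, 0)  # assumption: same width, so the raw buffer is shared
    if raw > wchar_size:
        # UCS4 data with a 2-byte wchar_t: one unit for BMP characters,
        # two units (a surrogate pair) for everything else.
        return (wchar_size, 2 * wchar_size)
    return (wchar_size, wchar_size)


def bytes_per_char(kind, utf8=False, wchar_size=None):
    """Return (min, max) bytes per character for the given configuration.

    wchar_size is sizeof(wchar_t): 2 on Windows, 4 on Linux, or None if
    the wchar_t cache is not filled.
    """
    lo = hi = KINDS[kind][0]
    if utf8:
        extra_lo, extra_hi = utf8_extra(kind)
        lo, hi = lo + extra_lo, hi + extra_hi
    if wchar_size is not None:
        extra_lo, extra_hi = wchar_extra(kind, wchar_size)
        lo, hi = lo + extra_lo, hi + extra_hi
    return (lo, hi)


if __name__ == "__main__":
    for kind in KINDS:
        print(kind,
              bytes_per_char(kind),
              bytes_per_char(kind, utf8=True),
              bytes_per_char(kind, wchar_size=2),
              bytes_per_char(kind, wchar_size=4),
              bytes_per_char(kind, utf8=True, wchar_size=2),
              bytes_per_char(kind, utf8=True, wchar_size=4))
```

Each result is a (min, max) pair; where both values are equal, the table
shows a single number.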

I'd be happy to get rid of it. But regarding the use under Windows, I
wonder if there's interest in keeping it as a special Windows-only feature,
e.g. to speed up the data exchange with the Win32 APIs. I guess it would
have to provide a visible (performance?) advantage to justify such special
casing over the code removal.
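
To make the trade-off concrete: without a cache, every hand-off to a wide
character API would need a fresh, caller-owned copy such as the one
PyUnicode_AsWideCharString() returns. Below is a rough sketch that drives
that function from Python through ctypes. It assumes CPython (it relies on
ctypes.pythonapi), and the helper name to_wide_copy() is made up for this
example. Timing something like this against a cached pointer would be one
way to look for the advantage in question.

```python
import ctypes

# CPython only: ctypes.pythonapi exposes the C API of the running interpreter.
_api = ctypes.pythonapi

# wchar_t *PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
_api.PyUnicode_AsWideCharString.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
# Use a raw address so that ctypes does not convert the result itself.
_api.PyUnicode_AsWideCharString.restype = ctypes.c_void_p

# void PyMem_Free(void *p)
_api.PyMem_Free.argtypes = [ctypes.c_void_p]
_api.PyMem_Free.restype = None


def to_wide_copy(text):
    """Return (round-tripped text, length in wchar_t units).

    Hypothetical helper: it requests a freshly allocated wchar_t buffer,
    reads it back, and then releases it. The buffer belongs to the caller,
    so it has to be freed with PyMem_Free().
    """
    size = ctypes.c_ssize_t()
    buf = _api.PyUnicode_AsWideCharString(text, ctypes.byref(size))
    if not buf:
        raise MemoryError("PyUnicode_AsWideCharString() returned NULL")
    try:
        # Read size.value wchar_t units starting at the returned address.
        return ctypes.wstring_at(buf, size.value), size.value
    finally:
        _api.PyMem_Free(buf)


if __name__ == "__main__":
    print(to_wide_copy("wchar_t cache"))
```

The length is counted in wchar_t units, so for characters outside the BMP
it differs between platforms with a 2-byte and a 4-byte wchar_t.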

Stefan