[Python-Dev] The future of the wchar_t cache
Stefan Behnel
stefan_ml at behnel.de
Sat Oct 20 09:01:52 EDT 2018
Serhiy Storchaka schrieb am 20.10.2018 um 13:06:
> Currently the PyUnicode object contains two caches: for UTF-8
> representation and for wchar_t representation. They are needed not for
> optimization but for supporting C API which returns borrowed references for
> such representations.
>
> The UTF-8 cache always was in unicode objects (but in Python 2 it was not a
> UTF-8 cache, but a 8-bit representation cache). Initially it was needed for
> compatibility with 8-bit str, for implementing the "s" and "z" format units
> in PyArg_Parse(). Now it is used also for PyUnicode_AsUTF8() and
> PyUnicode_AsUTF8AndSize().
>
> The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the
> former Py_UNICODE representation. Now Py_UNICODE is defined as an alias of
> wchar_t, and the C API which returned a pointer to Py_UNICODE content
> returns now a pointer to the cached wchar_t representation. It is the "u"
> and "Z" format units in PyArg_Parse(), PyUnicode_AsUnicode(),
> PyUnicode_AsUnicodeAndSize(), PyUnicode_GET_SIZE(),
> PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE(), PyUnicode_AS_DATA().
>
> All this increase the size of the unicode object. It includes the constant
> overhead of additional pointer and size fields, and the overhead of the
> cached representation proportional to the string length. The following
> table contains number of bytes per character for different kinds, with and
> without filling specified caches.
>
> raw +utf8 +wchar_t +utf8+wchar_t
> Windows Linux Windows Linux
> ASCII 1 1 3 5 3 5
> UCS1 1 2-3 3 5 4-5 6-7
> UCS2 2 3-5 2 6 3-5 7-9
> UCS4 4 5-8 6-8 4 7-12 5-8
>
> There is also a new C API added in 3.3 for getting wchar_t representation
> without using the cache: PyUnicode_AsWideChar() and
> PyUnicode_AsWideCharString(). Currently it uses the cache, this has
> benefits and disadvantages.
>
> Old Py_UNICODE based API is deprecated, and will be removed eventually.
> I want to ask about the future of the wchar_t cache. Is the benefit of
> caching the wchar_t representation larger the disadvantage of spending more
> memory? The wchar_t representation is so natural for Windows API as the
> UTF8 representation for POSIX API. But in all other cases it is just waste
> of memory. Are there reasons of keeping the wchar_t cache after removing
> the deprecated API?
I'd be happy to get rid of it. But regarding the use under Windows, I
wonder if there's interest in keeping it as a special Windows-only feature,
e.g. to speed up the data exchange with the Win32 APIs. I guess it would
have to provide a visible (performance?) advantage to justify such special
casing over the code removal.
Stefan
More information about the Python-Dev
mailing list