[Python-Dev] PEP 393: Special-casing ASCII-only strings

Thu Sep 15 21:48:01 CEST 2011

On Thu, Sep 15, 2011 at 8:50 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> In reviewing memory usage, I found potential for saving more memory for
> ASCII-only strings. Both Victor and Guido commented that something like
> this should be done; Antoine had asked whether there was anything that could
> be done. Here is the idea:
>
> In an ASCII-only string, the UTF-8 representation is shared with the
> canonical one-byte representation. This would allow dropping the
> UTF-8 pointer and the UTF-8 length field; instead, a flag in the state
> would indicate that these fields are not there.
>
> Likewise, the wchar_t/Py_UNICODE length can be shared (even though the
> data cannot), since the ASCII-only string won't contain any surrogate
> pairs.
>
> To comply with the C aliasing rules, the structures would look like this:
>
> typedef struct {
>    PyObject_HEAD
>    Py_ssize_t length;
>    union {
>        void *any;
>        Py_UCS1 *latin1;
>        Py_UCS2 *ucs2;
>        Py_UCS4 *ucs4;
>    } data;
>    Py_hash_t hash;
>    int state;     /* may include SSTATE_SHORT_ASCII flag */
>    wchar_t *wstr;
> } PyASCIIObject;
>
>
> typedef struct {
>    PyASCIIObject _base;
>    Py_ssize_t utf8_length;
>    char *utf8;
>    Py_ssize_t wstr_length;
> } PyUnicodeObject;
>
> Code that directly accesses the structures would become more
> complex; code that uses the accessor macros wouldn't notice.
>
> As a result, ASCII-only strings would lose three pointers,
> and shrink to their 3.2 structure size. Since they also save
> in the individual characters, strings with more than
> 3 characters (16-bit Py_UNICODE) or more than one character
> (32-bit Py_UNICODE) would see a total size reduction compared
> to 3.2.
>
> Objects created through the legacy API (PyUnicode_FromUnicode)
> that are only later found to be ASCII-only (in PyUnicode_Ready)
> would still have the UTF-8 pointer shared with the data pointer,
> but keep including separate fields for pointer & size.
>
> What do you think?
>
> Regards,
> Martin
>
> P.S. There are similar reductions that could be applied
> to the wstr_length in general: on 32-bit wchar_t systems,
> it could be always dropped, on a 16-bit wchar_t system,
> it could be dropped for UCS-2 strings. However, I'm not
> proposing these, as I think the increase in complexity
> is not worth the savings.
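
A minimal sketch of the two-struct layout quoted above, to make concrete how accessors could hide the difference from callers. It assumes nothing about the real CPython headers: every Sketch* and sketch_* name is hypothetical, the flag value is invented for illustration, and the header fields merely stand in for PyObject_HEAD.

```c
#include <assert.h>
#include <stddef.h>   /* ptrdiff_t, wchar_t, NULL */

/* Stand-ins so the sketch compiles without Python.h (all names hypothetical). */
typedef ptrdiff_t sketch_ssize_t;       /* plays the role of Py_ssize_t */
#define SKETCH_STATE_ASCII 0x1          /* hypothetical "short ASCII" flag bit */

/* Compact layout: enough for an ASCII-only string. */
typedef struct {
    sketch_ssize_t refcnt;              /* stand-in for PyObject_HEAD */
    void *type;
    sketch_ssize_t length;              /* number of code points */
    void *data;                         /* one byte per character when ASCII */
    sketch_ssize_t hash;
    int state;                          /* may carry SKETCH_STATE_ASCII */
    wchar_t *wstr;                      /* lazily created wchar_t copy, or NULL */
} SketchASCII;

/* Full layout: the compact struct is the first member, so a pointer to a
   full object can also be used as a pointer to its base. */
typedef struct {
    SketchASCII base;
    sketch_ssize_t utf8_length;
    char *utf8;
    sketch_ssize_t wstr_length;
} SketchUnicode;

/* Accessors pick the right field based on the flag. When the flag is clear,
   the object is assumed to have been allocated as a SketchUnicode. */
static const char *sketch_utf8(const SketchASCII *s)
{
    assert(s != NULL);
    if (s->state & SKETCH_STATE_ASCII)
        return (const char *)s->data;           /* ASCII bytes are valid UTF-8 */
    return ((const SketchUnicode *)s)->utf8;
}

static sketch_ssize_t sketch_utf8_length(const SketchASCII *s)
{
    assert(s != NULL);
    if (s->state & SKETCH_STATE_ASCII)
        return s->length;                       /* one byte per character */
    return ((const SketchUnicode *)s)->utf8_length;
}

static sketch_ssize_t sketch_wstr_length(const SketchASCII *s)
{
    assert(s != NULL);
    if (s->state & SKETCH_STATE_ASCII)
        return s->length;                       /* no surrogate pairs in ASCII */
    return ((const SketchUnicode *)s)->wstr_length;
}
```

On common ABIs where sketch_ssize_t and pointers have the same size, sizeof(SketchUnicode) - sizeof(SketchASCII) comes out to three pointer-sized words, which corresponds to the "three pointers" saved in the proposal.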

This sounds like a good plan.

-- 
--Guido van Rossum (python.org/~guido)