Flexible string representation, unicode, typography, ...

Fri Aug 31 11:13:40 EDT 2012

On Fri, Aug 31, 2012 at 6:32 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> That's one thing that I'm unclear about -- under what circumstances will
> a string be in compact versus non-compact form?

As I understand it, that depends entirely on which API is used to
construct the string.  The legacy API generates legacy strings, and
the new API (PyUnicode_New) generates compact strings.  From the
comments in unicodeobject.h:

    /* ASCII-only strings created through PyUnicode_New use the PyASCIIObject
    structure. state.ascii and state.compact are set, and the data
    immediately follow the structure. utf8_length and wstr_length can be found
    in the length field; the utf8 pointer is equal to the data pointer. */

...

    Legacy strings are created by PyUnicode_FromUnicode() and
    PyUnicode_FromStringAndSize(NULL, size) functions. They become ready
    when PyUnicode_READY() is called.

...

    /* Non-ASCII strings allocated through PyUnicode_New use the
    PyCompactUnicodeObject structure. state.compact is set, and the data
    immediately follow the structure. */
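
One visible consequence of the two structures, as a rough sketch
only: it assumes CPython 3.3 or later and that sys.getsizeof counts
the struct plus the inline character data. fixed_overhead is a helper
name made up for this illustration, not part of any API.

```python
import sys

def fixed_overhead(s, width):
    """Bytes sys.getsizeof reports beyond the character data itself.

    'width' is the assumed number of bytes per character (1 here).
    """
    return sys.getsizeof(s) - len(s) * width

# Both strings store 1 byte per character, but only the first is pure ASCII.
ascii_overhead = fixed_overhead("a" * 10, 1)
latin1_overhead = fixed_overhead("\xe9" * 10, 1)

# The non-ASCII string should carry the larger header
# (PyCompactUnicodeObject rather than PyASCIIObject).
print(ascii_overhead, latin1_overhead)
```

The exact numbers depend on the platform and the Python version, so
only the relative order is meaningful.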

Since I'm not sure that this is clear, note that compact vs. legacy
does not describe which character width is used (except that
PyASCIIObject strings are always 1 byte wide).  Legacy and compact
strings can each use the 1, 2, or 4 byte representations.  "Compact"
merely denotes that the character data is stored inline with the
struct (as opposed to being stored somewhere else and pointed at by
the struct), not the relative size of the string data.  Again from the
comments:

    Compact strings use only one memory block (structure + characters),
    whereas legacy strings use one block for the structure and one block
    for characters.
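
The character width, as opposed to compactness, can be observed from
Python itself.  Here is a rough sketch, assuming CPython 3.3 or later,
where sys.getsizeof reflects the inline character data.
bytes_per_char is just a helper name I made up for illustration.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate storage per character by comparing two string lengths.

    Subtracting the sizes cancels the fixed struct overhead, leaving
    only the growth caused by n extra characters.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

print(bytes_per_char("a"))           # ASCII
print(bytes_per_char("\xe9"))        # Latin-1 range
print(bytes_per_char("\u20ac"))      # rest of the BMP
print(bytes_per_char("\U0001F600"))  # outside the BMP
```

With the flexible representation I would expect this to print 1, 1, 2
and 4.  None of this tells you whether a string is compact or legacy;
as far as I know that distinction is not visible from pure Python.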

Cheers,
Ian