[Python-Dev] PEP 393: Flexible String Representation

Tue Jan 25 01:28:43 CET 2011

On Mon, 2011-01-24 at 21:17 +0100, "Martin v. Löwis" wrote:

... snip ...

> I'd like to propose PEP 393, which takes a different approach,
> addressing both problems simultaneously: by getting a flexible
> representation (one that can be either 1, 2, or 4 bytes), we can
> support the full range of Unicode on all systems, but still use
> only one byte per character for strings that are pure ASCII (which
> will be the majority of strings for the majority of users).

There was some discussion about this at PyCon 2010, where we referred to
it casually as "Pay-as-you-go unicode".

... snip ...

> - str: shortest-form representation of the unicode string; the lower
>   two bits of the pointer indicate the specific form:
>   01 => 1 byte (Latin-1); 11 => 2 byte (UCS-2); 11 => 4 byte (UCS-4);
Repetition of "11"; I'm guessing that the 2byte/UCS-2 should read "10",
so that they give the width of the char representation.

>   00 => null pointer

Naturally this assumes that all pointers are at least 4-byte aligned (so
that the low two bits are free and can be masked off).  I assume that
this is sane on every platform that Python supports, but should it be
spelled out explicitly somewhere in the PEP?
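
To make the alignment assumption concrete, here is a minimal sketch of
how I imagine the tagging working.  None of these names (tag_ptr,
ptr_kind, untag_ptr, KIND_*) come from the PEP; they are made up for
illustration, and the value 2 for the 2-byte form is only my guess from
the comment above about "10".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag values; 2 ("10") for UCS-2 is a guess, not PEP text. */
enum { KIND_NULL = 0, KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 3 };

#define TAG_MASK ((uintptr_t)3)   /* the low two bits of the pointer */

/* Store the kind in the low two bits of a 4-byte-aligned pointer. */
static void *tag_ptr(void *p, unsigned kind)
{
    uintptr_t bits = (uintptr_t)p;
    assert((bits & TAG_MASK) == 0);   /* the alignment assumption */
    assert(kind <= KIND_4BYTE);
    return (void *)(bits | kind);
}

/* Read the kind back out of a tagged pointer. */
static unsigned ptr_kind(const void *tagged)
{
    return (unsigned)((uintptr_t)tagged & TAG_MASK);
}

/* Clear the tag bits to recover the real buffer address. */
static void *untag_ptr(void *tagged)
{
    return (void *)((uintptr_t)tagged & ~TAG_MASK);
}
```

The first assert in tag_ptr is the part that would silently break on a
platform handing out buffers with less than 4-byte alignment.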

> 
>   The string is null-terminated (in its respective representation).
> - hash, state: same as in Python 3.2
> - utf8_length, utf8: UTF-8 representation (null-terminated)
If this is to share its buffer with the "str" representation for the
Latin-1 case, then I take it this ptr will typically be (str & ~3) ?
i.e. only "str" has the low-order-bit type info.

> - wstr_length, wstr: representation in platform's wchar_t
>   (null-terminated). If wchar_t is 16-bit, this form may use surrogate
>   pairs (in which case wstr_length differs from length).
> 
> All three representations are optional, although the str form is
> considered the canonical representation which can be absent only
> while the string is being created.

Spelling out the meaning of "optional":
  does this mean that the relevant ptr is NULL; if so, if utf8 is null,
is utf8_length undefined, or is it some dummy value?  (i.e. is the
pointer the first thing to check before we know if utf8_length is
meaningful?); similar consideration for the wstr representation.
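
To illustrate the reading I have in mind (pointer first, length second),
here is a small sketch.  The struct and function names are hypothetical
and only mirror the two field names from the PEP; the rule that
utf8_length is meaningless while utf8 is NULL is my assumption, not
something the PEP states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down struct holding just the UTF-8 fields. */
typedef struct {
    const char *utf8;     /* NULL while the UTF-8 form is absent */
    size_t utf8_length;   /* assumed meaningful only if utf8 != NULL */
} sketch_utf8;

/* Returns 1 and fills *len if the UTF-8 form is present, else 0. */
static int sketch_utf8_length(const sketch_utf8 *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    if (s->utf8 == NULL)
        return 0;         /* utf8_length is never read in this case */
    *len = s->utf8_length;
    return 1;
}
```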

> The Py_UNICODE type is still supported but deprecated. It is always
> defined as a typedef for wchar_t, so the wstr representation can double
> as Py_UNICODE representation.
> 
> The str and utf8 pointers point to the same memory if the string uses
> only ASCII characters (using only Latin-1 is not sufficient). The str
...though the ptrs are non-equal for this case, as noted above, as "str"
has a 0x1 typecode.

> and wstr pointers point to the same memory if the string happens to
> fit exactly to the wchar_t type of the platform (i.e. uses some
> BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
> non-BMP characters if sizeof(wchar_t) is 4).
> 
> If the string is created directly with the canonical representation
> (see below), this representation doesn't take a separate memory block,
> but is allocated right after the PyUnicodeObject struct.

Is the idea to do pointer arithmetic when deleting the PyUnicodeObject
to determine if the ptr is in that location, and not delete it if it is,
or is there some other way of determining whether the pointers need
deallocating?  If the former, is this embedding an assumption that the
underlying allocator couldn't have allocated a buffer directly adjacent
to the PyUnicodeObject?  I know that GNU libc's malloc/free
implementation has gaps of two machine words between each allocation;
off the top of my head I'm not sure if the optimized Objects/obmalloc.c
allocator enforces such gaps.
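
For reference, this is roughly the pointer-arithmetic check I am asking
about.  It is only a sketch: sketch_unicode and the sketch_* functions
are hypothetical stand-ins, not the PEP's struct or API, and plain
malloc/free stand in for whatever allocator would really be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the object header. */
typedef struct {
    size_t length;
    void *str;        /* untagged here, to keep the sketch simple */
} sketch_unicode;

/* Allocate the header and the character data in a single block. */
static sketch_unicode *sketch_new_inline(size_t nbytes)
{
    sketch_unicode *u = malloc(sizeof(sketch_unicode) + nbytes);
    if (u == NULL)
        return NULL;
    u->length = nbytes;
    u->str = (char *)(u + 1);   /* data starts right after the struct */
    return u;
}

/* The address comparison: does str point directly past the header? */
static int sketch_is_inline(const sketch_unicode *u)
{
    assert(u != NULL);
    return (const char *)u->str == (const char *)(u + 1);
}

/* Free a separately allocated buffer only if it is not inline. */
static void sketch_free(sketch_unicode *u)
{
    if (u == NULL)
        return;
    if (!sketch_is_inline(u))
        free(u->str);
    free(u);
}
```

The comparison in sketch_is_inline is only sound if the allocator can
never return a separate buffer that starts at exactly that address,
which is the assumption I am unsure about.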

... snip ...

Extra section:

GDB Debugging Hooks
-------------------
Tools/gdb/libpython.py contains debugging hooks that embed knowledge
about the internals of CPython's data types, including PyUnicodeObject
instances.  It will need to be slightly updated to track the change.

(I can do that change if need be; it shouldn't be too hard).

Hope this is helpful.
Dave