[Python-Dev] PEP 393: Flexible String Representation

Tue Jan 25 12:08:01 CET 2011

On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
> representation. It is thus identical to the existing
> _PyUnicode_AsString, which is removed. The function will compute the
> utf8 representation when first called. Since this representation will
> consume memory until the string object is released, applications
> should use the existing PyUnicode_AsUTF8String where possible
> (which generates a new string object every time). API that implicitly
> converts a string to a char* (such as the ParseTuple functions) will
> use this function to compute a conversion.

I'm not entirely clear as to what "this function" in the last sentence
is referring to here: PyUnicode_AsUTF8 or PyUnicode_AsUTF8String?

I'm also dubious of the "PyUnicode_Finalize" name - "PyUnicode_Ready"
might be a better option (PyType_Ready seems a better analogy than
Py_Finalize for an "I've filled everything in, please calculate the
derived fields now" operation).

More generally, let me see if I understand the proposed structure correctly:

str: Always set once PyUnicode_Ready() has been called.
    Always points to the canonical representation of the string (as
    indicated by PyUnicode_Kind).
length: Always set once PyUnicode_Ready() has been called.
    Specifies the number of code points in the string.

wstr: Set only if PyUnicode_AsUnicode has been called on the string.
    If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE)
    or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE),
    wstr = str, otherwise wstr points to dedicated memory.
wstr_length: Valid only if wstr != NULL.
    If wstr_length != length, indicates presence of surrogate pairs in
    a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() ==
    PyUnicode_4BYTE).

utf8: Set only if PyUnicode_AsUTF8 has been called on the string.
    If string contents are pure ASCII, utf8 = str, otherwise utf8
    points to dedicated memory.
utf8_length: Valid only if utf8 != NULL.
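
To make the sharing rules above concrete, here is a rough sketch of when
wstr and utf8 could simply alias str instead of pointing to dedicated
memory. The enum and both helper names are invented for illustration and
are not part of the PEP; the kind values are assumed to equal the
canonical width in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PEP's kind values; each value is
   assumed to be the width in bytes of one canonical code unit. */
enum layout_kind { LAYOUT_1BYTE = 1, LAYOUT_2BYTE = 2, LAYOUT_4BYTE = 4 };

/* Nonzero if wstr may alias str: the canonical width has to match
   sizeof(wchar_t) on the platform (2 or 4 in the cases discussed). */
int wstr_can_alias_str(enum layout_kind kind, size_t wchar_size)
{
    assert(wchar_size == 2 || wchar_size == 4);
    return (size_t)kind == wchar_size;
}

/* Nonzero if utf8 may alias str: only a 1-byte string whose contents
   are pure ASCII is already valid UTF-8 as stored. */
int utf8_can_alias_str(enum layout_kind kind, int is_ascii)
{
    return kind == LAYOUT_1BYTE && is_ascii;
}
```

In every other case the sketch implies a separate allocation, which is
where wstr_length and utf8_length become meaningful.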

One change I would propose is that rather than hiding flags in the low
order bits of the str pointer, we expand the use of the existing
"state" field to cover the representation information in addition to
the interning information. I would also suggest explicitly flagging
internally whether a 1-byte string is ASCII or Latin-1, along
the lines of:

/* Already existing string state constants */
#define SSTATE_NOT_INTERNED      0x00
#define SSTATE_INTERNED_MORTAL   0x01
#define SSTATE_INTERNED_IMMORTAL 0x02
/* New string state constants */
#define SSTATE_INTERN_MASK       0x03
#define SSTATE_KIND_ASCII        0x00
#define SSTATE_KIND_LATIN1       0x04
#define SSTATE_KIND_2BYTE        0x08
#define SSTATE_KIND_4BYTE        0x0C
#define SSTATE_KIND_MASK         0x0C

PyUnicode_Kind would then return PyUnicode_1BYTE for strings that were
flagged internally as either ASCII or LATIN1.
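
As a minimal sketch of how such a combined state field might be read and
updated, assuming the constant values suggested above: the SKETCH_* enum
and all four helper functions are hypothetical names used only for this
example, not proposed API.

```c
#include <assert.h>

#define SSTATE_NOT_INTERNED      0x00
#define SSTATE_INTERNED_MORTAL   0x01
#define SSTATE_INTERNED_IMMORTAL 0x02
#define SSTATE_INTERN_MASK       0x03
#define SSTATE_KIND_ASCII        0x00
#define SSTATE_KIND_LATIN1       0x04
#define SSTATE_KIND_2BYTE        0x08
#define SSTATE_KIND_4BYTE        0x0C
#define SSTATE_KIND_MASK         0x0C

/* Hypothetical public kinds, standing in for PyUnicode_1BYTE etc. */
enum sketch_width { SKETCH_1BYTE, SKETCH_2BYTE, SKETCH_4BYTE };

/* Interning information lives in the two low-order bits. */
unsigned sketch_intern_state(unsigned state)
{
    return state & SSTATE_INTERN_MASK;
}

/* Public kind: ASCII and Latin-1 both collapse to the 1-byte kind. */
enum sketch_width sketch_public_kind(unsigned state)
{
    switch (state & SSTATE_KIND_MASK) {
    case SSTATE_KIND_2BYTE: return SKETCH_2BYTE;
    case SSTATE_KIND_4BYTE: return SKETCH_4BYTE;
    default:                return SKETCH_1BYTE; /* ASCII or Latin-1 */
    }
}

/* Internal-only distinction between ASCII and Latin-1. */
int sketch_is_ascii(unsigned state)
{
    return (state & SSTATE_KIND_MASK) == SSTATE_KIND_ASCII;
}

/* Replace the kind bits while leaving the interning bits untouched. */
unsigned sketch_with_kind(unsigned state, unsigned kind_bits)
{
    assert((kind_bits & ~SSTATE_KIND_MASK) == 0);
    return (state & ~SSTATE_KIND_MASK) | kind_bits;
}
```

Because the two masks do not overlap, changing the representation kind
never disturbs the interning state, and vice versa.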

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia