[Python-Dev] PEP 393: Flexible String Representation

Thu Jan 27 22:37:32 CET 2011

On 25.01.2011 12:08, Nick Coghlan wrote:
> On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
>> representation. It is thus identical to the existing
>> _PyUnicode_AsString, which is removed. The function will compute the
>> utf8 representation when first called. Since this representation will
>> consume memory until the string object is released, applications
>> should use the existing PyUnicode_AsUTF8String where possible
>> (which generates a new string object every time). API that implicitly
>> converts a string to a char* (such as the ParseTuple functions) will
>> use this function to compute a conversion.
> 
> I'm not entirely clear as to what "this function" is referring to here.

PyUnicode_AsUTF8 (i.e. the one where you don't need to release the
memory). I made this explicit now.
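
To illustrate the difference between the two access patterns, here is a
toy model in plain C. It is only a sketch: toy_str and all toy_* names
are made up for this mail, the struct is not the real object layout, and
the encoder does no validation (e.g. of lone surrogates).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string object (not PyUnicodeObject). */
typedef struct {
    const uint32_t *code_points; /* canonical data, one entry per code point */
    size_t length;               /* number of code points */
    char *utf8;                  /* cached UTF-8 bytes, NULL until requested */
    size_t utf8_length;          /* valid only when utf8 != NULL */
} toy_str;

/* Encode one code point into out; returns the number of bytes written. */
static size_t encode_utf8(uint32_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* "New object every time" pattern: the caller owns and frees the result. */
char *toy_encode_utf8(const toy_str *s, size_t *out_len)
{
    assert(s != NULL);
    char *buf = malloc(s->length * 4 + 1); /* worst case: 4 bytes per char */
    if (buf == NULL)
        return NULL;
    size_t n = 0;
    for (size_t i = 0; i < s->length; i++)
        n += encode_utf8(s->code_points[i], buf + n);
    buf[n] = '\0';
    if (out_len != NULL)
        *out_len = n;
    return buf;
}

/* "Cached" pattern: computed on first call, then kept alive (and
   consuming memory) until the string itself is released. */
const char *toy_as_utf8(toy_str *s)
{
    assert(s != NULL);
    if (s->utf8 == NULL)
        s->utf8 = toy_encode_utf8(s, &s->utf8_length);
    return s->utf8; /* borrowed pointer; the caller must not free it */
}

/* Releasing the string is the only point where the cache goes away. */
void toy_release(toy_str *s)
{
    free(s->utf8);
    s->utf8 = NULL;
    s->utf8_length = 0;
}
```

Calling toy_as_utf8 twice returns the same pointer, whereas each call to
toy_encode_utf8 allocates a fresh buffer that the caller can free as soon
as it is done with it.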

> I'm also dubious of the "PyUnicode_Finalize" name - "PyUnicode_Ready"
> might be a better option (PyType_Ready seems a better analogy for a
> "I've filled everything in, please calculate the derived fields now"
> than Py_Finalize).

Ok, changed (when I was pondering this PEP, this occurred to me
too, but I had forgotten it by the time I typed it in).

> 
> More generally, let me see if I understand the proposed structure correctly:
> 
> str: Always set once PyUnicode_Ready() has been called.
>   Always points to the canonical representation of the string (as
> indicated by PyUnicode_Kind)
> length: Always set once PyUnicode_Ready() has been called. Specifies
> the number of code points in the string.

Correct.

> wstr: Set only if PyUnicode_AsUnicode has been called on the string.

Might also be set when the string was created through
PyUnicode_FromUnicode and PyUnicode_Ready hasn't been called yet.

>     If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE)
> or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE), wstr
> = str, otherwise wstr points to dedicated memory
> wstr_length: Valid only if wstr != NULL
>     If wstr_length != length, indicates presence of surrogate pairs in
> a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() ==
> PyUnicode_4BYTE).

Correct.
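
As a small sketch of the two rules just quoted (when wstr can alias str,
and why wstr_length can differ from length), assuming made-up toy_*
names and an enum whose values are simply the bytes per character:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kind values; each value equals the bytes per character. */
enum toy_kind { TOY_1BYTE = 1, TOY_2BYTE = 2, TOY_4BYTE = 4 };

/* wstr can point at the canonical buffer only when wchar_t has exactly
   the width of the canonical representation. */
int toy_wstr_can_share(enum toy_kind kind, size_t wchar_size)
{
    return (wchar_size == 2 && kind == TOY_2BYTE)
        || (wchar_size == 4 && kind == TOY_4BYTE);
}

/* Number of wchar_t units needed for the string. With a 2-byte wchar_t
   every code point above 0xFFFF takes a surrogate pair, so the result
   can exceed the number of code points. */
size_t toy_wstr_length(const uint32_t *cps, size_t length, size_t wchar_size)
{
    assert(wchar_size == 2 || wchar_size == 4);
    size_t units = length;
    if (wchar_size == 2) {
        for (size_t i = 0; i < length; i++) {
            if (cps[i] > 0xFFFF)
                units++; /* one extra unit for the second surrogate */
        }
    }
    return units;
}
```

For a three-code-point string containing one non-BMP character, this
gives a wstr_length of 4 with a 2-byte wchar_t and 3 with a 4-byte one.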

> utf8: Set only if PyUnicode_AsUTF8 has been called on the string.
>     If string contents are pure ASCII, utf8 = str, otherwise utf8
> points to dedicated memory.
> utf8_length: Valid only if utf8 != NULL

Correct.

> One change I would propose is that rather than hiding flags in the low
> order bits of the str pointer, we expand the use of the existing
> "state" field to cover the representation information in addition to
> the interning information.

Thanks for the idea; done.
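
For illustration, packing both pieces of information into one state
byte could look like the sketch below. The bit layout (two bits of
interning state, two bits of kind) and the toy_* names are assumptions
made up for this example, not the layout the PEP commits to.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 0-1 hold the interning state, bits 2-3 the kind. */
#define TOY_INTERN_MASK 0x03u
#define TOY_KIND_SHIFT  2
#define TOY_KIND_MASK   (0x03u << TOY_KIND_SHIFT)

/* Combine interning state and kind into one byte. */
uint8_t toy_state_pack(unsigned interned, unsigned kind)
{
    assert(interned <= 3 && kind <= 3);
    return (uint8_t)(interned | (kind << TOY_KIND_SHIFT));
}

/* Extract the interning state without touching the kind bits. */
unsigned toy_state_interned(uint8_t state)
{
    return state & TOY_INTERN_MASK;
}

/* Extract the kind without touching the interning bits. */
unsigned toy_state_kind(uint8_t state)
{
    return (state & TOY_KIND_MASK) >> TOY_KIND_SHIFT;
}
```

Compared with tagging the low-order bits of the str pointer, the pointer
stays usable as-is and no masking is needed before dereferencing it.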

> I would also suggest explicitly flagging
> internally whether or not a 1 byte string is ASCII or Latin-1 along
> the lines of:

Not sure about that. It would complicate PyUnicode_Kind.

Instead, I'd rather fill out utf8 right away if we can use sharing
(e.g. when the string is created with a max value <128, or
PyUnicode_Ready has determined that).
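
A minimal sketch of that check, assuming a Ready-like pass that scans
the code points once (toy_ready and toy_ready_info are hypothetical
names, not part of the proposed API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result of a Ready-like scan. */
typedef struct {
    unsigned bytes_per_char; /* 1, 2 or 4, chosen from the maximum value */
    int utf8_shareable;      /* nonzero if utf8 could simply alias str */
} toy_ready_info;

toy_ready_info toy_ready(const uint32_t *cps, size_t length)
{
    assert(cps != NULL || length == 0);

    /* One pass to find the largest code point in the string. */
    uint32_t max_char = 0;
    for (size_t i = 0; i < length; i++) {
        if (cps[i] > max_char)
            max_char = cps[i];
    }

    toy_ready_info info;
    if (max_char < 0x100)
        info.bytes_per_char = 1;
    else if (max_char < 0x10000)
        info.bytes_per_char = 2;
    else
        info.bytes_per_char = 4;

    /* Pure ASCII: the 1-byte data is already valid UTF-8, so the utf8
       pointer can be filled in right away without a separate buffer. */
    info.utf8_shareable = (max_char < 128);
    return info;
}
```

Note that a Latin-1 string such as "é" still uses one byte per character
but is not shareable, since its UTF-8 form needs two bytes.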

So I keep it for the moment as reserved (but would use it when
str is NULL, as I'd have to fill in some value, anyway).

Regards,
Martin