[Python-Dev] PEP 393: Flexible String Representation

"Martin v. Löwis" martin at v.loewis.de
Thu Jan 27 22:37:32 CET 2011


On 25.01.2011 12:08, Nick Coghlan wrote:
> On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
>> representation. It is thus identical to the existing
>> _PyUnicode_AsString, which is removed. The function will compute the
>> utf8 representation when first called. Since this representation will
>> consume memory until the string object is released, applications
>> should use the existing PyUnicode_AsUTF8String where possible
>> (which generates a new string object every time). API that implicitly
>> converts a string to a char* (such as the ParseTuple functions) will
>> use this function to compute a conversion.
> 
> I'm not entirely clear as to what "this function" is referring to here.

PyUnicode_AsUTF8 (i.e. the one where you don't need to release the
memory). I made this explicit now.
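
To illustrate the difference between the two access styles, here is a
minimal sketch in plain C. It is a toy model, not the CPython
implementation: ToyStr, encode_utf8, toy_as_utf8 and toy_as_utf8_copy
are made-up names, and the canonical data is assumed to be Latin-1 so
that the encoder stays short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string object; not CPython's real layout. */
typedef struct {
    const unsigned char *latin1; /* canonical 1-byte data (code points 0..255) */
    size_t length;               /* number of code points */
    char *utf8;                  /* cached UTF-8, NULL until first requested */
    size_t utf8_length;          /* valid only if utf8 != NULL */
} ToyStr;

/* Encode Latin-1 into a freshly malloc'ed, NUL-terminated UTF-8 buffer. */
static char *encode_utf8(const unsigned char *s, size_t n, size_t *out_len)
{
    char *buf = malloc(2 * n + 1); /* a Latin-1 char needs at most 2 bytes */
    size_t j = 0;
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        if (s[i] < 0x80) {
            buf[j++] = (char)s[i];
        } else {
            buf[j++] = (char)(0xC0 | (s[i] >> 6));
            buf[j++] = (char)(0x80 | (s[i] & 0x3F));
        }
    }
    buf[j] = '\0';
    *out_len = j;
    return buf;
}

/* Cached style: computed on first call, owned by the object until it dies. */
static const char *toy_as_utf8(ToyStr *s)
{
    assert(s != NULL);
    if (s->utf8 == NULL)
        s->utf8 = encode_utf8(s->latin1, s->length, &s->utf8_length);
    return s->utf8;
}

/* Copying style: a new buffer on every call; the caller must free it. */
static char *toy_as_utf8_copy(const ToyStr *s, size_t *out_len)
{
    assert(s != NULL);
    return encode_utf8(s->latin1, s->length, out_len);
}
```

In this model, the cached variant returns the same pointer on every
call and keeps the memory alive with the object, while the copying
variant costs an allocation per call but leaves nothing behind once
the caller frees the result.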

> I'm also dubious of the "PyUnicode_Finalize" name - "PyUnicode_Ready"
> might be a better option (PyType_Ready seems a better analogy for a
> "I've filled everything in, please calculate the derived fields now"
> than Py_Finalize).

Ok, changed (when I was pondering this PEP, this once occurred to
me as well, but I forgot it when I typed it in).

> 
> More generally, let me see if I understand the proposed structure correctly:
> 
> str: Always set once PyUnicode_Ready() has been called.
>   Always points to the canonical representation of the string (as
> indicated by PyUnicode_Kind)
> length: Always set once PyUnicode_Ready() has been called. Specifies
> the number of code points in the string.

Correct.

> wstr: Set only if PyUnicode_AsUnicode has been called on the string.

It might also be set when the string was created through
PyUnicode_FromUnicode and PyUnicode_Ready hasn't been called yet.

>     If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE)
> or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE), wstr
> = str, otherwise wstr points to dedicated memory
> wstr_length: Valid only if wstr != NULL
>     If wstr_length != length, indicates presence of surrogate pairs in
> a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() ==
> PyUnicode_4BYTE).

Correct.
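
The wstr rules quoted above can be sketched as two small helpers. This
is only an illustration in plain C: ToyKind, toy_wstr_can_share and
toy_wstr_length are hypothetical names, and the code points are passed
in as a uint32_t array rather than read from a real string object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kind constants; the value is the unit width in bytes. */
typedef enum { TOY_1BYTE = 1, TOY_2BYTE = 2, TOY_4BYTE = 4 } ToyKind;

/* wstr may alias str only when wchar_t is as wide as the canonical units. */
static int toy_wstr_can_share(ToyKind kind, size_t wchar_size)
{
    assert(wchar_size == 2 || wchar_size == 4);
    return (wchar_size == 2 && kind == TOY_2BYTE) ||
           (wchar_size == 4 && kind == TOY_4BYTE);
}

/* Number of wchar_t units needed to hold `length` code points. */
static size_t toy_wstr_length(const uint32_t *cps, size_t length,
                              size_t wchar_size)
{
    size_t units = length;
    assert(wchar_size == 2 || wchar_size == 4);
    if (wchar_size == 2) {
        for (size_t i = 0; i < length; i++) {
            if (cps[i] > 0xFFFF)
                units++; /* a non-BMP code point needs a surrogate pair */
        }
    }
    return units;
}
```

With a 2-byte wchar_t, a string containing a non-BMP code point yields
a unit count larger than its code point count, which is the
wstr_length != length case described above.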

> utf8: Set only if PyUnicode_AsUTF8 has been called on the string.
>     If string contents are pure ASCII, utf8 = str, otherwise utf8
> points to dedicated memory.
> utf8_length: Valid only if utf8_ptr != NULL

Correct.

> One change I would propose is that rather than hiding flags in the low
> order bits of the str pointer, we expand the use of the existing
> "state" field to cover the representation information in addition to
> the interning information.

Thanks for the idea; done.

> I would also suggest explicitly flagging
> internally whether or not a 1 byte string is ASCII or Latin-1 along
> the lines of:

Not sure about that. It would complicate PyUnicode_Kind.

Instead, I'd rather fill out utf8 right away if we can use sharing
(e.g. when the string is created with a max value <128, or
PyUnicode_Ready has determined that).
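
A minimal sketch of that sharing idea, in plain C and under stated
assumptions: ToyStr1, toy_max_char and toy_ready are hypothetical
names, the string is assumed to be in the 1-byte representation, and
the real PyUnicode_Ready does considerably more than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1-byte string object; not CPython's real layout. */
typedef struct {
    unsigned char *str;  /* canonical 1-byte representation */
    size_t length;       /* number of code points */
    const char *utf8;    /* NULL until known; may alias str */
    size_t utf8_length;  /* valid only if utf8 != NULL */
} ToyStr1;

/* Largest code point in a 1-byte buffer (0 for an empty buffer). */
static unsigned toy_max_char(const unsigned char *s, size_t n)
{
    unsigned max = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] > max)
            max = s[i];
    }
    return max;
}

/* If every code point is below 128, the UTF-8 bytes are identical to the
 * canonical bytes, so utf8 can simply point at str. Returns 1 if shared. */
static int toy_ready(ToyStr1 *s)
{
    assert(s != NULL);
    if (toy_max_char(s->str, s->length) < 128) {
        s->utf8 = (const char *)s->str;
        s->utf8_length = s->length;
        return 1;
    }
    return 0; /* non-ASCII: utf8 stays NULL until it is actually needed */
}
```

In this model no separate ASCII flag is needed: whether utf8 aliases
str already tells the two cases apart.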

So for the moment I am keeping it as reserved (but I would use it when
str is NULL, as I'd have to fill in some value anyway).

Regards,
Martin

