[Python-Dev] PEP 393: Special-casing ASCII-only strings

Fri Sep 16 07:41:21 CEST 2011

On 16.09.11 00:42, Nick Coghlan wrote:
> On Fri, Sep 16, 2011 at 7:39 AM, "Martin v. Löwis"
> <martin at v.loewis.de> wrote:
>> Thinking about this, the following may work:
>>
>> - ASCIIObject: state, length, hash, wstr*, data follow
>>
>> - SingleBlockUnicode: ASCIIObject, wstr_len, utf8*, utf8_len, data
>> follow
>>
>> - UnicodeObject: SingleBlockUnicode, data pointer, no data follow
>>
>> This is essentially your proposal, except that the wstr_len is
>> dropped for ASCII strings, and that it uses nested structs.
>>
>> The single-block variants would always be "ready", the full unicode
>> object is ready only if the data pointer is set.
>
> In your "UnicodeObject" here, is the 'data pointer' the
> any/latin1/ucs2/ucs4 union from the original structure definition?

Yes, it is. I'm considering dropping the union again, since you'll
have to cast the data pointer anyway in the compact cases.
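
To make the nesting concrete, the three layouts might look roughly like
this in C. This is only a sketch: the Sketch* names, the field types and
the two helper functions are made up for illustration, and the usual
object header is left out. It also uses a plain void* for the data
pointer instead of the union, to show the cast I mean.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical layout; field order follows the list in the proposal. */
typedef struct {
    unsigned int state;      /* flags describing the representation */
    ptrdiff_t length;        /* number of characters */
    long hash;               /* cached hash, -1 if not computed */
    wchar_t *wstr;           /* legacy wide-character copy, or NULL */
    /* for ASCII-only strings the character data follows here */
} SketchASCIIObject;

typedef struct {
    SketchASCIIObject base;  /* nested struct, so it sits at offset 0 */
    ptrdiff_t wstr_length;
    char *utf8;
    ptrdiff_t utf8_length;
    /* character data follows here */
} SketchSingleBlockUnicode;

typedef struct {
    SketchSingleBlockUnicode base;
    void *data;              /* separate block; cast according to width */
} SketchUnicodeObject;

/* Allocate header and ASCII data in one memory block. */
static SketchASCIIObject *sketch_ascii_new(const char *text)
{
    assert(text != NULL);
    size_t n = strlen(text);
    SketchASCIIObject *o = malloc(sizeof *o + n + 1);
    if (o == NULL)
        return NULL;
    o->state = 0;
    o->length = (ptrdiff_t)n;
    o->hash = -1;
    o->wstr = NULL;
    memcpy(o + 1, text, n + 1);   /* data starts right after the header */
    return o;
}

/* The data pointer is computed, not stored: it is the end of the header. */
static char *sketch_ascii_data(SketchASCIIObject *o)
{
    return (char *)(o + 1);
}
```

Since each struct starts with the previous one, a pointer to the full
object can be passed to code that only knows about the smaller header.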

> Also, what are the constraints on the "SingleBlockUnicode"? Does it
> only hold strings that can be represented in latin1? Or can the size
> of the individual elements be more than 1 byte?

Any size - what matters is whether the maximum character is known
at creation time (i.e. whether you've used PyUnicode_New(size, maxchar)
or PyUnicode_FromUnicode(NULL, size)). In the latter case, a Py_UNICODE
block will be allocated in wstr, and the data pointer left NULL.
Then, when PyUnicode_Ready is called, the maximum character is
determined in the Py_UNICODE block, and a new data block allocated -
but that will have to be a second memory block (the Py_UNICODE
block is then dropped in _Ready).
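
The legacy path and the _Ready step could be modelled like this. Again
just a sketch under assumptions: SketchStr and the sketch_* functions
are invented names, the Py_UNICODE block is stood in for by a 4-byte
uint32_t array, and the single-block case where the maximum character
is known up front is not modelled at all.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    size_t length;
    int kind;         /* bytes per character (1, 2 or 4); 0 while not ready */
    uint32_t *wstr;   /* stand-in for the Py_UNICODE block */
    void *data;       /* canonical data; NULL until the string is ready */
} SketchStr;

/* Models the case where the maximum character is not known at creation. */
static SketchStr *sketch_new_legacy(size_t size)
{
    SketchStr *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->length = size;
    s->kind = 0;
    s->data = NULL;                               /* not ready yet */
    s->wstr = calloc(size + 1, sizeof *s->wstr);  /* caller fills this in */
    if (s->wstr == NULL) {
        free(s);
        return NULL;
    }
    return s;
}

/* Models the ready step: scan for the maximum character, allocate a
   second block of the narrowest sufficient width, copy, drop wstr. */
static int sketch_ready(SketchStr *s)
{
    if (s->data != NULL)
        return 0;                                 /* already ready */
    assert(s->wstr != NULL);

    uint32_t maxchar = 0;
    for (size_t i = 0; i < s->length; i++)
        if (s->wstr[i] > maxchar)
            maxchar = s->wstr[i];

    int kind = maxchar < 0x100 ? 1 : (maxchar < 0x10000 ? 2 : 4);
    void *data = calloc(s->length + 1, (size_t)kind);  /* zero-terminated */
    if (data == NULL)
        return -1;

    for (size_t i = 0; i < s->length; i++) {
        uint32_t ch = s->wstr[i];
        if (kind == 1)
            ((uint8_t *)data)[i] = (uint8_t)ch;
        else if (kind == 2)
            ((uint16_t *)data)[i] = (uint16_t)ch;
        else
            ((uint32_t *)data)[i] = ch;
    }

    free(s->wstr);                                /* dropped once ready */
    s->wstr = NULL;
    s->data = data;
    s->kind = kind;
    return 0;
}

static void sketch_free(SketchStr *s)
{
    free(s->wstr);
    free(s->data);
    free(s);
}
```

The point of the sketch is that the width can only be chosen after the
scan, which is why the data cannot share a block with the header here.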

Regards,
Martin