[Python-Dev] PEP 393: Flexible String Representation

Nick Coghlan ncoghlan at gmail.com
Wed Jan 26 13:30:37 CET 2011


On Wed, Jan 26, 2011 at 11:50 AM, Dj Gilcrease <digitalxero at gmail.com> wrote:
> On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>> I also don't see how this could save a lot of memory. As an example
>> take a French text with say 10mio code points. This would end up
>> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
>> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
>> on how many accents are used). That's a saving of -10MB compared to
>> today's implementation :-)
>
> If I am reading the pep right, which I may not be as I am no expert on
> unicode, the new implementation would actually give a 10MB saving
> since the wchar field is optional, so only the str (Latin-1) and utf8
> fields would need to be stored. How it decides not to store one field
> or another would need to be clarified in the pep if I am right.

The PEP actually does define that already:

PyUnicode_AsUTF8 populates the utf8 field of the existing string,
while PyUnicode_AsUTF8String creates a *new* string with that field
populated.

PyUnicode_AsUnicode will populate the wstr field (but doing so
generally shouldn't be necessary).

For a UCS4 build, my reading of the PEP puts the memory savings for a
100 code point string as follows:

Current size: 400 bytes (regardless of max code point)

New initial size (max code point < 256): 100 bytes (75% saving)
New initial size (max code point < 65536): 200 bytes (50% saving)
New initial size (max code point >= 65536): 400 bytes (no saving)
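
As a rough sketch of that arithmetic (my own illustration, not code from
the PEP): the helper name initial_size is hypothetical, it counts only the
character data and ignores the object header and any terminator, and it
assumes len() counts code points.

```python
def initial_size(text: str) -> int:
    """Bytes of character data for `text` under my reading of the PEP.

    Assumption: only the canonical representation is counted; the object
    header and any trailing terminator are ignored.
    """
    max_cp = max(map(ord, text), default=0)
    if max_cp < 256:
        width = 1   # Latin-1 range: one byte per code point
    elif max_cp < 65536:
        width = 2   # BMP: two bytes per code point
    else:
        width = 4   # non-BMP present: four bytes per code point
    return width * len(text)


# 100 code point strings, one for each of the three cases above
print(initial_size("a" * 100))
print(initial_size("\u20ac" * 100))
print(initial_size("\U00010000" * 100))
```

A single wide character is enough to push the whole string into the next
size class, since the width is chosen from the maximum code point.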

Each of the "new" strings may consume additional storage if the utf8
or wstr fields get populated. The worst case would be a UCS4 string
(max code point >= 65536) on a sizeof(wchar_t) == 2 system with both
the wstr and utf8 fields populated. In that case, you would consume
at least 700 bytes (400 for the UCS4 data, 200 for wstr and 100 for
utf8), plus whatever additional memory is needed to encode the
non-BMP characters into UTF-8 and UTF-16.
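
To put a number on that worst case, here is a minimal sketch (again my own
illustration, with a hypothetical helper name). It assumes the string
contains at least one non-BMP character, sizeof(wchar_t) == 2 so that wstr
holds UTF-16, and it ignores headers and terminators.

```python
def fully_populated_size(text: str) -> int:
    """Character-data bytes with the utf8 and wstr fields both populated.

    Assumptions: the string holds a code point >= 65536 (so the canonical
    form is UCS4), wchar_t is 2 bytes (so wstr is UTF-16), and headers and
    terminators are ignored.
    """
    canonical = 4 * len(text)               # UCS4 data
    wstr = len(text.encode("utf-16-le"))    # 2 bytes, or 4 per non-BMP char
    utf8 = len(text.encode("utf-8"))        # 1 to 4 bytes per code point
    return canonical + wstr + utf8


# 99 ASCII characters plus one non-BMP character: 400 + 202 + 103 bytes
print(fully_populated_size("a" * 99 + "\U00010000"))
```

With a single non-BMP character the total stays close to the 700 byte
floor; a string made up entirely of non-BMP characters would need 4 bytes
per code point in all three fields.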

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

