string storage [was: Re: imaplib: is this really so unwieldy?]

Wed May 26 13:26:53 EDT 2021

On 5/26/2021 12:07 PM, Chris Angelico wrote:
> On Thu, May 27, 2021 at 1:59 AM Jon Ribbens via Python-list
> <python-list at python.org> wrote:
>>
>> On 2021-05-26, Alan Gauld <alan.gauld at yahoo.co.uk> wrote:
>>> On 25/05/2021 23:23, Terry Reedy wrote:
>>>> In CPython's Flexible String Representation all characters in a string
>>>> are stored with the same number of bytes, depending on the largest
>>>> codepoint.
>>>
>>> I'm learning lots of new things in this thread!
>>>
>>> Does that mean that if I give Python a UTF8 string that is mostly single
>>> byte characters but contains one 4-byte character that Python will store
>>> the string as all 4-byte characters?

Note that while Unix systems generally use UTF-8, Windows uses UTF-16.

>>> If so, doesn't that introduce a pretty big storage overhead for
>>> large strings?
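
One rough way to check this on your own interpreter is sketched below. It assumes CPython 3.3 or later; other implementations store strings differently. The helper name `storage_width` is made up for this example. It compares the size of a string with the size of the same string doubled, so that the fixed object header cancels out.

```python
import sys

def storage_width(s):
    """Estimate bytes used per character for the non-empty string s.

    Hypothetical helper for illustration; assumes CPython 3.3+ (PEP 393).
    s and s * 2 use the same internal representation, so the size
    difference is just len(s) characters' worth of storage.
    """
    return (sys.getsizeof(s * 2) - sys.getsizeof(s)) // len(s)

ascii_only = "a" * 1000
bmp_mixed = "a" * 999 + "\u20ac"          # one BMP character above U+00FF
astral_mixed = "a" * 999 + "\U0001F600"   # one non-BMP (astral) character

for name, text in [("ascii", ascii_only),
                   ("bmp", bmp_mixed),
                   ("astral", astral_mixed)]:
    print(name, storage_width(text), "bytes/char")
```

So yes, a single astral character widens every character in that one string, but strings that do not contain one are unaffected.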
>>
>> Memory is cheap ;-)
>>
> 
> This is true, but sometimes memory translates into time - either
> direction. When the Flexible String Representation came in, it was
> actually an alternative to using four bytes per character on ALL
> strings (not just those that contain non-BMP characters),

Except on Windows, where CPython used 2 bytes/char plus surrogate pairs
for non-BMP chars.  This meant that indexing did not quite work on
Windows, and that applications that allowed astral chars and wanted to
work on all systems had to have separate code for Windows and Unix-based
systems.
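
To illustrate the indexing problem, here is a small sketch that lists the UTF-16 code units of a string, which is roughly what the old 2-byte Windows builds indexed over. The function name `utf16_units` is invented for this example; it only uses the standard `str.encode` and `struct.unpack`, and it simulates the old behavior rather than reproducing it.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints.

    Illustration only: this approximates what a 2-bytes-per-char
    build would have indexed, using today's codecs.
    """
    data = s.encode("utf-16-le")          # little-endian, no BOM
    count = len(data) // 2                # each code unit is 2 bytes
    return list(struct.unpack("<%dH" % count, data))

s = "a\U0001F600b"
units = utf16_units(s)

# With the Flexible String Representation, len() counts code points.
print(len(s))
# In UTF-16 the astral char becomes a surrogate pair, so there is one
# more unit than there are characters and later indexes are shifted.
print([hex(u) for u in units])
```

With the current representation `s[2]` is `'b'`, whereas position 2 in the code-unit list is the second half of the surrogate pair.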

> and it
> actually improved performance quite notably, despite some additional
> complications.

And it made CPython text manipulation code work the same on all CPython
systems.

> Performance optimization is a funny science :)

-- 
Terry Jan Reedy