imaplib: is this really so unwieldy?

Tue May 25 18:34:28 EDT 2021

On Wed, May 26, 2021 at 8:27 AM Grant Edwards <grant.b.edwards at gmail.com> wrote:
>
> On 2021-05-25, MRAB <python at mrabarnett.plus.com> wrote:
> > On 2021-05-25 16:41, Dennis Lee Bieber wrote:
>
> >> In Python 3, strings are UNICODE, using 1, 2, or 4 bytes PER
> >> CHARACTER (I don't recall if there is a 3-byte version). If your
> >> input bytes are all 7-bit ASCII, then they map directly to a 1-byte
> >> per character string. If they contain any 8-bit upper half
> >> character they may map into a 2-byte per character string.
> >>
> > In CPython 3.3+:
> >
> > U+0000..U+00FF are stored in 1 byte.
> > U+0100..U+FFFF are stored in 2 bytes.
> > U+010000..U+10FFFF are stored in 4 bytes.
>
> Are all characters in a string stored with the same "width"? IOW, does
> the presence of one Unicode character in the range U+010000..U+10FFFF
> in a string that is otherwise all 7-bit ASCII values result in the
> entire string being stored 4-bytes per character? Or is the storage
> width variable within a single string?
>

Yes, any given string has a single width, which makes indexing fast.
The memory cost you're describing can happen, but apart from a BOM
(U+FEFF) widening an otherwise-ASCII string to 2 bytes per character,
there aren't many cases where you'll get a single wide character in a
narrow string. Usually,
if there are any wide characters, there'll be a good number of them
(for instance, text in any particular language will often have a lot
of characters from a block of characters allocated to it).
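
If you want to see the single-width behaviour for yourself, here is a
rough sketch. It assumes CPython 3.3+ and leans on sys.getsizeof(),
whose numbers are an implementation detail; bytes_per_char is just a
helper name made up for this example. It compares two repetitions of
the same text so the fixed object header cancels out.

```python
import sys

def bytes_per_char(s):
    """Estimate CPython's per-character storage for a non-empty str."""
    # s * 2 and s * 3 are freshly built strings of the same kind, so the
    # fixed header cancels and the difference is len(s) characters.
    return (sys.getsizeof(s * 3) - sys.getsizeof(s * 2)) // len(s)

# One wide character is enough to widen the whole string.
for sample in ("hello", "h\u00e9llo", "\ufeffhello", "hello\U0001F600"):
    print(ascii(sample), bytes_per_char(sample))
```

On CPython this should report 1, 1, 2 and 4 bytes per character for
the four samples. Other implementations are free to store strings
differently.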

As an added benefit, keeping all characters the same width simplifies
string searching algorithms, if I'm reading the code correctly. Checks
like >>"foo" in some_string<< can widen the string "foo" to the width
of the target string and then search efficiently.
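
As a conceptual model only (this is not CPython's actual code, and
widened_find is a hypothetical name), the idea can be imitated in pure
Python by putting both strings into the same fixed-width encoding and
searching the raw bytes. For simplicity the sketch always widens to 4
bytes via UTF-32 rather than picking the target's own width.

```python
def widened_find(haystack, needle):
    """Find needle in haystack by searching fixed-width code units."""
    width = 4  # UTF-32: every code point takes exactly 4 bytes
    hay = haystack.encode("utf-32-le")
    pat = needle.encode("utf-32-le")
    pos = hay.find(pat)
    # A byte-level hit only counts if it starts on a character boundary.
    while pos != -1 and pos % width:
        pos = hay.find(pat, pos + 1)
    # With a fixed width, byte offset maps straight to character index.
    return -1 if pos == -1 else pos // width

text = "caf\u00e9 \U0001F600 foo"
print(widened_find(text, "foo"), text.find("foo"))
```

Both numbers printed should agree. Strings containing lone surrogates
would need extra handling, since they can't be encoded to UTF-32 with
the default error handler.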

ChrisA