string storage [was: Re: imaplib: is this really so unwieldy?]

Wed May 26 09:09:01 EDT 2021

On 2021-05-26 08:18, Alan Gauld via Python-list wrote:
> Does that mean that if I give Python a UTF8 string that is mostly
> single byte characters but contains one 4-byte character that
> Python will store the string as all 4-byte characters?

As best I understand it, yes:  the cost of each "character" in a
string is the same for the entire string, so even one lone 4-byte
character in an otherwise 1-byte-character string is enough to push
the whole string to 4-byte characters.  It doesn't affect other
strings though (so if you had a pure 7-bit string and a Unicode
string, the former would still be 1-byte-per-char…it's not a global
aspect).
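
A quick way to check this for yourself, assuming CPython 3.3 or later
(where the flexible string representation from PEP 393 applies; other
implementations may store strings differently).  The variable names
are just for illustration, and the exact byte counts vary by version
and platform, so I only compare them rather than quote numbers:

```python
import sys

n = 1000

# Every code point fits in 7 bits, so CPython can use 1 byte per char.
ascii_only = "a" * n
size_ascii = sys.getsizeof(ascii_only)

# Same length, but the last code point (U+1F600) is outside the BMP,
# which forces the widest (4 bytes per char) layout for this string.
one_wide = "a" * (n - 1) + "\U0001F600"
size_wide = sys.getsizeof(one_wide)

print(len(ascii_only), size_ascii)
print(len(one_wide), size_wide)

# The ASCII-only string is unaffected by the existence of the wide one.
print(sys.getsizeof(ascii_only) == size_ascii)
```

On my understanding, the second size should come out at roughly four
times the first, even though only one character differs.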

If you encode these to a UTF-8 byte-string, you'll get the space
savings you seek, but at the cost of sensible O(1) indexing by
character.
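
A small sketch of that trade-off, again assuming CPython 3.3+ for the
size comparison (the names `text` and `data` are mine, nothing
standard):

```python
import sys

text = "a" * 999 + "\U0001F600"   # 1000 code points
data = text.encode("utf-8")       # 999 one-byte chars + one 4-byte char

# The bytes object is much smaller than the 4-bytes-per-char str.
print(len(text), sys.getsizeof(text))
print(len(data), sys.getsizeof(data))

# Indexing the str gives you the character directly...
print(text[999] == "\U0001F600")

# ...but indexing the bytes gives you a single byte (an int), here the
# first byte of the 4-byte UTF-8 sequence, not the character.
print(hex(data[999]))

# To get the Nth character back you have to decode, which means
# walking the bytes rather than jumping straight to an offset.
print(data.decode("utf-8")[999] == text[999])
```

So the bytes form is the compact one, but anything position-based has
to go through a decode (or your own scan of the UTF-8 sequence).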

Either way it's a trade-off, and if your data consists mostly of
7-bit ASCII characters, or lots of small strings, the overhead is
less pronounced than if you have one single large blob of text as a
string.

> If so, doesn't that introduce a pretty big storage overhead for
> large strings?

Yes.  Though such large strings tend to be rarer, largely because
they become unwieldy for other reasons.

-tkc