string storage [was: Re: imaplib: is this really so unwieldy?]

Wed May 26 08:31:41 EDT 2021

On Wed, May 26, 2021 at 10:04 PM Alan Gauld via Python-list
<python-list at python.org> wrote:
>
> On 25/05/2021 23:23, Terry Reedy wrote:
>
> > In CPython's Flexible String Representation all characters in a string
> > are stored with the same number of bytes, depending on the largest
> > codepoint.
>
> I'm learning lots of new things in this thread!
>
> Does that mean that if I give Python a UTF8 string that is mostly single
> byte characters but contains one 4-byte character that Python will store
> the string as all 4-byte characters?

Nitpick: It won't be "a UTF-8 string"; it will be "a Unicode string".
UTF-8 is a scheme for representing Unicode as a series of bytes, so if
something is UTF-8, it'll be like b'Stra\xc3\x9fe' (with two bytes
representing one non-ASCII character), whereas the corresponding
Unicode string is 'Stra\xdfe' with a single character. Or, for a
character beyond the first 256 code points, '\u2026' is an ellipsis and
b'\xe2\x80\xa6' is the UTF-8 representation of that same character. And
if it's beyond the BMP, then '\U0001F921' is one of the few non-ASCII
characters that you can legitimately write off as a "funny character",
and b'\xf0\x9f\xa4\xa1' is the UTF-8 byte sequence that would carry
that.
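
To make the str/bytes distinction concrete, here is a small sketch that
encodes the three example strings above and compares code point counts
with byte counts. The name `samples` is just for illustration.

```python
# Each key is a Unicode string; each value is the UTF-8 byte sequence
# quoted in the text above for that string.
samples = {
    "Stra\xdfe": b"Stra\xc3\x9fe",      # U+00DF takes two bytes in UTF-8
    "\u2026": b"\xe2\x80\xa6",          # U+2026 takes three bytes
    "\U0001F921": b"\xf0\x9f\xa4\xa1",  # U+1F921 takes four bytes
}

for text, expected in samples.items():
    encoded = text.encode("utf-8")
    assert encoded == expected                # encoding gives those bytes
    assert encoded.decode("utf-8") == text    # and decoding round-trips
    print(f"{len(text)} code point(s) -> {len(encoded)} byte(s)")

# 6 code point(s) -> 7 byte(s)
# 1 code point(s) -> 3 byte(s)
# 1 code point(s) -> 4 byte(s)
```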

So. Yes, if you give Python a large ASCII string with a single non-BMP
character, the entire string *will* be stored as four-byte characters.

(Or, to nitpick against myself: CPython will do this. Other Python
implementations are free to do things differently; MicroPython (uPy),
for instance, actually uses UTF-8 like you were predicting. For the
rest of this post, when I say "Python", I actually mean "CPython 3.3
or later".)
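
A rough way to watch this happen, assuming CPython 3.3 or later (the
exact numbers from sys.getsizeof are an implementation detail and vary
between versions and platforms, so none are shown here):

```python
import sys

ascii_text = "a" * 1000
# Append a single non-BMP character to an otherwise ASCII string.
widened = ascii_text + "\U0001F921"

ascii_size = sys.getsizeof(ascii_text)
widened_size = sys.getsizeof(widened)
extra = widened_size - ascii_size

print(f"ASCII only:       {ascii_size} bytes")
print(f"plus one non-BMP: {widened_size} bytes")
print(f"that one extra character cost {extra} bytes")
```

On CPython the last figure should come out at a bit over 3000 bytes,
since the thousand existing characters each grow from one byte to four.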

> If so, doesn't that introduce a pretty big storage overhead for
> large strings?
>
> >
> >  >>> sys.getsizeof('\U00011111')
> > 80
> >  >>> sys.getsizeof('\U00011111'*2)
> > 84
> >  >>> sys.getsizeof('a\U00011111')
> > 84

Correct. Each additional character is going to cost you four bytes.

> Which is what this seems to be saying.
>
> I confess I had just assumed the unicode strings were stored
> in native unicode UTF8 format.
>

UTF-8 isn't native any more than any other encoding. It's a good
compact format for transmission, but it's quite inefficient for
manipulation. Python opts to spend some memory in order to improve
time, because that's usually the correct tradeoff to make - it means
that indexing in a large string is fast, slicing a large string is
fast, etc, etc, etc.
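
To illustrate why a variable-width encoding is awkward for indexing,
here is a minimal sketch of finding the n-th character in UTF-8 data.
The helper `utf8_index` is hypothetical, written only for this example;
it is not how any particular implementation does it. The point is that
it has to walk the bytes from the start, whereas a fixed-width layout
can jump straight to the right offset.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of valid UTF-8 data.

    Scans from the beginning, so the cost grows with n.
    """
    if n < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1    # index of the code point currently being scanned
    start = None  # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else
        # begins a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = pos
            elif count == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("index out of range")
    return data[start:].decode("utf-8")


text = "a\u2026\U0001F921b"
data = text.encode("utf-8")
# The scan agrees with plain str indexing at every position.
assert [utf8_index(data, i) for i in range(len(text))] == list(text)
```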

Also, the truth is that, *in practice*, very few strings will pay this
sort of penalty. If you have a whole lot of (say) Chinese text,
there's going to be a small proportion of ASCII text, but most of the
text is going to be wider characters. Most European languages need
nothing beyond the BMP (16-bit text at most, and only 8-bit if the text
stays within Latin-1). And if someone's going to use one emoji from
the supplemental planes (which would require 32-bit text), it's fairly
likely that they'll use multiple.
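
As a sketch of the rule being applied here: the width is decided by the
largest code point in the string. The function name `storage_width` is
made up for this example, and the thresholds are the three widths used
by the Flexible String Representation (1, 2 or 4 bytes per character).

```python
def storage_width(text: str) -> int:
    """Bytes per character implied by the largest code point in text."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:      # ASCII and Latin-1
        return 1
    if largest < 0x10000:    # rest of the BMP
        return 2
    return 4                 # supplementary planes


for sample in ["Strasse", "Stra\xdfe", "\u2026", "a\U0001F921"]:
    print(ascii(sample), storage_width(sample))

# 'Strasse' 1
# 'Stra\xdfe' 1
# '\u2026' 2
# 'a\U0001f921' 4
```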

And if you look at all strings in the Python interpreter, the vast
majority of them will be ASCII-only, getting optimized all the way
down to a single byte. Remember, every module-level variable is stored
in that module's dictionary, keyed by its name - and *most* variable
names in Python are ASCII.
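
You can check this for any module you like. A quick sketch, using the
standard library's collections module purely as an example (str.isascii
needs Python 3.7 or later):

```python
import collections

# A module's namespace is a dict keyed by name; look at those keys.
names = list(vars(collections))
ascii_names = [name for name in names if name.isascii()]

print(f"{len(ascii_names)} of {len(names)} names are ASCII-only")
```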

So while it's true that, in theory, a single wide character can cost
you a lot of memory... in practice, this is still a lot more compact,
overall, than storing all strings in UCS-2.

ChrisA