string storage [was: Re: imaplib: is this really so unwieldy?]

Thu May 27 05:32:04 EDT 2021

On Thu, May 27, 2021 at 1:56 PM Cameron Simpson <cs at cskk.id.au> wrote:
>
> On 26May2021 12:11, Jon Ribbens <jon+usenet at unequivocal.eu> wrote:
> >On 2021-05-26, Alan Gauld <alan.gauld at yahoo.co.uk> wrote:
> >> I confess I had just assumed the unicode strings were stored
> >> in native unicode UTF8 format.
> >
> >If you do that then indexing and slicing strings becomes very slow.
>
> True, but that isn't necessarily a show stopper. My impression, on
> reflection, is that most slicing is close to the beginning or end of a
> string, and that _most_ strings are small. (Alan has exceptions at least
> to the latter.) In those circumstances, the cost of slicing a variable
> width encoding is greatly mitigated.
>
> Indexing is probably more general (in my subjective hand waving
> guesstimation). But... how common is indexing into large strings?
> Versus, say, iteration over a large string?

Common enough that, when all this was originally discussed, O(1)
indexing and slicing was mandated. It wasn't until MicroPython came
along that it was even entertained as a possibility that O(n) slicing
could be reasonable.
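
To make the cost difference concrete, here is a rough sketch, not how any
real interpreter does it. The function names are made up for illustration,
and UTF-32 stands in for "some fixed-width storage" (CPython itself picks 1,
2 or 4 bytes per character depending on the widest character in the string).
With fixed-width units the byte offset is a multiplication; with UTF-8 you
have to walk the bytes and count character starts.

```python
UNIT = 4  # bytes per character in UTF-32


def char_at_fixed_width(buf: bytes, index: int) -> str:
    """O(1): the byte offset is simply index * UNIT."""
    if not 0 <= index < len(buf) // UNIT:
        raise IndexError("string index out of range")
    start = index * UNIT
    return buf[start:start + UNIT].decode("utf-32-le")


def char_at_utf8(buf: bytes, index: int) -> str:
    """O(n): scan from the start, counting bytes that begin a character."""
    if index < 0:
        raise IndexError("string index out of range")
    count = -1
    start = None
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a character starts here
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    return buf[start:].decode("utf-8")  # the requested character was the last one
```

Both return the same answer for any valid index; the difference is only in
how much of the buffer they have to touch to get there.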

> I was surprised when getting introduced to Golang a few years ago that
> it stores all Strings as UTF8 byte sequences. And when writing Go code,
> I found very few circumstances where that would actually bring
> performance issues, which I attribute in part to my suggestions above
> about when, in practical terms, we slice and index strings.
>
> If the internal storage is UTF8, then in an ecosystem where all, or
> most, text files are themselves UTF8 then reading a text file has zero
> decoding cost - you can just read the bytes and store them! And to write
> a String out to a UTF8 file, you just copy the bytes - zero encoding!

True. IF everything is indeed in the same encoding.

> Also, UTF8 is a funny thing - it is deliberately designed so that you
> can just jump into the middle of an arbitrary stream of UTF8 bytes and
> find the character boundaries. That doesn't solve slicing/indexing in
> general, but it does avoid any risk of producing mojibake just by
> starting your decode at a random place.

Yes, that's true, you can avoid mojibake. But you still can't easily
say "which is the 505005th character". The only way for it to work is
to have some kind of string reference type that carries both the
character index and the byte position, and is capable of arithmetic;
and now we're into the messes of pointer manipulation. Whichever way
you do it, you're just moving the mess around.
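
As a hedged illustration of both points, here is a minimal sketch. The names
next_boundary and Utf8Pos are hypothetical, not from any library, and the
code assumes the buffer is valid UTF-8. Finding a character boundary from an
arbitrary byte offset is easy, because continuation bytes always look like
0b10xxxxxx. Knowing *which* character you are at is the hard part, so the
position object has to carry both numbers and keep them in step.

```python
from dataclasses import dataclass


def next_boundary(buf: bytes, pos: int) -> int:
    """From any byte offset, skip forward to the next character start."""
    while pos < len(buf) and buf[pos] & 0xC0 == 0x80:  # continuation byte
        pos += 1
    return pos


@dataclass(frozen=True)
class Utf8Pos:
    """A position in a UTF-8 buffer: character index plus byte offset."""
    buf: bytes
    char_index: int = 0
    byte_offset: int = 0

    def advance(self, n: int = 1) -> "Utf8Pos":
        """Step forward n characters; cost is proportional to bytes skipped."""
        pos = self.byte_offset
        for _ in range(n):
            if pos >= len(self.buf):
                raise IndexError("position past end of string")
            pos = next_boundary(self.buf, pos + 1)
        return Utf8Pos(self.buf, self.char_index + n, pos)

    def char(self) -> str:
        """Decode the single character at this position."""
        if self.byte_offset >= len(self.buf):
            raise IndexError("no character at end of string")
        end = next_boundary(self.buf, self.byte_offset + 1)
        return self.buf[self.byte_offset:end].decode("utf-8")
```

Relative movement from a known position is cheap, but getting to the
505005th character from nothing is still a scan from the start, and every
position is only meaningful alongside the buffer it was made from.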

ChrisA