string storage [was: Re: imaplib: is this really so unwieldy?]

Wed May 26 23:54:57 EDT 2021

On 26May2021 12:11, Jon Ribbens <jon+usenet at unequivocal.eu> wrote:
>On 2021-05-26, Alan Gauld <alan.gauld at yahoo.co.uk> wrote:
>> I confess I had just assumed the unicode strings were stored
>> in native unicode UTF8 format.
>
>If you do that then indexing and slicing strings becomes very slow.

True, but that isn't necessarily a showstopper. My impression, on
reflection, is that most slicing is close to the beginning or end of a
string, and that most strings are small. (Alan has exceptions at least
to the latter.) In those circumstances, the cost of slicing a
variable-width encoding is greatly mitigated.
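
To make that concrete, here is a rough Python sketch of taking a prefix
slice directly from UTF8 bytes. The helper names byte_offset and
utf8_prefix are made up for illustration (they are not from any
library), and the code assumes the input is already valid UTF8. The
scan stops as soon as it reaches the requested character, so the cost
depends on how far into the string you slice rather than on the
string's total length.

```python
def byte_offset(data: bytes, index: int) -> int:
    """Return the byte offset of code point number `index` in `data`.

    Hypothetical helper; assumes `data` is valid UTF-8.
    """
    count = 0
    for pos, byte in enumerate(data):
        # Any byte that is not a continuation byte (10xxxxxx) starts a
        # new code point.
        if (byte & 0xC0) != 0x80:
            if count == index:
                return pos
            count += 1
    if count == index:
        # Slicing exactly at the end of the string is allowed.
        return len(data)
    raise IndexError(index)


def utf8_prefix(data: bytes, n: int) -> bytes:
    """Return the first `n` code points of `data`, still UTF-8 encoded."""
    return data[:byte_offset(data, n)]


data = "héllo wörld".encode("utf-8")
print(utf8_prefix(data, 4).decode("utf-8"))
```

A slice near the end could be handled the same way by scanning
backwards from the last byte.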

Indexing is probably more general (in my subjective hand-waving
guesstimation). But... how common is indexing into large strings?
Versus, say, iteration over a large string?
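
Iteration, unlike indexing, never needs random access: each lead byte
says how long its character is, so you can walk the bytes front to
back. A minimal sketch in Python, assuming valid UTF8 input (the
function name iter_code_points is hypothetical, and there is no error
handling for malformed bytes):

```python
def iter_code_points(data: bytes):
    """Yield each character of valid UTF-8 `data`, one at a time."""
    pos = 0
    while pos < len(data):
        lead = data[pos]
        # The high bits of the lead byte give the sequence length.
        if lead < 0x80:            # 0xxxxxxx: ASCII, 1 byte
            length = 1
        elif lead >> 5 == 0b110:   # 110xxxxx: 2 bytes
            length = 2
        elif lead >> 4 == 0b1110:  # 1110xxxx: 3 bytes
            length = 3
        else:                      # 11110xxx: 4 bytes (valid input assumed)
            length = 4
        yield data[pos:pos + length].decode("utf-8")
        pos += length


for ch in iter_code_points("aé日😀".encode("utf-8")):
    print(ch)
```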

I was surprised, when I was introduced to Go a few years ago, that
it stores all strings as UTF8 byte sequences. And when writing Go code,
I found very few circumstances where that actually caused
performance issues, which I attribute in part to my suggestions above
about when, in practical terms, we slice and index strings.

If the internal storage is UTF8, then in an ecosystem where all, or
most, text files are themselves UTF8, reading a text file has zero
decoding cost - you can just read the bytes and store them! And to write
a string out to a UTF8 file, you just copy the bytes - zero encoding!

Also, UTF8 is a funny thing - it is deliberately designed so that you 
can just jump into the middle of an arbitrary stream of UTF8 bytes and 
find the character boundaries. That doesn't solve slicing/indexing in 
general, but it does avoid any risk of producing mojibake just by 
starting your decode at a random place.
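
The boundary-finding trick is small enough to sketch in Python. The
name next_boundary is just for illustration, and the sketch assumes the
bytes are valid UTF8 to begin with. Continuation bytes always look like
10xxxxxx, so from any offset you skip forward over those until you hit
a byte that starts a character:

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that starts a UTF-8 character."""
    # Continuation bytes have the bit pattern 10xxxxxx; skip over them.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "naïve café, 日本語".encode("utf-8")

# Offset 3 lands in the middle of the two-byte "ï"; resynchronise first.
start = next_boundary(data, 3)
print(data[start:].decode("utf-8"))
```

At most three bytes are skipped, since a UTF8 character is never longer
than four bytes.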

Cheers,
Cameron Simpson <cs at cskk.id.au>