string storage [was: Re: imaplib: is this really so unwieldy?]

Fri May 28 02:54:00 EDT 2021

On 27/05/2021 05:54, Cameron Simpson wrote:
> On 26May2021 12:11, Jon Ribbens <jon+usenet at unequivocal.eu> wrote:
>> On 2021-05-26, Alan Gauld <alan.gauld at yahoo.co.uk> wrote:
>>> I confess I had just assumed the unicode strings were stored
>>> in native unicode UTF8 format.
>>
>> If you do that then indexing and slicing strings becomes very slow.
> 
> True, but that isn't necessarily a show stopper. My impression, on
> reflection, is that most slicing is close to the beginning or end of a
> string, and that _most_ strings are small. (Alan has exceptions at least
> to the latter.) In those circumstances, the cost of slicing a variable
> width encoding is greatly mitigated.
> 
> Indexing is probably more general (in my subjective hand waving
> guesstimation). But... how common is indexing into large strings?
> Versus, say, iteration over a large string?
> 
> I was surprised when getting introduced to Golang a few years ago that
> it stores all Strings as UTF8 byte sequences. And when writing Go code,
> I found very few circumstances where that would actually bring
> performance issues, which I attribute in part to my suggestions above
> about when, in practical terms, we slice and index strings.
> 
> If the internal storage is UTF8, then in an ecosystem where all, or
> most, text files are themselves UTF8 then reading a text file has zero
> decoding cost - you can just read the bytes and store them! And to write
> a String out to a UTF8 file, you just copy the bytes - zero encoding!
> 
--------------------------------
> Also, UTF8 is a funny thing - it is deliberately designed so that you
> can just jump into the middle of an arbitrary stream of UTF8 bytes and
> find the character boundaries. That doesn't solve slicing/indexing in
> general, but it does avoid any risk of producing mojibake just by
> starting your decode at a random place.
> 

Perhaps you are referring to what the Python language does if you jump
to an arbitrary position in a UTF-8 string. Otherwise, before you start
decoding, you should align to the beginning of a UTF-8 character by
discarding the continuation bytes, i.e. those that meet the following
test:

(byte & 0xc0) == 0x80   /* C language: top two bits are 10 */
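
The same alignment step can be sketched in Python. This is only an
illustration, assuming the input is valid UTF-8; the function name
align_to_char_start and the sample string are made up for the example.

```python
def align_to_char_start(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a UTF-8 continuation byte."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# "è" is encoded as two bytes (0xC3 0xA8), at indexes 4 and 5.
data = "caffè latte".encode("utf-8")

# Index 5 falls in the middle of "è", so we skip forward to index 6.
start = align_to_char_start(data, 5)
tail = data[start:].decode("utf-8")
print(start, repr(tail))
```

Decoding data[5:] directly would raise UnicodeDecodeError, because the
slice begins with a continuation byte; after the alignment the decode
succeeds.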

--------------------------------
> Cheers,
> Cameron Simpson <cs at cskk.id.au>
>