[Python-Dev] Internal representation of strings and Micropython

Thu Jun 5 09:00:01 CEST 2014

Glenn Linderman writes:

 > 3) (Most space efficient) One cached entry, that caches the last 
 > codepoint/byte position referenced. UTF-8 is able to be traversed in 
 > either direction, so "next/previous" codepoint access would be 
 > relatively fast (and such are very common operations, even when indexing 
 > notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)

Been there, tried that (Emacsen).  Either it's a YAGNI (moving forward
or backward over UTF-8 by characters short distances is plenty fast,
especially if you've got a lot of ASCII you can move by words for
somewhat longer distances), or it's not good enough.  There *may* be a
sweet spot, but it's definitely smaller than the one on Sharapova's
racket.

 > 4) (Fixed size caches)  N entries, one for the last codepoint, and 
 > others at Codepoint_Length/N intervals.  N could be tunable.

To achieve space saving, cache has to be quite small, and the bigger
your integers, the smaller it gets.  A naive implementation on 64-bit
machine would give you 16 bytes/cache entry.  Using a non-native size
will be a space win, but needs care in implementation.  Initializing
the cache is very expensive for small strings, so you need conditional
and maybe lazy initialization (for large strings).

By the way, there's also

10) Keep counts of the leading and trailing number of ASCII
    (one-octet) characters.  This is often a *huge* win; it's quite
    common to encounter documents where size - lc - tc = 2 (ie,
    there's only one two-octet character in the document).

11) Keep a list (or tree) of most-recently-accessed positions.

Despite my negative experience with multibyte encodings in Emacsen,
I'm persuaded by the arguments that there probably aren't all that
many places in core Python where indexing is used in an essential way,
so MicroPython itself can probably optimize those "behind the
scenes".  Application programmers in the embedded context may be
expected to be deal with the need to avoid random access algorithms
and use iterators and generators to accomplish most tasks.