Micro Python -- a lean and efficient implementation of Python 3

Wed Jun 4 07:01:52 EDT 2014

On 2014-06-04 00:58, Paul Rubin wrote:
> Steven D'Aprano <steve at pearwood.info> writes:
> >> Maybe there's a use-case for a microcontroller that works in
> >> ISO-8859-5 natively, thus using only eight bits per character, 
> > That won't even make the Russians happy, since in Russia there
> > are multiple incompatible legacy encodings.
> 
> I've never understood why not use UTF-8 for everything.

If you use UTF-8 for everything, then you end up in a world where
string indexing (see ChrisA's other side thread on this topic) is no
longer an O(1) operation but an O(N) one, because each character
occupies a variable number of bytes.  Some of us slice strings for a
living. ;-)  I understand that using UTF-32 would keep indexing O(1)
at the cost of every character occupying 4 bytes.  The FSR (Flexible
String Representation, PEP 393; again, as I understand it) lets
strings that fit in one byte per character use that, scaling up to 2
or 4 bytes per character only when the string actually needs them.
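
To make the O(N)-versus-O(1) point concrete, here is a minimal sketch
of what indexing by code point looks like over raw UTF-8 bytes
compared with raw UTF-32 bytes.  The function names (utf8_index,
utf32_index, _seq_len) are made up for illustration; this is not how
CPython or Micro Python implements str.

```python
def _seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def utf8_index(data: bytes, i: int) -> str:
    """Return code point i of UTF-8 data; must walk over i sequences first: O(N)."""
    if i < 0:
        raise IndexError("negative indices are not handled in this sketch")
    pos = 0
    for _ in range(i):
        if pos >= len(data):
            raise IndexError("string index out of range")
        pos += _seq_len(data[pos])
    if pos >= len(data):
        raise IndexError("string index out of range")
    return data[pos:pos + _seq_len(data[pos])].decode("utf-8")


def utf32_index(data: bytes, i: int) -> str:
    """Return code point i of UTF-32-LE data; the byte offset is just 4 * i: O(1)."""
    if not 0 <= i < len(data) // 4:
        raise IndexError("string index out of range")
    return data[4 * i:4 * i + 4].decode("utf-32-le")
```

Both functions return the same character for the same index; the
UTF-8 version pays for its compactness by scanning from the start,
while the UTF-32 version pays 4 bytes for every character.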

At the cost of added complexity and a memory overhead that grows with
the length of the string, the O(N) lookup could be tweaked down to
O(log N) by using an internal balanced tree of offsets-to-chunks
(where the chunk size is the size of a block below which it is faster
to scan linearly than to navigate the tree).  One might even endow
the algorithm with FSR smarts, so each chunk/fragment could use a
different encoding in memory, and linearly iterating over the string
would walk the tree, returning each decoded
piece. </random_ramblings>
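
A rough sketch of the chunked idea, under some loud assumptions: the
class name ChunkedString and the max_chunk value are invented here,
and instead of a balanced tree it keeps a sorted list of chunk start
offsets and binary-searches it with bisect.  That gives the same
O(log N) lookup for an immutable string; a real tree would only
matter if the string had to support edits.  Each chunk picks the
narrowest fixed-width encoding that holds its characters, so the
lookup inside a chunk is a plain multiplication rather than a scan.

```python
from bisect import bisect_right

# Fixed-width encodings keyed by bytes per code point.
_ENC = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}


def _width(ch: str) -> int:
    """Bytes per code point needed to store ch at a fixed width."""
    cp = ord(ch)
    return 1 if cp < 0x100 else 2 if cp < 0x10000 else 4


class ChunkedString:
    """Immutable string stored as per-chunk fixed-width fragments (a sketch).

    Lone surrogates are not handled; encoding them would raise.
    """

    def __init__(self, text: str, max_chunk: int = 64):
        self.length = len(text)
        self.starts = []  # code point offset at which each chunk begins
        self.chunks = []  # (encoded bytes, bytes per code point)
        start = 0
        while start < len(text):
            width = _width(text[start])
            end = start + 1
            # Extend the chunk while the width class stays the same.
            while (end < len(text) and end - start < max_chunk
                   and _width(text[end]) == width):
                end += 1
            self.starts.append(start)
            self.chunks.append((text[start:end].encode(_ENC[width]), width))
            start = end

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, i: int) -> str:
        if i < 0:
            i += self.length
        if not 0 <= i < self.length:
            raise IndexError("string index out of range")
        k = bisect_right(self.starts, i) - 1   # O(log number_of_chunks)
        data, width = self.chunks[k]
        off = (i - self.starts[k]) * width     # O(1) inside the chunk
        return data[off:off + width].decode(_ENC[width])

    def __iter__(self):
        # Linear iteration walks the chunks in order, decoding each one.
        for data, width in self.chunks:
            yield from data.decode(_ENC[width])
```

Something like ChunkedString("hello мир")[6] would then find the
right chunk by binary search and decode a single character from it.
Text that keeps alternating between width classes degrades into many
tiny chunks, which is exactly the complexity and memory cost
mentioned above.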

-tkc