[Python-Dev] Internal representation of strings and Micropython (Steven D'Aprano's summary)

Fri Jun 6 05:54:55 CEST 2014

Steven D'Aprano wrote:

> (1) I asked if it would be okay for MicroPython to *optionally* use 
> nominally Unicode strings limited to ASCII. Pretty much the only 
> response to this has been Guido saying "That would be a pretty lousy 
> option", and since nobody has really defended the suggestion, I think we 
> can assume that it's off the table.

Lousy is not quite the same as forbidden.

Doing it in good faith would require making the limit prominent
in the documentation, and raising some sort of CharacterNotSupported
exception (or at least a warning) whenever there is an attempt to
create a non-ASCII string, even via the C API.
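
As a rough illustration only, here is what such a check could look like
in ordinary Python rather than in MicroPython's C internals.  The name
CharacterNotSupported comes from the suggestion above; make_str() is a
hypothetical stand-in for wherever strings are actually created, and the
sketch assumes every creation path (including the C API) would funnel
through something like it.

```python
class CharacterNotSupported(ValueError):
    """Hypothetical exception for an ASCII-only build; not an existing builtin."""


def make_str(data: bytes) -> str:
    """Illustrative stand-in for the point where a string object is created."""
    for offset, byte in enumerate(data):
        # Any byte above 0x7F cannot be part of an ASCII-only string.
        if byte > 0x7F:
            raise CharacterNotSupported(
                "non-ASCII byte 0x%02x at offset %d" % (byte, offset)
            )
    return data.decode("ascii")
```

Subclassing ValueError is just one choice; a build that prefers a
warning could call warnings.warn() at the same point instead.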

> (2) I asked if it would be okay ... to use a UTF-8 implementation 
> even though it would lead to O(N) indexing operations instead of O(1). 
> There's been some opposition to this, including Guido's:

It is bad when quirks -- even good quirks -- of one implementation lead
people to write code that will perform badly on a different Python
implementation.  CPython has at least delayed obvious optimizations for
this reason.  Changing idiomatic operations from O(1) to O(N) is a big
enough change to cause concern.
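
To make the O(N) cost concrete, here is a minimal sketch of indexing by
code point over a UTF-8 buffer.  It is plain Python written for
illustration, not MicroPython's actual code; utf8_index() is a
hypothetical helper, and it assumes the buffer holds valid UTF-8.

```python
def utf8_index(buf: bytes, i: int) -> str:
    """Return the i-th code point of valid UTF-8 data by scanning from the start."""
    if i < 0:
        raise IndexError("negative indices are not handled in this sketch")
    pos = 0
    n = len(buf)
    # Skip i whole code points; this loop is the O(N) part.
    for _ in range(i):
        if pos >= n:
            raise IndexError("string index out of range")
        pos += 1
        # Continuation bytes have the bit pattern 10xxxxxx.
        while pos < n and (buf[pos] & 0xC0) == 0x80:
            pos += 1
    if pos >= n:
        raise IndexError("string index out of range")
    # Find the end of the code point that starts at pos.
    end = pos + 1
    while end < n and (buf[end] & 0xC0) == 0x80:
        end += 1
    return buf[pos:end].decode("utf-8")
```

With a fixed-width representation the same lookup is a single array
access, which is why code like "for i in range(len(s)): s[i]" can go
from linear to quadratic when moved to a UTF-8 implementation.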

That said, the target environment itself apparently keeps N small
enough that the problem should be mostly theoretical.  If you want to
be good citizens, then do put a note in the documentation warning that
particularly long strings are likely to cause performance issues unique
to the MicroPython implementation.

(Frankly, my personal opinion is that if you're really optimizing for
space, then long strings will start getting awkward long before N is
big enough for algorithmic complexity to overcome constant factors.)

> ... those strings will need to be transcoded to UTF-8 before they
> can be written or printed, so keeping them as UTF-8 ...

That all assumes that the external world is using UTF-8 anyhow, which
is more likely to be true if you document it as a limitation of
MicroPython.

> ... but many strings may never be written out:

    print(prefix + s[1:].strip().lower().center(80) + suffix)

> creates five strings that are never written out and one that is.

But looking at the actual strings -- UTF-8 doesn't really hurt
much.  Only the slice and center() are more complex, and for a
string less than 80 characters long, O(N) is irrelevant.
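
A rough sketch of the two steps that do need to know about code points,
working directly on UTF-8 bytes.  These helpers are hypothetical and
assume valid UTF-8; the padding rule in utf8_center() is intended to
match what CPython's str.center() does, but treat that as an assumption
rather than a specification.

```python
def utf8_len(buf: bytes) -> int:
    """Count code points: every byte that is not a continuation byte starts one."""
    return sum(1 for b in buf if (b & 0xC0) != 0x80)


def utf8_drop_first(buf: bytes) -> bytes:
    """Equivalent of s[1:]: skip the lead byte and its continuation bytes."""
    pos = 1
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return buf[pos:]


def utf8_center(buf: bytes, width: int) -> bytes:
    """Equivalent of s.center(width) with space padding."""
    pad = width - utf8_len(buf)
    if pad <= 0:
        return buf
    # When the padding is odd, the extra space goes on the left only if
    # the width is odd as well.
    left = pad // 2 + (pad & width & 1)
    return b" " * left + buf + b" " * (pad - left)
```

Each helper makes a single pass over at most a line's worth of bytes.
Concatenation and stripping ASCII whitespace can operate on the bytes
as they are, since ASCII byte values never occur inside a multi-byte
sequence.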

-jJ

--

If there are still threading problems with my replies, please
email me with details, so that I can try to resolve them.  -jJ