[Python-Dev] Internal representation of strings and Micropython

Tim Delaney timothy.c.delaney at gmail.com
Fri Jun 6 23:35:59 CEST 2014


On 7 June 2014 00:52, Paul Sokolovsky <pmiscml at gmail.com> wrote:

> > At heart, this is exactly what the Python 3 "str" type is. The
> > universal convention is "code points".
>
> Yes. Except for one small detail - Python3 specifies these code points
> to be Unicode code points. And Unicode is a very bloated thing.
>
> But if we drop that "Unicode" stipulation, then it's also exactly what
> MicroPython implements. Its "str" type consists of codepoints, we don't
> have pet names for them yet, like Unicode does, but their numeric
> values are 0-255. Note that it in no way limits encodings, characters,
> or scripts which can be used with MicroPython, because just like
> Unicode, it supports the concept of "surrogate pairs" (but we don't call
> them that) - specifically, smaller code points may comprise bigger
> groupings. But unlike Unicode, we don't stipulate format, value or
> other constraints on how these "surrogate pairs"-alikes are formed,
> leaving that to users.


I think you've missed my point.

There is absolutely nothing conceptually bloaty about what a Python 3
string is. It's just like a 7-bit ASCII string, except each entry can be
from a larger table. When you index into a Python 3 string, you get back
exactly *one valid entry* from the Unicode code point table. That plus the
length of the string, plus the guarantee of immutability gives everything
needed to layer the rest of the string functionality on top.
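
A minimal sketch of that model, assuming a CPython 3 interpreter (the
sample string and variable names are mine, purely for illustration):

```python
# Four code points of very different magnitude: 'a', e-acute, euro sign, an emoji.
s = "a\u00e9\u20ac\U0001F600"

# The length counts code points, not bytes or code units.
length = len(s)

# Indexing always yields a str of length 1 holding exactly one code point.
code_points = [ord(s[i]) for i in range(len(s))]

print(length)                          # 4
print([hex(cp) for cp in code_points]) # ['0x61', '0xe9', '0x20ac', '0x1f600']
```

Everything else (slicing, searching, iteration) can be described in terms
of len() and indexing over those entries.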

There are no surrogate pairs - each code point is standalone (unlike code
*units*). It is conceptually very simple. The implementation may be
difficult (if you're trying to do better than 4 bytes per code point) but
the concept is dead simple.
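
To make the code point / code unit distinction concrete, here is a rough
sketch, again assuming CPython 3. The encodings are only used to expose
code units; they say nothing about how any implementation stores a str
internally:

```python
import struct

s = "a\u00e9\u20ac\U0001F600"   # 4 code points
emoji = "\U0001F600"            # 1 code point, outside the BMP

# As a str, the emoji is one standalone entry.
emoji_len = len(emoji)

# Encoded as UTF-16, that same code point needs two 16-bit code *units*
# (a surrogate pair). That is a property of the encoding, not of the str.
utf16 = emoji.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(utf16) // 2), utf16)

# Fixed width (UTF-32) costs 4 bytes per code point; UTF-8 is variable width.
utf32_size = len(s.encode("utf-32-le"))
utf8_size = len(s.encode("utf-8"))

print(emoji_len)                 # 1
print([hex(u) for u in units])   # ['0xd83d', '0xde00']
print(utf32_size, utf8_size)     # 16 10
```

Doing better than the 16 bytes in the fixed-width case is where the
implementation effort goes; the user-visible model stays the same.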

If the MicroPython string type requires people *using* it to deal with
surrogates (i.e. indexing could return a value that is not a valid Unicode
code point) then it will have broken the conceptual simplicity of the
Python 3 string type (and most certainly can't be considered in any way
compatible with it).
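
As a hedged illustration of the failure mode I mean, here is a CPython 3
sketch that uses a bytes object as a stand-in for a string type whose
entries are limited to 0-255. This is an assumption made for the example,
not a description of MicroPython's actual behaviour, and decodes_alone is
a hypothetical helper:

```python
euro = "\u20ac"             # one code point as a Python 3 str
raw = euro.encode("utf-8")  # three bytes: e2 82 ac

def decodes_alone(fragment):
    """Return True if the fragment is valid UTF-8 on its own (hypothetical helper)."""
    try:
        fragment.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Indexing the str gives back the whole code point.
str_entry = ord(euro[0])

# Indexing the byte sequence gives back pieces that mean nothing in isolation.
byte_entries = [raw[i] for i in range(len(raw))]
fragments_valid = [decodes_alone(raw[i:i + 1]) for i in range(len(raw))]

print(hex(str_entry))                  # 0x20ac
print([hex(b) for b in byte_entries])  # ['0xe2', '0x82', '0xac']
print(fragments_valid)                 # [False, False, False]
```

With the byte-oriented type, the user has to know how to reassemble the
pieces, which is exactly the burden the Python 3 str removes.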

Tim Delaney