[Python-Dev] Internal representation of strings and Micropython

Wed Jun 4 09:03:22 CEST 2014

On Wed, Jun 4, 2014 at 3:23 PM, Guido van Rossum <guido at python.org> wrote:
> On Tue, Jun 3, 2014 at 7:32 PM, Chris Angelico <rosuav at gmail.com> wrote:
>>
>> On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano <steve at pearwood.info>
>> wrote:
>> > * Having a build-time option to restrict all strings to ASCII-only.
>> >
>> >   (I think what they mean by that is that strings will be like Python 2
>> >   strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
>>
>> What I was actually suggesting along those lines was that the str type
>> still be notionally a Unicode string, but that any codepoints >127
>> would either raise an exception or blow an assertion, and all the code
>> to handle multibyte representations would be compiled out.
>
>
> That would be a pretty lousy option.
>
>> So there'd
>> still be a difference between strings of text and streams of bytes,
>> but all encoding and decoding to/from ASCII-compatible encodings would
>> just point to the same bytes in RAM.
>
> I suppose this is why you propose to reject 128-255?

Correct. It would allow small devices to guarantee that strings are
compact (MicroPython is aimed primarily at an embedded controller),
guarantee identity transformations in several common encodings (and
maybe this sort of build wouldn't ship with any non-ASCII-compat
encodings at all), and never demonstrate behaviour different from
CPython's except by explicitly failing.
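
To illustrate the "identity transformation" point, here is a minimal sketch
in ordinary Python. The helper name and the list of encodings are my own
assumptions for the example, not anything in MicroPython; a real build would
return the string's existing buffer rather than making a copy.

```python
# Hypothetical helper, not MicroPython code: models an ASCII-only str build.
ASCII_COMPATIBLE = ("ascii", "utf-8", "latin-1", "cp1252")

def encode_ascii_only(text, encoding="utf-8"):
    """Encode text, failing loudly on any codepoint above 127."""
    if encoding.lower() not in ASCII_COMPATIBLE:
        raise LookupError("not an ASCII-compatible encoding: " + encoding)
    for ch in text:
        if ord(ch) > 127:
            raise ValueError("codepoint U+%04X is outside ASCII" % ord(ch))
    # For pure-ASCII text every encoding listed above yields the same bytes,
    # so a real implementation could hand back the string's own buffer.
    return text.encode("ascii")
```

Whichever of those encodings is requested, the result is byte-for-byte the
same, and anything outside ASCII fails explicitly instead of misbehaving.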

>> Risk: Someone would implement that with assertions, then compile with
>> assertions disabled, test only with ASCII, and have lurking bugs.
>
>
> Never mind disabling assertions -- even with enabled assertions you'd have
> to expect most Python programs to fail with non-ASCII input.

Right, which is why I don't like the idea. But you don't need
non-ASCII characters to blink an LED or turn a servo, and there is
significant resistance to the notion that appending a non-ASCII
character to a long ASCII-only string requires the whole string to be
copied and doubled in size (lots of heap space used).

> Then again the UTF-8 option would be pretty devastating too for anything
> manipulating strings (especially since many Python APIs are defined using
> indexes, e.g. the re module).

That's what I thought, too, but a quick poll on python-list suggests
that indexing isn't nearly as common as I had thought it to be. On a
smallish device, you won't have megabytes of string to index, so even
O(N) indexing can't get pathological. (This would be an acknowledged
limitation of MicroPython as a Unix Python - "it's designed for small
programs, and it's performance-optimized for small programs, so it
might get pathologically slow on certain large data manipulations".)
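
For concreteness, this is roughly what O(N) indexing over a UTF-8 buffer
looks like. It is only a sketch in Python, with a function name I made up,
and it assumes the buffer is valid UTF-8; the actual MicroPython code would
be in C and may well differ.

```python
def utf8_char_at(buf, index):
    """Return character number `index` from UTF-8 bytes by linear scan."""
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    n = len(buf)
    pos = 0
    # Step over `index` whole characters: one lead byte each, followed by
    # any continuation bytes (those matching the bit pattern 10xxxxxx).
    for _ in range(index):
        if pos >= n:
            raise IndexError("string index out of range")
        pos += 1
        while pos < n and (buf[pos] & 0xC0) == 0x80:
            pos += 1
    if pos >= n:
        raise IndexError("string index out of range")
    # Find the end of the character that starts at pos.
    end = pos + 1
    while end < n and (buf[end] & 0xC0) == 0x80:
        end += 1
    return buf[pos:end].decode("utf-8")
```

The cost is proportional to the index, so indexing near the end of a string
of a few hundred bytes means a few hundred byte comparisons.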

> Why not support variable-width strings like CPython 3.4?

That was my first recommendation, and in fact I started writing code
to implement parts of PEP 393, with a view to basically doing it the
same way in both Pythons. But discussion on the tracker issue showed a
certain amount of hostility toward the potential expansion of strings,
particularly in the worst-case example of appending a single SMP
character onto a long ASCII string.
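
The worst case can be put in numbers with a rough model. This sketch counts
character data only and ignores object headers; the function names are mine,
and the width rule (1, 2 or 4 bytes per character, chosen by the largest
codepoint in the string) is the PEP 393 one.

```python
def fixed_width_size(text):
    """Bytes of character data under a PEP 393 style fixed-width layout."""
    top = max(map(ord, text), default=0)
    width = 1 if top < 0x100 else 2 if top < 0x10000 else 4
    return width * len(text)

def utf8_size(text):
    """Bytes of character data if the string is stored as UTF-8."""
    return len(text.encode("utf-8"))

base = "a" * 1000
grown = base + "\U0001F40D"  # append a single SMP character

print(fixed_width_size(base), fixed_width_size(grown))  # 1000 4004
print(utf8_size(base), utf8_size(grown))                # 1000 1004
```

One appended character takes the fixed-width form from 1000 bytes to 4004,
where the UTF-8 form grows by only four bytes.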

ChrisA