[Python-Dev] Internal representation of strings and Micropython

Steven D'Aprano steve at pearwood.info
Wed Jun 4 16:12:45 CEST 2014


On Wed, Jun 04, 2014 at 01:14:04PM +0000, Steve Dower wrote:
> I'm agree with Daniel. Directly indexing into text suggests an 
> attempted optimization that is likely to be incorrect for a set of 
> strings. 

I'm afraid I don't understand this argument. The language semantics says 
that a string is an array of code points. Every index relates to a 
single code point, no code point extends over two or more indexes. 
There's a 1:1 relationship between code points and indexes. How is 
direct indexing "likely to be incorrect"?

e.g.

s = "---ÿ---"
offset = s.index('ÿ')
assert s[offset] == 'ÿ'

That cannot fail with Python's semantics.

[Aside: it does fail in Python 2, showing that the idea that "strings 
are bytes" is fatally broken. Fortunately Python has moved beyond that.]


> Splitting, regex, concatenation and formatting are really the 
> main operations that matter, and MicroPython can optimize their 
> implementation of these easily enough for O(N) indexing.

Really? Well, it will be a nice experiment. Fortunately MicroPython runs 
under Linux as well as on embedded systems (a clever decision, by the 
way) so I look forward to seeing how their internal-utf8 implementation 
stacks up against CPython's FSR implementation.

Out of curiosity, when the FSR was proposed, did anyone consider an 
internal UTF-8 representation? If so, why was it rejected?




-- 
Steven


More information about the Python-Dev mailing list