[Python-Dev] Internal representation of strings and Micropython

Wed Jun 4 22:10:40 CEST 2014

On Wed, Jun 04, 2014 at 03:32:25PM +0000, Steve Dower wrote:
> Steven D'Aprano wrote:
> > The language semantics says that a string is an array of code points. Every
> > index relates to a single code point, no code point extends over two or more
> > indexes.
> > There's a 1:1 relationship between code points and indexes. How is direct
> > indexing "likely to be incorrect"?
> 
> We're discussing the behaviour under a different (hypothetical) design 
> decision than a 1:1 relationship between code points and indexes, so 
> arguing from that stance doesn't make much sense.

I'm open to different implementations. I earlier even suggested that the 
choice of O(1) indexing versus O(N) indexing was a quality of 
implementation issue, not a make-or-break issue for whether something 
can call itself Python (or even "99% compatible with Python").

But I don't believe that exposing that implementation at the Python 
level is valid: regardless of whether it is efficient or not, I should 
be able to write code like this:

a = [mystring[i] for i in range(len(mystring))]
b = list(mystring)
assert a == b

That is not the case if you expose the underlying byte-level 
implementation at the Python level, and treat strings as an array of 
*bytes*. Paul seems to want to do this, or at least he wants Python 4 
to do this. I think it is *completely* inappropriate to do so.
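
To make the distinction concrete, here is a minimal sketch, assuming the 
underlying storage is UTF-8 (the sample string and variable names are 
illustrative only, not taken from any real implementation). It shows the 
invariant holding for the code point view, and what would go wrong if 
indexes addressed bytes instead:

```python
# Illustrative sample: one non-ASCII code point between two ASCII ones.
mystring = "aÿb"

# Code point view: indexing and iteration agree.
a = [mystring[i] for i in range(len(mystring))]
b = list(mystring)
assert a == b
assert len(mystring) == 3
assert mystring[2] == "b"

# Byte view, assuming UTF-8 storage: 'ÿ' (U+00FF) takes two bytes,
# so every later index is shifted by one.
raw = mystring.encode("utf-8")
assert len(raw) == 4
assert raw[2] == 0xBF  # a continuation byte, not ord("b")

# A single byte taken from the middle of 'ÿ' is not a valid character.
try:
    raw[2:3].decode("utf-8")
    middle_byte_is_valid = False if False else True
except UnicodeDecodeError:
    middle_byte_is_valid = False
assert not middle_byte_is_valid
```

If s[2] could return something like that continuation byte, the 
comprehension and list() would no longer be guaranteed to agree.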

I *think* you may agree with me (correct me if I'm wrong), because you 
go on to agree with me:

> > e.g.
> > 
> > s = "---ÿ---"
> > offset = s.index('ÿ')
> > assert s[offset] == 'ÿ'
> > 
> > That cannot fail with Python's semantics.
> 
> Agreed, and it shouldn't 

but I'm not actually sure.

> (I was actually referring to the optimization 
> being incorrect for the goal, not the language semantics). What you'd 
> probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may 
> be surprising, but is also correct.

You don't seem to be talking about sys.getsizeof, so I guess you're 
talking about something at the C level (or other underlying 
implementation), ignoring the object overhead. I don't know why you 
think I'd find that surprising -- one cannot fit 0x10FFFF Unicode code 
points in a single byte, so whether you use UTF-32, UTF-16, UTF-8, 
Python 3.3's FSR or some other implementation, at least some code points 
are going to use more than one byte.
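
As a rough illustration of that point, the sketch below measures how many 
bytes a few code points occupy in three encodings. The helper name 
encoded_sizes and the sample characters are made up for this example, and 
the "-le" codec variants are used only so that no byte order mark is 
counted:

```python
def encoded_sizes(ch):
    """Return the encoded length in bytes of one code point per encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}

# Sample code points: ASCII, Latin-1 range, BMP, and outside the BMP.
for ch in ["a", "ÿ", "€", "\U0001F600"]:
    print("U+%04X" % ord(ch), encoded_sizes(ch))

# 'ÿ' (U+00FF) needs two bytes in UTF-8 and UTF-16, four in UTF-32.
assert encoded_sizes("ÿ") == {"utf-8": 2, "utf-16-le": 2, "utf-32-le": 4}
```

Whatever the encoding, len("ÿ") is still 1 at the Python level; only the 
storage cost differs.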

> But what are you trying to achieve (why are you writing this code)? 
> All this example really shows is that you're only using indexing for 
> trivial purposes.

I'm trying to understand what point you are trying to make, because I'm 
afraid I don't quite get it.

[...]
> If copying into a separate list is a problem (memory-wise), 
> re.finditer('\\S+', string) also provides the same behaviour and gives 
> me the sliced string, so there's no need to index for anything.

finditer returns an iterator of match objects, which give you the 
indexes of each found substring. Whether you do it yourself, or get the 
re module to do it, you're indexing somewhere.
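
A small sketch of what I mean, using only the standard re module (the 
sample text and variable names are invented for illustration). Each match 
carries start and end indexes, and slicing the original string with them 
gives back the matched text:

```python
import re

# Illustrative input with a non-ASCII code point in the second word.
string = "spam  ÿeggs ham"

spans = []
for m in re.finditer(r"\S+", string):
    start, end = m.span()
    # The match is defined by indexes into the original string.
    assert string[start:end] == m.group()
    spans.append((start, end))

# Indexes count code points, so 'ÿ' advances the position by one.
assert spans == [(0, 4), (6, 11), (12, 15)]
```

The indexing has not gone away; it has just moved inside the re module.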

> The downside is that it isn't as easy to teach as the 1:1 
> relationship, and currently it doesn't perform as well *in CPython*. 
> But if MicroPython is focusing on size over speed, I don't see any 
> reason why they shouldn't permit different performance characteristics 
> and require a slightly different approach to highly-optimized coding.

I don't have a problem with different implementations, so long as that 
implementation isn't exposed at the Python level with changes of 
semantics such as breaking the promise that a string is an array of code 
points, not of bytes.

> In any case, this is an interesting discussion with a genuine effect 
> on the Python interpreter ecosystem. Jython and IronPython already 
> have different string implementations from CPython - having official 
> (and hopefully flexible) guidance on deviations from the reference 
> implementation would I think help other implementations provide even 
> more value, which is only a good thing for Python.

Yes, agreed.

-- 
Steven