[Python-Dev] Internal representation of strings and Micropython
Steven D'Aprano
steve at pearwood.info
Wed Jun 4 22:10:40 CEST 2014
On Wed, Jun 04, 2014 at 03:32:25PM +0000, Steve Dower wrote:
> Steven D'Aprano wrote:
> > The language semantics says that a string is an array of code points. Every
> > index relates to a single code point, no code point extends over two or more
> > indexes.
> > There's a 1:1 relationship between code points and indexes. How is direct
> > indexing "likely to be incorrect"?
>
> We're discussing the behaviour under a different (hypothetical) design
> decision than a 1:1 relationship between code points and indexes, so
> arguing from that stance doesn't make much sense.
I'm open to different implementations. I earlier even suggested that the
choice of O(1) indexing versus O(N) indexing was a quality of
implementation issue, not a make-or-break issue for whether something
can call itself Python (or even "99% compatible with Python").
But I don't believe that exposing that implementation at the Python
level is valid: regardless of whether it is efficient or not, I should
be able to write code like this:
    a = [mystring[i] for i in range(len(mystring))]
    b = list(mystring)
    assert a == b
That is not the case if you expose the underlying byte-level
implementation at the Python level, and treat strings as an array of
*bytes*. Paul seems to want to do this, or at least he wants Python 4
to do this. I think it is *completely* inappropriate to do so.
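To make the difference concrete, here is a minimal sketch. It assumes UTF-8 as the underlying storage and simulates a byte-indexed string with an ordinary bytes object; the sample string and the names raw, byte_items and char_items are only illustrative, not anything from MicroPython itself.

```python
# Sketch: code-point indexing versus a simulated byte-indexed string.
mystring = "---ÿ---"

# Python's str semantics: indexing and iteration see the same items.
a = [mystring[i] for i in range(len(mystring))]
b = list(mystring)
assert a == b

# Assumption: the implementation stores UTF-8 and exposes it directly.
# We simulate that by indexing the encoded bytes instead of the str.
raw = mystring.encode("utf-8")
assert len(mystring) == 7   # seven code points
assert len(raw) == 8        # 'ÿ' (U+00FF) occupies two bytes in UTF-8

# One-byte items no longer line up with the characters of the string.
byte_items = [raw[i:i + 1] for i in range(len(raw))]
char_items = [ch.encode("utf-8") for ch in mystring]
assert byte_items != char_items

# A byte offset returned by a search points at only half of 'ÿ'.
offset = raw.index("ÿ".encode("utf-8"))
assert raw[offset:offset + 1] != "ÿ".encode("utf-8")
```

The first assertion is the promise I care about; the later ones show what is lost once byte offsets leak through to the Python level.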
I *think* you may agree with me, (correct me if I'm wrong) because you
go on to agree with me:
> > e.g.
> >
> > s = "---ÿ---"
> > offset = s.index('ÿ')
> > assert s[offset] == 'ÿ'
> >
> > That cannot fail with Python's semantics.
>
> Agreed, and it shouldn't
but I'm not actually sure.
> (I was actually referring to the optimization
> being incorrect for the goal, not the language semantics). What you'd
> probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may
> be surprising, but is also correct.
You don't seem to be talking about sys.getsizeof, so I guess you're
talking about something at the C level (or other underlying
implementation), ignoring the object overhead. I don't know why you
think I'd find that surprising -- one cannot fit Unicode code points up
to 0x10FFFF in a single byte, so whether you use UTF-32, UTF-16, UTF-8,
Python 3.3's FSR or some other implementation, at least some code points
are going to use more than one byte.
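As a rough illustration of that point, the sketch below measures how many bytes a few sample code points take in three standard encodings. The helper name encoded_sizes and the choice of samples are mine, and this says nothing about per-object overhead or about how the FSR lays out a whole string.

```python
# Sketch: bytes needed per code point in common Unicode encodings.
# The "-le" codec variants are used so that no byte order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch):
    """Return a dict mapping each encoding name to len(ch.encode(name))."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


samples = ["a", "ÿ", "€", "\U0001F600"]
for ch in samples:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

With UTF-8, 'ÿ' comes out at two bytes, which matches the figure quoted above; only ASCII code points fit in a single byte in that encoding.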
> But what are you trying to achieve (why are you writing this code)?
> All this example really shows is that you're only using indexing for
> trivial purposes.
I'm trying to understand what point you are trying to make, because I'm
afraid I don't quite get it.
[...]
> If copying into a separate list is a problem (memory-wise),
> re.finditer('\\S+', string) also provides the same behaviour and gives
> me the sliced string, so there's no need to index for anything.
finditer returns a bunch of MatchObjects, which give you the indexes
of the found substring. Whether you do it yourself, or get the re
module to do it, you're indexing somewhere.
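A small sketch of what I mean, using only the documented span() and group() methods of match objects; the sample text is made up for illustration:

```python
import re

# Sketch: each match from re.finditer carries the indexes of the substring.
text = "spam ÿggs ham"
matches = list(re.finditer(r"\S+", text))

for m in matches:
    start, end = m.span()
    # Slicing with the reported indexes reproduces the matched substring.
    assert text[start:end] == m.group()

spans = [m.span() for m in matches]
```

Those spans are code-point indexes into the original string, so the re module is doing the indexing on your behalf rather than avoiding it.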
> The downside is that it isn't as easy to teach as the 1:1
> relationship, and currently it doesn't perform as well *in CPython*.
> But if MicroPython is focusing on size over speed, I don't see any
> reason why they shouldn't permit different performance characteristics
> and require a slightly different approach to highly-optimized coding.
I don't have a problem with different implementations, so long as that
implementation isn't exposed at the Python level with changes of
semantics such as breaking the promise that a string is an array of code
points, not of bytes.
> In any case, this is an interesting discussion with a genuine effect
> on the Python interpreter ecosystem. Jython and IronPython already
> have different string implementations from CPython - having official
> (and hopefully flexible) guidance on deviations from the reference
> implementation would I think help other implementations provide even
> more value, which is only a good thing for Python.
Yes, agreed.
--
Steven