RE Module Performance

Wed Jul 24 14:15:42 EDT 2013

On Thu, Jul 25, 2013 at 3:52 AM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 7/24/2013 11:00 AM, Michael Torrie wrote:
>>
>> On 07/24/2013 08:34 AM, Chris Angelico wrote:
>>>
>>> Frankly, Python's strings are a *terrible* internal representation
>>> for an editor widget - not because of PEP 393, but simply because
>>> they are immutable, and every keypress would result in a rebuilding
>>> of the string. On the flip side, I could quite plausibly imagine
>>> using a list of strings;
>
>
> I used exactly this, a list of strings, for a Python-coded text-only mock
> editor to replace the tk Text widget in idle tests. It works fine for the
> purpose. For small test texts, the inefficiency of immutable strings is not
> relevant.
>
> Tk apparently uses a C-coded btree rather than a Python list. All details
> are hidden, unless one finds and reads the source ;-), but it uses C
> arrays rather than Python strings.
>
>
>>> In this usage, the FSR is beneficial, as it's possible to have
>>> different strings at different widths.
>
>
> For my purpose, the mock Text works the same in 2.7 and 3.3+.

Thanks for that report! And yes, it's going to behave exactly the same
way, because its underlying structure is an ordered list of ordered
lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
performance. But if you put your code onto a narrow build, you'll have
issues as seen below.
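
To make the list-of-strings idea concrete, here is a minimal sketch of such a buffer. The LineBuffer class and its method names are hypothetical, invented for this illustration; this is not Terry's mock Text widget, just the general shape of the approach. The point is that a keypress rebuilds one line, not the whole document.

```python
class LineBuffer:
    """Toy text buffer holding one immutable str per line (illustrative only)."""

    def __init__(self, text=""):
        # Each line is its own str object, so each can be rebuilt separately.
        self.lines = text.split("\n")

    def insert(self, row, col, s):
        """Insert s (assumed to contain no newline) at (row, col)."""
        line = self.lines[row]
        # Only this one line is rebuilt; every other line is left alone.
        self.lines[row] = line[:col] + s + line[col:]

    def text(self):
        return "\n".join(self.lines)


buf = LineBuffer("hello\nworld")
untouched = buf.lines[1]
buf.insert(0, 5, ", there")
print(buf.text())
```

After the insert, buf.lines[1] is still the very same object as before, so the cost of an edit scales with the length of the edited line rather than the length of the file.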

>> Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
>> pros and cons,
>
> They both have the pro that indexing is direct *and correct*. The cons are
> different.

They're close enough, though. It's simply a performance tradeoff - use
the memory all the time, or take a bit of overhead to give yourself
the option of using less memory. The difference is negligible compared
to...

>> and the cons of using UCS-2 (the old narrow builds) are
>> well known.  UCS-2 simply cannot represent all of unicode correctly.
>
> Python's narrow builds, at least for several releases, were in between UCS-2
> and UTF-16 in that they used surrogates to represent all unicodes but did
> not correct indexing for the presence of astral chars. This is a nuisance
> for those who do use astral chars, such as emotes and CJK name chars, on an
> everyday basis.

... this. If nobody had ever thought of doing a multi-format string
representation, I could well imagine the Python core devs debating
whether the cost of UTF-32 strings is worth the correctness and
consistency improvements... and most likely concluding that narrow
builds get abolished. And if any other language (e.g. ECMAScript)
decides to move from UTF-16 to UTF-32, I would wholeheartedly support
the move, even if it broke code to do so. To my mind, exposing UTF-16
surrogates to the application is a bug to be fixed, not a feature to
be maintained. But since we can get the best of both worlds with only
a small amount of overhead, I really don't see why anyone should be
objecting.
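
For anyone who wants to see both halves of that on a 3.3+ interpreter, here is a rough sketch. The helper names utf16_units and bytes_per_char are made up for this example, and the sys.getsizeof measurement is a CPython implementation detail, so treat the per-character figures as an assumption about CPython rather than a language guarantee.

```python
import sys


def utf16_units(s):
    """Number of UTF-16 code units s would occupy (what a narrow build's len() counted)."""
    return len(s.encode("utf-16-le")) // 2


def bytes_per_char(ch, n=1000):
    """Approximate per-character storage for a string made only of ch (CPython-specific)."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


s = "a\U0001F600b"          # one astral character between two ASCII letters

print(len(s))               # counts code points on 3.3+
print(utf16_units(s))       # one more than that: the astral char needs a surrogate pair
print(s[1] == "\U0001F600") # indexing lands on the whole character

# Storage width follows the widest character in each string.
for ch in ("a", "\u20ac", "\U0001F600"):
    print(hex(ord(ch)), bytes_per_char(ch))
```

Indexing stays correct for the astral character, while a string that contains only ASCII does not pay for the wide representation.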

ChrisA