String performance regression from Python 3.2 to 3.3

Wed Mar 13 22:01:35 EDT 2013

On 14/03/2013 00:55, Chris Angelico wrote:
> On Thu, Mar 14, 2013 at 11:52 AM, MRAB <python at mrabarnett.plus.com> wrote:
>> On 13/03/2013 23:43, Chris Angelico wrote:
>>>
>>> On Thu, Mar 14, 2013 at 3:49 AM, rusi <rustompmody at gmail.com> wrote:
>>>>
>>>> On Mar 13, 3:59 pm, Chris Angelico <ros... at gmail.com> wrote:
>>>>>
>>>>> On Wed, Mar 13, 2013 at 9:11 PM, rusi <rustompm... at gmail.com> wrote:
>>>>> > Uhhh..
>>>>> > Making the subject line useful for all readers
>>>>>
>>>>> I should have read this one before replying in the other thread.
>>>>>
>>>>> jmf, I'd like to see evidence that there has been a performance
>>>>> regression compared against a wide build of Python 3.2. You still have
>>>>> never answered this fundamental point, that the narrow builds of Python are
>>>>> *BUGGY* in the same way that JavaScript/ECMAScript is. And believe you
>>>>> me, the utterly unnecessary hassles I have had to deal with when
>>>>> permitting user-provided .js code to script my engine have wasted
>>>>> rather more dev hours than you would believe - there are rather a lot
>>>>> of stupid edge cases to deal with.
>>>>
>>>>
>>>> This assumes that there are only three choices:
>>>> - narrow build that is buggy (surrogate pairs for astral characters)
>>>> - wide build that is 4-fold space inefficient for a wide variety of
>>>> common (ASCII) use-cases
>>>> - flexible string engine that chooses a small tradeoff of space
>>>> efficiency over time efficiency.
>>>>
>>>> There is a fourth choice: narrow build that chooses to be partial over
>>>> being buggy, i.e. when an astral character is encountered, an exception
>>>> is thrown rather than trying to fudge it into a 16-bit
>>>> representation.
>>>
>>>
>>> As a simple factual matter, narrow builds of Python 3.2 don't do that.
>>> So it doesn't factor into my original statement. But if you're talking
>>> about a proposal for 3.4, then sure, that's a theoretical possibility.
>>> It wouldn't be "buggy" in the sense of "string indexing/slicing
>>> unexpectedly does the wrong thing", but it would still be incomplete
>>> Unicode support, and I don't think people would appreciate it. Much
>>> better to have graceful degradation: if there are non-BMP characters
>>> in the string, then instead of throwing an exception, it just makes
>>> the string wider.
>>>
>> [snip]
>> Do you mean that instead of switching between 1/2/4 bytes per codepoint
>> it would switch between 2/4 bytes per codepoint?
>
> That's my point. We already have the better version. :)
>
If a later version of Python switched between 2/4 bytes per codepoint,
how much difference would it make in terms of memory and speed compared
to Python 3.2 (fixed width) and Python 3.3 (3 widths)?

The vast majority of the time, 2 bytes per codepoint is sufficient, but
would that result in less switching between widths and therefore higher
performance, or would the use of more memory (2 bytes when 1 byte would
do) offset that?

(And I'm talking about significant differences here.)
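
For the memory half of the question, here is a rough sketch of how the per-codepoint storage could be checked on a CPython 3.3+ interpreter. The helper name bytes_per_codepoint and the sample characters are my own choices for illustration, and sys.getsizeof is CPython-specific, so treat this as an estimate under those assumptions rather than a benchmark. It says nothing about speed.

```python
import sys

def bytes_per_codepoint(ch, n=1000):
    """Estimate storage per code point for a string made only of `ch`.

    Hypothetical helper: differencing the sizes of two strings cancels
    the fixed object header, leaving only the per-character cost.
    Relies on sys.getsizeof, so the result is CPython-specific.
    """
    short = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) / n

# One sample character from each range relevant to the 1/2/4 scheme.
samples = {
    "ascii": "a",             # U+0061
    "latin-1": "\xe9",        # U+00E9
    "bmp": "\u20ac",          # U+20AC, needs 2 bytes
    "astral": "\U0001F600",   # U+1F600, outside the BMP
}

widths = {name: bytes_per_codepoint(ch) for name, ch in samples.items()}

for name, width in widths.items():
    # A hypothetical 2/4 scheme would never go below 2 bytes per codepoint.
    print(name, width, "-> under a 2/4 scheme:", max(2.0, width))
```

On a flexible-representation build, the ASCII and Latin-1 rows are where a 2/4 scheme would cost extra memory; the BMP and astral rows would be unchanged.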