String performance regression from Python 3.2 to 3.3

Wed Mar 13 20:55:47 EDT 2013

On Thu, Mar 14, 2013 at 11:52 AM, MRAB <python at mrabarnett.plus.com> wrote:
> On 13/03/2013 23:43, Chris Angelico wrote:
>>
>> On Thu, Mar 14, 2013 at 3:49 AM, rusi <rustompmody at gmail.com> wrote:
>>>
>>> On Mar 13, 3:59 pm, Chris Angelico <ros... at gmail.com> wrote:
>>>>
>>>> On Wed, Mar 13, 2013 at 9:11 PM, rusi <rustompm... at gmail.com> wrote:
>>>> > Uhhh..
>>>> > Making the subject line useful for all readers
>>>>
>>>> I should have read this one before replying in the other thread.
>>>>
>>>> jmf, I'd like to see evidence that there has been a performance
>>>> regression compared against a wide build of Python 3.2. You still have
>>>> never answered this fundamental point, that the narrow builds of Python are
>>>> *BUGGY* in the same way that JavaScript/ECMAScript is. And believe you
>>>> me, the utterly unnecessary hassles I have had to deal with when
>>>> permitting user-provided .js code to script my engine have wasted
>>>> rather more dev hours than you would believe - there are rather a lot
>>>> of stupid edge cases to deal with.
>>>
>>>
>>> This assumes that there are only three choices:
>>> - narrow build that is buggy (surrogate pairs for astral characters)
>>> - wide build that is 4-fold space inefficient for a wide variety of
>>> common (ASCII) use-cases
>>> - flexible string engine that chooses a small tradeoff of space
>>> efficiency over time efficiency.
>>>
>>> There is a fourth choice: narrow build that chooses to be partial over
>>> being buggy. i.e., when an astral character is encountered, an exception
>>> is thrown rather than trying to fudge it into a 16-bit
>>> representation.
>>
>>
>> As a simple factual matter, narrow builds of Python 3.2 don't do that.
>> So it doesn't factor into my original statement. But if you're talking
>> about a proposal for 3.4, then sure, that's a theoretical possibility.
>> It wouldn't be "buggy" in the sense of "string indexing/slicing
>> unexpectedly does the wrong thing", but it would still be incomplete
>> Unicode support, and I don't think people would appreciate it. Much
>> better to have graceful degradation: if there are non-BMP characters
>> in the string, then instead of throwing an exception, it just makes
>> the string wider.
>>
> [snip]
> Do you mean that instead of switching between 1/2/4 bytes per codepoint
> it would switch between 2/4 bytes per codepoint?

That's my point. We already have the better version. :)
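
For anyone who wants to check the 1/2/4 behaviour themselves, here is a
rough sketch. It assumes CPython 3.3 or later (the flexible string
representation is an implementation detail, so other implementations
may report different sizes). The helper names bytes_per_code_point and
utf16_code_units are made up for this illustration; the second one only
simulates what a narrow (UTF-16) build would have counted, it does not
run on such a build.

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point for strings made only of `ch`.

    Comparing two lengths cancels out the fixed object header, so the
    difference reflects only the per-character storage width.
    Assumes CPython 3.3+ (flexible string representation).
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


def utf16_code_units(s):
    """Count UTF-16 code units, i.e. the len() a narrow build would report."""
    return len(s.encode("utf-16-le")) // 2


if __name__ == "__main__":
    samples = [
        ("ASCII", "a"),
        ("BMP", "\u20ac"),         # EURO SIGN
        ("astral", "\U0001F600"),  # GRINNING FACE, outside the BMP
    ]
    for label, ch in samples:
        print(label,
              "bytes/code point:", bytes_per_code_point(ch),
              "len():", len(ch),
              "UTF-16 units:", utf16_code_units(ch))
```

On a 3.3+ CPython the widths should come out as 1, 2 and 4, with len()
equal to 1 in every case; the astral character is the only one that
needs two UTF-16 code units, which is where the narrow-build
indexing/slicing surprises came from.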

ChrisA