String performance regression from python 3.2 to 3.3

Wed Mar 13 20:52:08 EDT 2013

On 13/03/2013 23:43, Chris Angelico wrote:
> On Thu, Mar 14, 2013 at 3:49 AM, rusi <rustompmody at gmail.com> wrote:
>> On Mar 13, 3:59 pm, Chris Angelico <ros... at gmail.com> wrote:
>>> On Wed, Mar 13, 2013 at 9:11 PM, rusi <rustompm... at gmail.com> wrote:
>>> > Uhhh..
>>> > Making the subject line useful for all readers
>>>
>>> I should have read this one before replying in the other thread.
>>>
>>> jmf, I'd like to see evidence that there has been a performance
>>> regression compared against a wide build of Python 3.2. You still have
>>> never answered this fundamental point, that the narrow builds of Python are
>>> *BUGGY* in the same way that JavaScript/ECMAScript is. And believe you
>>> me, the utterly unnecessary hassles I have had to deal with when
>>> permitting user-provided .js code to script my engine have wasted
>>> rather more dev hours than you would believe - there are rather a lot
>>> of stupid edge cases to deal with.
>>
>> This assumes that there are only three choices:
>> - narrow build that is buggy (surrogate pairs for astral characters)
>> - wide build that is 4-fold space inefficient for a wide variety of
>> common (ASCII) use-cases
>> - flexible string engine that chooses a small tradeoff of space
>> efficiency over time efficiency.
>>
>> There is a fourth choice: narrow build that chooses to be partial over
>> being buggy. i.e., when an astral character is encountered, an exception
>> is thrown rather than trying to fudge it into a 16-bit
>> representation.
>
> As a simple factual matter, narrow builds of Python 3.2 don't do that.
> So it doesn't factor into my original statement. But if you're talking
> about a proposal for 3.4, then sure, that's a theoretical possibility.
> It wouldn't be "buggy" in the sense of "string indexing/slicing
> unexpectedly does the wrong thing", but it would still be incomplete
> Unicode support, and I don't think people would appreciate it. Much
> better to have graceful degradation: if there are non-BMP characters
> in the string, then instead of throwing an exception, it just makes
> the string wider.
>
[snip]
Do you mean that instead of switching between 1/2/4 bytes per codepoint
it would switch between 2/4 bytes per codepoint?
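
For reference, the 1/2/4 switching that 3.3 already does can be observed
with a rough sketch like the one below. It assumes CPython 3.3 or later
(the flexible string representation), and the helper name
bytes_per_codepoint is my own, not anything from the standard library.
sys.getsizeof reports implementation-specific sizes, so treat the numbers
as an illustration rather than a guarantee.

```python
import sys

def bytes_per_codepoint(ch, n=1000):
    """Estimate storage per code point for a string made only of `ch`.

    Hypothetical helper for illustration; assumes CPython 3.3+.
    """
    # Subtracting the sizes of two strings of the same kind cancels the
    # fixed object header, leaving only the per-code-point cost.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# One ASCII, one BMP (non-Latin-1) and one astral character.
for ch in ("a", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_codepoint(ch))
```

On a 3.3 build I would expect this to report 1, 2 and 4 bytes
respectively. A scheme that only switched between 2 and 4 would
presumably report 2 for the first case as well.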