String performance regression from Python 3.2 to 3.3

Wed Mar 13 19:43:58 EDT 2013

On Thu, Mar 14, 2013 at 3:49 AM, rusi <rustompmody at gmail.com> wrote:
> On Mar 13, 3:59 pm, Chris Angelico <ros... at gmail.com> wrote:
>> On Wed, Mar 13, 2013 at 9:11 PM, rusi <rustompm... at gmail.com> wrote:
>> > Uhhh..
>> > Making the subject line useful for all readers
>>
>> I should have read this one before replying in the other thread.
>>
>> jmf, I'd like to see evidence that there has been a performance
>> regression compared against a wide build of Python 3.2. You still have
>> never answered this fundamental, that the narrow builds of Python are
>> *BUGGY* in the same way that JavaScript/ECMAScript is. And believe you
>> me, the utterly unnecessary hassles I have had to deal with when
>> permitting user-provided .js code to script my engine have wasted
>> rather more dev hours than you would believe - there are rather a lot
>> of stupid edge cases to deal with.
>
> This assumes that there are only three choices:
> - narrow build that is buggy (surrogate pairs for astral characters)
> - wide build that is 4-fold space inefficient for wide variety of
> common (ASCII) use-cases
> - flexible string engine that chooses a small tradeoff of space
> efficiency over time efficiency.
>
> There is a fourth choice: narrow build that chooses to be partial over
> being buggy. ie when an astral character is encountered, an exception
> is thrown rather than trying to fudge it into a 16-bit
> representation.

As a simple factual matter, narrow builds of Python 3.2 don't do that.
So it doesn't factor into my original statement. But if you're talking
about a proposal for 3.4, then sure, that's a theoretical possibility.
It wouldn't be "buggy" in the sense of "string indexing/slicing
unexpectedly does the wrong thing", but it would still be incomplete
Unicode support, and I don't think people would appreciate it. Much
better to have graceful degradation: if there are non-BMP characters
in the string, then instead of throwing an exception, it just makes
the string wider.
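
To make the three behaviours concrete, here is a rough sketch in pure Python (3.4 or later, since it uses max() with a default). The helper names utf16_units, require_bmp and bytes_per_char are made up for illustration and are not CPython APIs. utf16_units counts the 16-bit code units a narrow-build-style representation would need, require_bmp models the "throw an exception" policy proposed above, and bytes_per_char picks the 1, 2 or 4 byte width that a flexible representation such as the one in Python 3.3 would choose.

```python
def utf16_units(s):
    """Number of 16-bit code units a UTF-16 (narrow-build style) string needs."""
    return len(s.encode("utf-16-le")) // 2


def require_bmp(s):
    """Hypothetical 'partial' policy: reject astral characters outright."""
    for ch in s:
        if ord(ch) > 0xFFFF:
            raise ValueError("non-BMP character U+%04X not supported" % ord(ch))
    return s


def bytes_per_char(s):
    """Width a flexible representation would pick for s: 1, 2 or 4 bytes."""
    top = max(map(ord, s), default=0)
    if top <= 0xFF:
        return 1
    if top <= 0xFFFF:
        return 2
    return 4


ascii_text = "hello"
bmp_text = "h\u20acllo"          # U+20AC EURO SIGN, inside the BMP
astral_text = "h\U0001F600llo"   # U+1F600, outside the BMP

for text in (ascii_text, bmp_text, astral_text):
    # code points, UTF-16 code units, bytes per character
    print(len(text), utf16_units(text), bytes_per_char(text))
# prints:
# 5 5 1
# 5 5 2
# 5 6 4
```

The last line is the interesting one: the string has five code points but needs six UTF-16 code units, which is where narrow-build indexing and slicing go wrong. Calling require_bmp(astral_text) raises ValueError, whereas the flexible approach simply moves that one string to 4 bytes per character.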

> I am hardly a unicode expert, my impression is this: While in today's
> internationalized world, going back to ASCII is not an option, most
> actual uses of unicode stay within the BMP

That's a valid line of argument for an optimization, but not for a
hard limitation. A general-purpose language, function, system,
whatever, will need to cope with astral characters at some point; it
just won't need them *often*.

> Further if the choice is not between two python executables but
> between string-engines chosen at startup by command-line switches or
> equivalent, the price may be quite small.

It's a complexity cost, though, and people would need to know when it
would be worth giving Python that switch to change its string format.
Plus, every C extension would need to cope with both formats. I
personally doubt it'd be worth it, but if you want to knock together a
patched CPython and get some timing stats, I'm sure this list or
python-dev will be happy to discuss the matter. :)
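
For anyone who does want timing stats, a minimal sketch of a benchmark that can be run unchanged under each interpreter being compared might look like the following. The sample strings, the choice of str.replace as the operation, and the names SAMPLES, time_replace and run_benchmark are all assumptions made for illustration. A serious comparison would need many more operations and repeated runs.

```python
import sys
import timeit

# One sample per storage class: pure ASCII, BMP-only, and astral.
SAMPLES = {
    "ascii": "abc" * 1000,
    "bmp": "ab\u20ac" * 1000,
    "astral": "ab\U0001F600" * 1000,
}


def time_replace(text, number=1000):
    """Seconds taken for `number` runs of a simple replace on `text`."""
    return timeit.timeit(lambda: text.replace("a", "b"), number=number)


def run_benchmark(number=1000):
    """Time the same operation on every sample string."""
    return {name: time_replace(text, number) for name, text in SAMPLES.items()}


if __name__ == "__main__":
    print(sys.version)
    for name, seconds in sorted(run_benchmark().items()):
        print("%-7s %.6f s" % (name, seconds))
```

Absolute numbers will vary by machine and build, so only the ratios between interpreters on the same hardware are meaningful.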

ChrisA