String performance regression from python 3.2 to 3.3

Sat Mar 16 00:09:56 EDT 2013

On Sat, Mar 16, 2013 at 2:56 PM, Mark Lawrence <breamoreboy at yahoo.co.uk> wrote:
> On 16/03/2013 02:44, Thomas 'PointedEars' Lahn wrote:
>>
>> Chris Angelico wrote:
>>
>
> Thomas and Chris, would the two of you be kind enough to explain to morons
> such as myself how all the ECMAScript stuff relates to Python's unicode as
> implemented via PEP 393 as you've lost me, easily done I know.

Sure. Here's the brief version: It's all about how a string is exposed
to a script.

* Python 3.2 Narrow gives you UTF-16. Non-BMP characters count twice.
* Python 3.2 Wide gives you UTF-32. Each character counts once.
* Python 3.3 gives you UTF-32 semantics (each character counts once), but stores each string as compactly as its contents allow.
* ECMAScript specifies the semantics of Python 3.2 Narrow.
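
To make the list above concrete, here is a rough sketch to run on Python 3.3 or later. The helper names utf16_length and bytes_per_char are mine, made up for illustration, and the storage measurement leans on sys.getsizeof, so treat that part as a CPython-specific assumption rather than anything the language promises.

```python
import sys

def utf16_length(s):
    """Count UTF-16 code units: the length a 3.2 narrow build or ECMAScript would report."""
    return len(s.encode("utf-16-le")) // 2

def bytes_per_char(ch, n=1000):
    """Estimate the per-character storage cost of a string made only of `ch`.

    Assumption: CPython 3.3+ with PEP 393 storage; other implementations may differ.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# One BMP character followed by one non-BMP character (U+1F600).
sample = "a\U0001F600"

print(len(sample))           # code points, as Python 3.3 and 3.2 wide count them
print(utf16_length(sample))  # code units, as 3.2 narrow and ECMAScript count them

# Storage grows with the widest character present in the string.
for ch in ("a", "\u20ac", "\U0001F600"):
    print(hex(ord(ch)), bytes_per_char(ch))
```

The non-BMP character is one character to Python 3.3 but two code units in UTF-16, which is the whole difference between the first and last bullets.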

Python 3.2 was either buggy or inefficient. (Generally, Windows builds
were buggy and Linux builds were inefficient, but you could pick at
compilation time.) String indexing followed obvious rules, as long as
everything fitted inside UCS-2, or you paid the
four-bytes-per-character price of a wide build. Otherwise, stuff went
off-kilter. PEP 393 fixed the matter, and the arguments were about
implementation, efficiency, and so on - but (as far as I know) nobody
ever argued that the semantics of UTF-16 strings should be kept.
That's the difference with ES - that behaviour, peculiar though it be,
is actually mandated by the spec. I have banged my head against it at
work (amazingly, PHP's complete lack of native Unicode support is
actually easier to work with there - though mainly I just throw the
stuff at PostgreSQL, which will throw an error back if anything's
wrong); it's an insane mandate. But it's part of the spec, and it
can't be changed now.
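
For anyone who wants to see the "off-kilter" part, here is a small sketch that fakes the UTF-16 view on a modern Python. The function utf16_units is a hypothetical helper of mine, not anything from the standard library or from ECMAScript; it only approximates what a narrow build or an ES engine would show you when indexing.

```python
import struct

def utf16_units(s):
    """Split a string into UTF-16 code units (hypothetical helper for illustration)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

s = "x\U0001F600y"
units = utf16_units(s)

# Code-point indexing (Python 3.3+): three items, and index 2 is "y".
print(len(s), repr(s[2]))

# UTF-16 indexing: four items. Indices 1 and 2 are the two halves of a
# surrogate pair, and "y" has been pushed along to index 3.
print(len(units), [hex(u) for u in units])
```

Slice that UTF-16 view between indices 1 and 2 and you end up with half a character, which is the kind of thing the ES spec obliges implementations to let you do.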

ChrisA