String performance regression from python 3.2 to 3.3

Wed Mar 13 20:19:11 EDT 2013

On Thu, Mar 14, 2013 at 4:42 AM, Thomas 'PointedEars' Lahn
<PointedEars at web.de> wrote:
> Chris Angelico wrote:
>
>> On Wed, Mar 13, 2013 at 9:11 PM, rusi <rustompmody at gmail.com> wrote:
>>> Uhhh..
>>> Making the subject line useful for all readers
>>
>> I should have read this one before replying in the other thread.
>>
>> jmf, I'd like to see evidence that there has been a performance
>> regression compared against a wide build of Python 3.2. You still have
>> never answered this fundamental, that the narrow builds of Python are
>> *BUGGY* in the same way that JavaScript/ECMAScript is.
>
> Interesting.  From my work I was under the impression that I knew ECMAScript
> and its implementations fairly well, yet I have never heard of this before.
>
> What do you mean by “narrow build” and “wide build” and what exactly is the
> bug “narrow builds” of Python 3.2 have in common with JavaScript/ECMAScript?
> To which implementation of ECMAScript are you referring – or are you
> referring to the Specification as such?

The ECMAScript spec defines strings as sequences of 16-bit code units,
which in practice means UTF-16. Python versions up to 3.2 came in two
varieties: narrow builds (16-bit code units, so astral characters are
stored as surrogate pairs), which included (I believe) the Windows
builds available on python.org, and wide builds (32-bit code units),
which were (again, I think) the default Linux config. The problem
predates Python 3 and its default string being Unicode - the Py2
unicode type has the same issue:

Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on win32
>>> u"\U00012345"
u'\U00012345'
>>> len(_)
2

Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\U00012345"
u'\U00012345'
>>> len(_)
1

That's the Python msi installer on Windows, and the default system
Python from Ubuntu 10.10. The exact same code does different things on
different platforms, and on the Windows (narrow) build it's possible
to split a surrogate pair by indexing:

>>> u"\U00012345"[0]
u'\ud808'
>>> u"\U00012345"[1]
u'\udf45'
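
For anyone wondering where those two values come from, here is a
minimal sketch of the standard UTF-16 surrogate arithmetic. The
function name to_surrogates is made up for illustration; it is not
part of Python or of any library.

```python
def to_surrogates(cp):
    """Split an astral code point (above U+FFFF) into a UTF-16 surrogate pair.

    Hypothetical helper, written only to illustrate the arithmetic.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = cp - 0x10000            # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogates(0x12345)
print(hex(high), hex(low))  # 0xd808 0xdf45
```

The result matches what the narrow build hands back at indices 0 and 1.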

You can see the same thing in Javascript too. Here's a little demo I
just knocked together:

<script>
function foo()
{
	var txt=document.getElementById("in").value;
	var msg="";
	for (var i=0;i<txt.length;++i) msg+="["+i+"]: "+txt.charCodeAt(i)+" "+txt.charCodeAt(i).toString(16)+"\n";
	document.getElementById("out").value=msg;
}
</script>
<input id=in><input type=button onclick="foo()"
value="Show"><br><textarea id=out rows=25 cols=80></textarea>

Give it an ASCII string and you'll see, as expected, one index (based
on string indexing or charCodeAt, same thing) for each character. Same
if it's all BMP. But put an astral character in and you'll see it take
up two indices, both surrogates in the 0xD800-0xDFFF range
(00.00.d8.00/21, oh wait, CIDR notation doesn't work in Unicode). I
raised this issue on the Google V8 list and on the ECMAScript
list es-discuss at mozilla.org, and was basically told that since
JavaScript has been buggy for so long, there's no chance of ever
making it bug-free:

https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html

Fortunately for Python, there are version numbers, and policies that
permit bugs to actually get fixed. (Which is why, for instance, Debian
Squeeze still ships Python 2.6 rather than upgrading to 2.7 - in case
some script is broken by that change. Can't do that with web
browsers.) As of Python 3.3, all Pythons function the same way: it's
semantically a "wide build" (UTF-32), but with a memory usage
optimization. That's how it needs to be.
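
A quick way to check this on Python 3.3 or later is sketched below.
The first part should hold on any platform. The sys.getsizeof
comparison at the end is an assumption about CPython's flexible string
storage (PEP 393), and the exact byte counts will vary between
versions and builds.

```python
import sys

s = "\U00012345"

# One code point is one string element, regardless of platform (3.3+).
code_points = len(s)

# The same character still needs two 16-bit units once encoded as UTF-16.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units, hex(sys.maxunicode))  # 1 2 0x10ffff

# CPython-specific: storage per character depends on the widest code
# point in the string (ASCII, BMP, astral), so these sizes should grow.
sizes = [sys.getsizeof(ch * 100) for ch in ("a", "\u0394", "\U00012345")]
print(sizes)
```

So the semantics are always "wide", and only strings that actually
contain astral characters pay the four-bytes-per-character cost.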

ChrisA