String performance regression from python 3.2 to 3.3

Chris Angelico rosuav at gmail.com
Fri Mar 15 23:59:47 EDT 2013


On Sat, Mar 16, 2013 at 1:44 PM, Thomas 'PointedEars' Lahn
<PointedEars at web.de> wrote:
> Chris Angelico wrote:
>> The ECMAScript spec says that strings are stored and represented in
>> UTF-16.
>
> No, it does not (which Edition?).  It says in Edition 5.1:

Okay, I was sloppy in my terminology. A language will seldom, if ever,
specify the actual storage. But it does specify a representation (to
the script) of UTF-16, and I seriously cannot imagine any reason for
an implementation to store a string in any other way, given that
string indexing is specifically based on UTF-16:

> | The length of a
> | String is the number of elements (i.e., 16-bit values) within it.
> |
> | […]
> | When a String contains actual textual data, each element is considered to
> | be a single UTF-16 code unit.  Whether or not this is the actual storage
> | format of a String, the characters within a String are numbered by
> | their initial code unit element position as though they were represented
> | using UTF-16.

So, yes, it could be stored in some other way, but in terms of what I
was saying (comparing against Python 3.2 and 3.3), it's still a
specification that doesn't allow for the change that Python did. If
narrow builds are all you compare against (as jmf does), then Python
3.2 is exactly like ECMAScript, and Python 3.3 isn't.
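
To make the difference concrete, here's a rough sketch in Python (the
narrow-build comments assume a narrow 3.2 build; a wide 3.2 build
already behaves like 3.3 here):

# one non-BMP ("astral") character, U+1F600
s = "\U0001F600"
len(s)   # narrow 3.2 build: 2 (UTF-16 code units, just like ECMAScript)
         # 3.3, any build:   1 (code points)
s[0]     # narrow 3.2 build: '\ud83d', a lone lead surrogate
         # 3.3:              '\U0001F600', the whole character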

>> You can see the same thing in Javascript too. Here's a little demo I
>> just knocked together:
>>
>> <script>
>> function foo()
>> {
>> var txt=document.getElementById("in").value;
>> var msg="";
>> for (var i=0;i<txt.length;++i)
>>   msg+="["+i+"]: "+txt.charCodeAt(i)+" "+txt.charCodeAt(i).toString(16)+"\n";
>> document.getElementById("out").value=msg;
>> }
>> </script>
>> <input id=in><input type=button onclick="foo()"
>> value="Show"><br><textarea id=out rows=25 cols=80></textarea>
>
> What an awful piece of code.

Ehh, it's designed to be short, not beautiful. Got any serious
criticisms of it? It demonstrates what I'm talking about without being
a page of code.

>> Give it an ASCII string
>
> You mean a string of Unicode characters that can also be represented with
> the US-ASCII encoding.  There are no "ASCII strings" in conforming
> ECMAScript implementations.  And a string of Unicode characters with code
> points within the BMP will suffice already.

You can get a string of ASCII characters and paste them into the entry
field. They'll be turned into Unicode characters before the script
sees them. But yes, okay, my terminology was a bit sloppy.

>> and you'll see, as expected, one index (based on string indexing or
>> charCodeAt, same thing) for each character. Same if it's all BMP. But put
>> an astral character in and you'll see 00.00.d8.00/24 (oh wait, CIDR
>> notation doesn't work in Unicode) come up. I raised this issue on the
>> Google V8 list and on the ECMAScript list es-discuss at mozilla.org, and was
>> basically told that since JavaScript has been buggy for so long, there's
>> no chance of ever making it bug-free:
>>
>> https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html
>
> You misunderstand, and I am not buying Rick's answer.  The problem is not
> that String values are defined as units of 16 bits.  The problem is that the
> length of a primitive String value in ECMAScript, and the position of a
> character, is defined in terms of 16-bit units instead of characters.  There
> is no bug, because ECMAScript specifies that Unicode characters beyond the
> Basic Multilingual Plane (BMP) need not be supported:

So what you're saying is that an ES implementation is allowed to be
even buggier than I described, and that's somehow a justification?

> People have found ways to make this work in ECMAScript implementations.  For
> example, it is possible to scan a normalized string for lead surrogates:

And it's possible to write a fully conforming Unicode handler in C,
using char[] and relying on (say) UTF-8 encoding. That has nothing to
do with the language actually providing facilities.
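
That said, the kind of user-space scan you describe is easy enough to
sketch (here in Python for brevity, treating the UTF-16 code units as
plain integers), which is rather my point: it has to live outside the
language:

def count_code_points(units):
    # every lead surrogate (0xD800-0xDBFF) marks a pair that is
    # a single code point, so subtract one unit per lead surrogate
    leads = sum(1 for u in units if 0xD800 <= u <= 0xDBFF)
    return len(units) - leads

count_code_points([0x0061, 0xD83D, 0xDE00])   # "a" + U+1F600 -> 2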

> But yes, there should be native support for Unicode characters with code
> points beyond the BMP, and evidently that does _not_ require a second
> language; just a few tweaks to the algorithms.

No, it requires either a complete change to the language, or the
acceptance that O(1) operations can become O(n) in the length of the
string (if the string is left in UTF-16 but indexed by code point),
or the creation of a new user-space data type (which then has to be
converted every time it's passed to a standard library function).
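
To put the O(n) point concretely: if strings stay in UTF-16 but are
indexed by code point, every index lookup has to walk the string past
any surrogate pairs in front of it. A rough sketch (Python again,
hypothetical helper):

def code_unit_offset(units, cp_index):
    # translate a code point index into a UTF-16 code unit offset;
    # O(n) because we must scan from the start of the string
    offset = 0
    for _ in range(cp_index):
        # a lead surrogate means this code point takes two units
        offset += 2 if 0xD800 <= units[offset] <= 0xDBFF else 1
    return offset

units = [0x0061, 0xD83D, 0xDE00, 0x0062]   # "a" + U+1F600 + "b"
code_unit_offset(units, 2)                 # -> 3, where the "b" lives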

>> Fortunately for Python, there are version numbers, and policies that
>> permit bugs to actually get fixed. (Which is why, for instance, Debian
>> Squeeze still ships Python 2.6 rather than upgrading to 2.7 - in case
>> some script is broken by that change.
>
> Debian already ships Python 3.1 in Stable, disproving your argument.

Separate branch. Debian stable ships one from each branch; Debian
unstable does, too (2.7.3 and 3.2.3). Same argument applies to each,
though - even Debian unstable hasn't yet introduced Python 3.3, in
case it breaks stuff. Argument not disproved.

>> Can't do that with web browsers.)
>
> Yes, you could.  It has been done before.

Not easily. Assuming you can't make one a perfect super/subset of the
other (as with "use strict"), it needs to be done as a completely
separate language. Now, maybe it's time the <script> tag got versioned
(again? what happened to language="javascript1.2"?), but the normal
way for scripts to be put onto a page doesn't allow version tagging,
and especially, embedding code into other attributes doesn't make that
easy:

<button onclick="foo()">Which version?</button>

>> As of Python 3.3, all Pythons function the same way: it's
>> semantically a "wide build" (UTF-32), but with a memory usage
>> optimization. That's how it needs to be.
>
> It is _not_ necessary to use the memory-expensive UTF-32 or a memory-cheaper
> mixed encoding to represent characters beyond the BMP.  UTF-32 would be more
> runtime-efficient than any other encoding for such strings, though, because
> you could divide by 32 for the length and would not have to find lead
> surrogates to determine a character's position.

Of course it's not the only way to represent all of Unicode. But when
you provide string indexing (charCodeAt), programmers will assume it
is cheap, and will casually index strings (from both ends and maybe
the middle too). A system in which string indexing isn't O(1) is going
to perform poorly on many common string operations, so its programmers
would be forced to learn its foibles or pay the cost.
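
Python 3.3 avoids that trade-off: len() and indexing work in code
points in O(1), and only the per-string storage width changes. A quick
illustration (exact byte counts vary by platform and version, so just
compare the relative sizes):

import sys

ascii_s  = "a" * 100            # stored 1 byte per character
bmp_s    = "\u0101" * 100       # stored 2 bytes per character
astral_s = "\U0001F600" * 100   # stored 4 bytes per character

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))   # len is 100 for all three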

ChrisA


