String performance regression from python 3.2 to 3.3
Thomas 'PointedEars' Lahn
PointedEars at web.de
Sat Mar 16 00:12:44 EDT 2013
Chris Angelico wrote:
> On Sat, Mar 16, 2013 at 1:44 PM, Thomas 'PointedEars' Lahn
> <PointedEars at web.de> wrote:
>> Chris Angelico wrote:
>>> The ECMAScript spec says that strings are stored and represented in
>>> UTF-16.
>>
>> No, it does not (which Edition?). It says in Edition 5.1:
>
> Okay, I was sloppy in my terminology. A language will seldom, if ever,
> specify the actual storage. But it does specify a representation (to
> the script) of UTF-16,
No, it does not.
> and I seriously cannot imagine any reason for an implementation to store a
> string in any other way, given that string indexing is specifically based
> on UTF-16:
Non sequitur.
>> | The length of a String is the number of elements (i.e., 16-bit values)
>> | within it.
>> |
>> | […]
>> | When a String contains actual textual data, each element is considered
>> | to
>> | be a single UTF-16 code unit. Whether or not this is the actual
>> | storage format of a String, the characters within a String are numbered
>> | by their initial code unit element position as though they were
>> | represented using UTF-16.
>
> So, yes, it could be stored in some other way, but in terms of what I
> was saying (comparing against Python 3.2 and 3.3), it's still a
> specification that doesn't allow for the change that Python did.
Yes, it does. You must not have read or understood what I quoted.
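For illustration, the distinction at issue can be shown from Python itself; a minimal sketch (Python 3.3+, where str is stored per PEP 393 and len() counts code points, not UTF-16 code units):

```python
# len() counts Unicode code points; an ECMAScript-style "length"
# counts 16-bit UTF-16 code units instead.
s = "a\U0001D11E"  # 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF (astral)

assert len(s) == 2  # Python 3.3+: two code points

# Number of UTF-16 code units: encode and count 16-bit pairs.
utf16_units = len(s.encode("utf-16-le")) // 2
assert utf16_units == 3  # 'a' is one unit; U+1D11E is a surrogate pair
```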
>>> You can see the same thing in Javascript too. Here's a little demo I
>>> just knocked together:
>>>
>>> <script>
>>> function foo()
>>> {
>>>   var txt = document.getElementById("in").value;
>>>   var msg = "";
>>>   for (var i = 0; i < txt.length; ++i)
>>>     msg += "[" + i + "]: " + txt.charCodeAt(i) + " "
>>>            + txt.charCodeAt(i).toString(16) + "\n";
>>>   document.getElementById("out").value = msg;
>>> }
>>> </script>
>>> <input id=in><input type=button onclick="foo()" value="Show"><br>
>>> <textarea id=out rows=25 cols=80></textarea>
>>
>> What an awful piece of code.
>
> Ehh, it's designed to be short, not beautiful. Got any serious
> criticisms of it?
Better not here, lest another “moron” complain.
> It demonstrates what I'm talking about without being a page of code.
It could have been written readably and efficiently without that.
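For comparison, here is a rough Python analogue of that demo (my sketch, not code from the thread): it enumerates 16-bit UTF-16 code units the way charCodeAt() does, one unit per index, with astral characters showing up as surrogate pairs.

```python
def utf16_units(text):
    """Yield (index, unit) pairs mirroring JavaScript's charCodeAt():
    one 16-bit code unit per index; characters beyond the BMP appear
    as two indices (a surrogate pair)."""
    data = text.encode("utf-16-le")
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        yield i // 2, unit

# "a" is one unit; U+1D11E becomes the surrogate pair D834 DD1E.
for i, unit in utf16_units("a\U0001D11E"):
    print("[%d]: %d %x" % (i, unit, unit))
```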
>>> Give it an ASCII string
>>
>> You mean a string of Unicode characters that can also be represented with
>> the US-ASCII encoding. There are no "ASCII strings" in conforming
>> ECMAScript implementations. And a string of Unicode characters with code
>> points within the BMP will suffice already.
>
> You can get a string of ASCII characters and paste them into the entry
> field.
Not likely these days, no.
> They'll be turned into Unicode characters before the script
> sees them.
They will have become Windows-1252 or even Unicode characters long before.
> But yes, okay, my terminology was a bit sloppy.
It still is.
>>> and you'll see, as expected, one index (based on string indexing or
>>> charCodeAt, same thing) for each character. Same if it's all BMP. But
>>> put an astral character in and you'll see 00.00.d8.00/24 (oh wait, CIDR
>>> notation doesn't work in Unicode) come up. I raised this issue on the
>>> Google V8 list and on the ECMAScript list es-discuss at mozilla.org, and
>>> was basically told that since JavaScript has been buggy for so long,
>>> there's no chance of ever making it bug-free:
>>>
>>> https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html
>>
>> You misunderstand, and I am not buying Rick's answer. The problem is not
>> that String values are defined as units of 16 bits. The problem is that
>> the length of a primitive String value in ECMAScript, and the position of
>> a character, is defined in terms of 16-bit units instead of characters.
>> There is no bug, because ECMAScript specifies that Unicode characters
>> beyond the Basic Multilingual Plane (BMP) need not be supported:
>
> So what you're saying is that an ES implementation is allowed to be
> even buggier than I described, and that's somehow a justification?
No, I am saying that you have no clue what you are talking about.
>> But yes, there should be native support for Unicode characters with code
>> points beyond the BMP, and evidently that does _not_ require a second
>> language; just a few tweaks to the algorithms.
>
> No, it requires either a complete change of the language, […]
No, it does not. Get yourself informed.
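To make “a few tweaks to the algorithms” concrete, a surrogate-aware length is a one-pass change; a minimal sketch (mine, not from the specification) that counts code points in a sequence of UTF-16 code units:

```python
def code_point_length(units):
    """Count Unicode code points in a sequence of 16-bit UTF-16 code
    units: every unit starts a character except low (trailing)
    surrogates, which complete the preceding high surrogate."""
    count = 0
    for u in units:
        if not (0xDC00 <= u <= 0xDFFF):  # skip low surrogates
            count += 1
    return count

# "a" + U+1D11E as UTF-16 code units: three units, two characters.
assert code_point_length([0x0061, 0xD834, 0xDD1E]) == 2
```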
>>> Can't do that with web browsers.)
>>
>> Yes, you could. It has been done before.
>
> Not easily.
You still have no clue what you are talking about. Get yourself informed at
least about the (deprecated/obsolete) “language” and the (standards-
compliant) “type” attribute of SCRIPT/“script” elements before you post on
this again.
--
PointedEars