String performance regression from python 3.2 to 3.3
Thomas 'PointedEars' Lahn
PointedEars at web.de
Sat Mar 16 00:12:44 EDT 2013
Chris Angelico wrote:
> On Sat, Mar 16, 2013 at 1:44 PM, Thomas 'PointedEars' Lahn
> <PointedEars at web.de> wrote:
>> Chris Angelico wrote:
>>> The ECMAScript spec says that strings are stored and represented in
>>> UTF-16.
>>
>> No, it does not (which Edition?). It says in Edition 5.1:
>
> Okay, I was sloppy in my terminology. A language will seldom, if ever,
> specify the actual storage. But it does specify a representation (to
> the script) of UTF-16,
No, it does not.
> and I seriously cannot imagine any reason for an implementation to store a
> string in any other way, given that string indexing is specifically based
> on UTF-16:
Non sequitur.
>> | The length of a String is the number of elements (i.e., 16-bit values)
>> | within it.
>> |
>> | […]
>> | When a String contains actual textual data, each element is considered
>> | to
>> | be a single UTF-16 code unit. Whether or not this is the actual
>> | storage format of a String, the characters within a String are numbered
>> | by their initial code unit element position as though they were
>> | represented using UTF-16.
>
> So, yes, it could be stored in some other way, but in terms of what I
> was saying (comparing against Python 3.2 and 3.3), it's still a
> specification that doesn't allow for the change that Python did.
Yes, it does. You must not have read or understood what I quoted.
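For illustration, the distinction at issue can be shown from Python itself; a minimal sketch (Python 3.3+, where str is stored per PEP 393 and len() counts code points, not UTF-16 code units):

```python
# len() counts Unicode code points; an ECMAScript-style "length"
# counts 16-bit UTF-16 code units instead.
s = "a\U0001D11E"  # 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF (astral)

assert len(s) == 2  # Python 3.3+: two code points

# Number of UTF-16 code units: encode and count 16-bit pairs.
utf16_units = len(s.encode("utf-16-le")) // 2
assert utf16_units == 3  # 'a' is one unit; U+1D11E is a surrogate pair
```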
>>> You can see the same thing in Javascript too. Here's a little demo I
>>> just knocked together:
>>>
>>> <script>
>>> function foo()
>>> {
>>>   var txt = document.getElementById("in").value;
>>>   var msg = "";
>>>   for (var i = 0; i < txt.length; ++i)
>>>     msg += "[" + i + "]: " + txt.charCodeAt(i) + " "
>>>            + txt.charCodeAt(i).toString(16) + "\n";
>>>   document.getElementById("out").value = msg;
>>> }
>>> </script>
>>> <input id=in><input type=button onclick="foo()" value="Show"><br>
>>> <textarea id=out rows=25 cols=80></textarea>
>>
>> What an awful piece of code.
>
> Ehh, it's designed to be short, not beautiful. Got any serious
> criticisms of it?
Better not here, lest another “moron” complain.
> It demonstrates what I'm talking about without being a page of code.
It could have been written readably and efficiently without that.
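For comparison, here is a rough Python analogue of that demo (my sketch, not code from the thread): it enumerates 16-bit UTF-16 code units the way charCodeAt() does, one unit per index, with astral characters showing up as surrogate pairs.

```python
def utf16_units(text):
    """Yield (index, unit) pairs mirroring JavaScript's charCodeAt():
    one 16-bit code unit per index; characters beyond the BMP appear
    as two indices (a surrogate pair)."""
    data = text.encode("utf-16-le")
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        yield i // 2, unit

# "a" is one unit; U+1D11E becomes the surrogate pair D834 DD1E.
for i, unit in utf16_units("a\U0001D11E"):
    print("[%d]: %d %x" % (i, unit, unit))
```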
>>> Give it an ASCII string
>>
>> You mean a string of Unicode characters that can also be represented with
>> the US-ASCII encoding. There are no "ASCII strings" in conforming
>> ECMAScript implementations. And a string of Unicode characters with code
>> points within the BMP will suffice already.
>
> You can get a string of ASCII characters and paste them into the entry
> field.
Not likely these days, no.
> They'll be turned into Unicode characters before the script
> sees them.
They will have become Windows-1252 or even Unicode characters long before.
> But yes, okay, my terminology was a bit sloppy.
It still is.
>>> and you'll see, as expected, one index (based on string indexing or
>>> charCodeAt, same thing) for each character. Same if it's all BMP. But
>>> put an astral character in and you'll see 00.00.d8.00/24 (oh wait, CIDR
>>> notation doesn't work in Unicode) come up. I raised this issue on the
>>> Google V8 list and on the ECMAScript list es-discuss at mozilla.org, and
>>> was basically told that since JavaScript has been buggy for so long,
>>> there's no chance of ever making it bug-free:
>>>
>>> https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html
>>
>> You misunderstand, and I am not buying Rick's answer. The problem is not
>> that String values are defined as units of 16 bits. The problem is that
>> the length of a primitive String value in ECMAScript, and the position of
>> a character, is defined in terms of 16-bit units instead of characters.
>> There is no bug, because ECMAScript specifies that Unicode characters
>> beyond the Basic Multilingual Plane (BMP) need not be supported:
>
> So what you're saying is that an ES implementation is allowed to be
> even buggier than I described, and that's somehow a justification?
No, I am saying that you have no clue what you are talking about.
>> But yes, there should be native support for Unicode characters with code
>> points beyond the BMP, and evidently that does _not_ require a second
>> language; just a few tweaks to the algorithms.
>
> No, it requires either a complete change of the language, […]
No, it does not. Get yourself informed.
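To make “a few tweaks to the algorithms” concrete, a surrogate-aware length is a one-pass change; a minimal sketch (mine, not from the specification) that counts code points in a sequence of UTF-16 code units:

```python
def code_point_length(units):
    """Count Unicode code points in a sequence of 16-bit UTF-16 code
    units: every unit starts a character except low (trailing)
    surrogates, which complete the preceding high surrogate."""
    count = 0
    for u in units:
        if not (0xDC00 <= u <= 0xDFFF):  # skip low surrogates
            count += 1
    return count

# "a" + U+1D11E as UTF-16 code units: three units, two characters.
assert code_point_length([0x0061, 0xD834, 0xDD1E]) == 2
```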
>>> Can't do that with web browsers.)
>>
>> Yes, you could. It has been done before.
>
> Not easily.
You still have no clue what you are talking about. Get yourself informed at
least about the (deprecated/obsolete) “language” and the (standards-
compliant) “type” attribute of SCRIPT/“script” elements before you post on
this again.
--
PointedEars