String performance regression from python 3.2 to 3.3

Thomas 'PointedEars' Lahn PointedEars at web.de
Fri Mar 15 22:44:35 EDT 2013


Chris Angelico wrote:

> Thomas 'PointedEars' Lahn […] wrote:
>> Chris Angelico wrote:
>>> jmf, I'd like to see evidence that there has been a performance
>>> regression compared against a wide build of Python 3.2. You still have
>>> never answered this fundamental, that the narrow builds of Python are
>>> *BUGGY* in the same way that JavaScript/ECMAScript is.
>>
>> Interesting.  From my work I was under the impression that I knew
>> ECMAScript and its implementations fairly well, yet I have never heard of
>> this before.
>>
>> What do you mean by “narrow build” and “wide build” and what exactly is
>> the bug “narrow builds” of Python 3.2 have in common with
>> JavaScript/ECMAScript? To which implementation of ECMAScript are you
>> referring – or are you referring to the Specification as such?
> 
> The ECMAScript spec says that strings are stored and represented in
> UTF-16.

No, it does not (which Edition?).  It says in Edition 5.1:

| 8.4 The String Type
| 
| The String type is the set of all finite ordered sequences of zero or more
| 16-bit unsigned integer values (“elements”). […] Each element is regarded
| as occupying a position within the sequence. These positions are indexed
| with nonnegative integers.  The first element (if any) is at position 0,
| the next element (if any) at position 1, and so on. The length of a
| String is the number of elements (i.e., 16-bit values) within it. 
| 
| […]
| When a String contains actual textual data, each element is considered to
| be a single UTF-16 code unit.  Whether or not this is the actual storage
| format of a String, the characters within a String are numbered by
| their initial code unit element position as though they were represented
| using UTF-16. All operations on Strings (except as otherwise stated) treat
| them as sequences of undifferentiated 16-bit unsigned integers; they do
| not ensure the resulting String is in normalised form, nor do they ensure
| language-sensitive results.
| 
| NOTE
| The rationale behind this design was to keep the implementation of Strings
| as simple and high-performing as possible. The intent is that textual data
| coming into the execution environment from outside (e.g., user input, text
| read from a file or received over the network, etc.) be converted to
| Unicode Normalised Form C before the running program sees it. Usually this
| would occur at the same time incoming text is converted from its original
| character encoding to Unicode (and would impose no additional overhead).
| Since it is recommended that ECMAScript source code be in Normalised Form
| C, string literals are guaranteed to be normalised (if source text is
| guaranteed to be normalised), as long as they do not contain any Unicode
| escape sequences.

> You can see the same thing in Javascript too. Here's a little demo I
> just knocked together:
> 
> <script>
> function foo()
> {
> var txt=document.getElementById("in").value;
> var msg="";
> for (var i=0;i<txt.length;++i)
>   msg+="["+i+"]: "+txt.charCodeAt(i)+" "+txt.charCodeAt(i).toString(16)+"\n";
> document.getElementById("out").value=msg;
> }
> </script>
> <input id=in><input type=button onclick="foo()"
> value="Show"><br><textarea id=out rows=25 cols=80></textarea>

What an awful piece of code.

> Give it an ASCII string

You mean a string of Unicode characters that can also be represented with 
the US-ASCII encoding.  There are no "ASCII strings" in conforming 
ECMAScript implementations.  And a string of Unicode characters whose code 
points lie within the BMP already suffices to see one index per character.

> and you'll see, as expected, one index (based on string indexing or
> charCodeAt, same thing) for each character. Same if it's all BMP. But put
> an astral character in and you'll see 00.00.d8.00/24 (oh wait, CIDR
> notation doesn't work in Unicode) come up. I raised this issue on the
> Google V8 list and on the ECMAScript list es-discuss at mozilla.org, and was
> basically told that since JavaScript has been buggy for so long, there's
> no chance of ever making it bug-free:
> 
> https://mail.mozilla.org/pipermail/es-discuss/2012-December/027384.html

You misunderstand, and I am not buying Rick's answer.  The problem is not 
that String values are defined as units of 16 bits.  The problem is that the 
length of a primitive String value in ECMAScript, and the position of a 
character, are defined in terms of 16-bit units instead of characters (as 
the sketch further below illustrates).  There is no bug, because ECMAScript 
specifies that Unicode characters beyond the Basic Multilingual Plane (BMP) 
need not be supported:

| 2 Conformance
| 
| A conforming implementation of this Standard shall interpret characters in
| conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC
| 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form,
| implementation level 3. If the adopted ISO/IEC 10646-1 subset is not
| otherwise specified, it is presumed to be the BMP subset, collection 300.
| If the adopted encoding form is not otherwise specified, it is presumed to
| be the UTF-16 encoding form.

But they can:

| A conforming implementation of ECMAScript is permitted to provide
| additional types, values, objects, properties, and functions beyond those
| described in this specification. In particular, a conforming 
| implementation of ECMAScript is permitted to provide properties not
| described in this specification, and values for those properties, for
| objects that are described in this specification.
| 
| A conforming implementation of ECMAScript is permitted to support program
| and regular expression syntax not described in this specification. In
| particular, a conforming implementation of ECMAScript is permitted to
| support program syntax that makes use of the “future reserved words”
| listed in 7.6.1.2 of this specification.
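
By default, a conforming implementation exhibits exactly the 16-bit-unit 
behaviour described above.  For example, in a current browser console (only 
a sketch; the two hexadecimal values are the UTF-16 surrogate pair for 
U+1D122):

  /* "𝄢" (U+1D122) written as its two UTF-16 code units */
  var s = "\uD834\uDD22";

  /* 2, because the length counts 16-bit units, not characters */
  s.length;

  /* "d834" and "dd22", the lead and the trail surrogate, one index each */
  s.charCodeAt(0).toString(16);
  s.charCodeAt(1).toString(16);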

People have found ways to make this work in ECMAScript implementations.  For 
example, it is possible to scan a normalized string for lead surrogates:

  String.fromCharCode = (function () {
    /* keep a reference to the built-in method */
    var _fromCharCode = String.fromCharCode;
    var span;
    
    return function () {
      var a = [];       

      for (var i = 0, len = arguments.length; i < len; ++i)
      {
        var arg = arguments[i];
        var ch;

        if (arg > 0xFFFF)
        {
          /* let the HTML parser generate the character (as a surrogate
             pair) from a numeric character reference */
          if (typeof span == "undefined")
          {
            span = document.createElement("span");
          }

          span.innerHTML = "&#" + arg + ";";
          ch = span.firstChild.nodeValue;
        }
        else
        {
          ch = _fromCharCode(arg);
        }

        a.push(ch);
      }

      return a.join("");
    };
  }());

  /* "𝄢" (U+1D122 MUSICAL SYMBOL F CLEF) */
  var sFClef = String.fromCharCode(0x1D122);

  String.prototype.getLength = function () {
    /* a lead surrogate and the following code unit count as one character */
    return (this.match(/[\uD800-\uDBFF][^\uD800-\uDBFF]|[\S\s]/g)
                       || []).length;
  };

  /* 1 */
  sFClef.getLength()

(String.prototype.charAt() etc. are left as an exercise to the reader.)

Tested in Chromium 25.0.1364.160 Debian 7.0 (186726), which according to 
Wikipedia should feature V8 3.15.11.5.
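
One way that exercise might be approached (only a sketch, reusing the same 
regular expression; the name getCharAt is arbitrary, and negative or 
non-integer positions are not handled):

  String.prototype.getCharAt = function (pos) {
    /* as in getLength(): a lead surrogate and the following code unit
       are treated as one character */
    var chars = this.match(/[\uD800-\uDBFF][^\uD800-\uDBFF]|[\S\s]/g) || [];
    return pos < chars.length ? chars[pos] : "";
  };

  /* "𝄢" */
  sFClef.getCharAt(0);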

But yes, there should be native support for Unicode characters with code 
points beyond the BMP, and evidently that does _not_ require a second 
language; just a few tweaks to the algorithms.
 
> Fortunately for Python, there are version numbers, and policies that
> permit bugs to actually get fixed. (Which is why, for instance, Debian
> Squeeze still ships Python 2.6 rather than upgrading to 2.7 - in case
> some script is broken by that change.

Debian already ships Python 3.1 in Stable, disproving your argument.

> Can't do that with web browsers.)

Yes, you could.  It has been done before.

> As of Python 3.3, all Pythons function the same way: it's
> semantically a "wide build" (UTF-32), but with a memory usage
> optimization. That's how it needs to be.

It is _not_ necessary to use the memory-expensive UTF-32 or a memory-cheaper 
mixed encoding to represent characters beyond the BMP.  UTF-32 would be more 
runtime-efficient than any other encoding for such strings, though, because 
you could divide the storage size by the fixed 32-bit unit width to get the 
length, and would not have to scan for lead surrogates to determine a 
character's position.
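
To illustrate the runtime argument, a fixed-width representation can be 
sketched in ECMAScript itself (assuming typed-array support, e.g. 
Uint32Array; toCodePoints() is only an illustrative helper): one 32-bit 
element per character, so the length is simply the element count and 
positional access needs no scanning for lead surrogates.

  function toCodePoints(s)
  {
    /* one array element per character, including those beyond the BMP */
    var chars = s.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\S\s]/g) || [];
    var cps = new Uint32Array(chars.length);

    for (var i = 0, len = chars.length; i < len; ++i)
    {
      var c = chars[i];

      /* combine a surrogate pair into the code point it represents */
      cps[i] = (c.length > 1)
        ? 0x10000
            + (c.charCodeAt(0) - 0xD800) * 0x400
            + (c.charCodeAt(1) - 0xDC00)
        : c.charCodeAt(0);
    }

    return cps;
  }

  var cps = toCodePoints(sFClef);

  /* 1 */
  cps.length;

  /* "1d122" */
  cps[0].toString(16);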

-- 
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.


