String performance regression from python 3.2 to 3.3

Sat Mar 16 01:25:10 EDT 2013

On Sat, 16 Mar 2013 15:09:56 +1100, Chris Angelico wrote:

> On Sat, Mar 16, 2013 at 2:56 PM, Mark Lawrence <breamoreboy at yahoo.co.uk>
> wrote:
>> On 16/03/2013 02:44, Thomas 'PointedEars' Lahn wrote:
>>>
>>> Chris Angelico wrote:
>>>
>>>
>> Thomas and Chris, would the two of you be kind enough to explain to
>> morons such as myself how all the ECMAScript stuff relates to Python's
>> unicode as implemented via PEP 393 as you've lost me, easily done I
>> know.
> 
> Sure. Here's the brief version: It's all about how a string is exposed
> to a script.
> 
> * Python 3.2 Narrow gives you UTF-16. Non-BMP characters count twice. 
> * Python 3.2 Wide gives you UTF-32. Each character counts once. 
> * Python 3.3 gives you UTF-32, but will store it as compactly as
> possible. 
> * ECMAScript specifies the semantics of Python 3.2 Narrow.

And just for the record:

Unicode actually doesn't define what a "character" is. Instead, it talks 
about "code points", but for our purposes we can gloss over the 
differences and pretend that they are almost the same thing, except where 
noted. Some code points represent characters, some represent non-
characters, and some are currently unused.

UTF-16 is a *variable length* storage mechanism that says each code point 
takes either two or four bytes. Since Unicode includes more code points 
than will fit in two bytes, UTF-16 includes a mechanism for dealing with 
the additional code points:

* The first 65,536 code points are defined as the "Basic Multilingual 
Plane", or BMP. Each code point in the BMP is represented in UTF-16 by a 
16-bit value.

* The remaining 16 sets of 65,536 code points are defined as 
"Supplementary Multilingual Planes", or SMPs. Each code point in a SMP is 
represented by two 16 bit values, called a "surrogate pair". The "lead 
surrogate" will be in the range 0xD800...0xDBFF and the "trail surrogate" 
will be in the range 0xDC00...0xDFFF.

The disadvantage here is that you can't tell how far into a string you 
need to go to get to (say) the 1000th character (code point). If all of 
the first 1000 code points are in the BMP, then you can jump directly to 
byte offset 2000. If all of them are in a SMP, then you can jump directly 
to byte offset 4000. But since you don't usually know how many SMP code 
points are in the string, you have to walk through the string:

# Pseudo-code to return the code-point in string at index.
offset = 0  # Offset in pairs of bytes.
counter = 0
while offset < length of string counted in pairs of bytes:
    if string[offset] in 0xD800...0xDBFF:
        # Lead surrogate of a surrogate pair.
        if counter == index:
            return string[offset:offset+1]
        else:
            counter += 1
            index += 2  # Skip the next pair.
    elif string[offset] in 0xDC00...0xDFFF:
        # Trail surrogate found outside of a surrogate pair.
        raise Error
    else:
       # BMP code point.
       if counter == index:
           return string[offset]
       else:
           counter += 1
           index += 1

What a mess! Slow and annoying to get right. Not surprisingly, most 
implementations of UTF-16 don't do this, and Python is one of them. 
Instead, they assume that all code points take up the same space, and 
consequently they let you create *invalid Unicode strings* by splitting a 
surrogate pair:

This is in Python 3.2 narrow build:

py> s = chr(70000)
py> len(s)
2
py> a = s[0]
py> a == s
False
py> print(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud804' in 
position 0: surrogates not allowed

Oops! We have created an invalid Unicode string. A wide build will fix 
this, because it uses UTF-32 instead.

UTF-32 is a *fixed width* storage mechanism where every code point takes 
exactly four bytes. Since the entire Unicode range will fit in four 
bytes, that ensures that every code point is covered, and there is no 
need to walk the string every time you perform an indexing operation. But 
it means that if you're one of the 99.9% of users who mostly use 
characters in the BMP, your strings take twice as much space as 
necessary. If you only use Latin1 or ASCII, your strings take four times 
as much space as necessary.

So a Python wide build uses more memory, but gets string processing 
right. Python 3.3 has the best of both worlds, fixing the invalid string 
processing and avoiding memory waste.

One of the complications here is that often people conflate UTF-16 and 
UCS-2. It isn't entirely clear to me, but it seems that ECMAScript may be 
doing that. UCS-2 specifies that *all characters* are represented by 
*exactly* 16 bits (two bytes), and hence it is completely unable to deal 
with the Supplementary Multilingual Planes at all.

If I have read the ECMAScript spec correctly, it *completely ignores* the 
issue of surrogate pairs, and so is underspecified. Instead it talks 
about normalised form, which is entirely unrelated. Normalisation relates 
to the idea that many characters can be represented in two or more ways. 
For example, the character "ä" can be represented in at least two forms:

Normalised form: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

Canonical decomposition: U+0061 LATIN SMALL LETTER A + U+0308 COMBINING 
DIAERESIS

So both of these two strings represent the same letter, even though the 
second uses two code points and the first only one:

py> a = "\N{LATIN SMALL LETTER A WITH DIAERESIS}"
py> b = "a\N{COMBINING DIAERESIS}"

Arguably, they should print the same way too, although I think that will 
depend on how smart your terminal is, and/or on the font you use.

But I digress. The ECMAScript spec carefully specifies that it makes no 
guarantees about normalisation, which is right and proper for a language, 
but it says nothing about surrogates, and that is very poor.

So as I said, I suspect that ECMAScript is actually referring to UCS-2 
when it mentions UTF-16. If I'm right, that's pretty lousy.

-- 
Steven