RE Module Performance

Chris Angelico rosuav at gmail.com
Fri Jul 12 12:21:29 EDT 2013


On Fri, Jul 12, 2013 at 8:45 PM, Devyn Collier Johnson
<devyncjohnson at gmail.com> wrote:
> Could you explain what you mean? What and where is the new Flexible String
> Representation?

(You're top-posting again. Please put your text underneath what you're
responding to - it helps maintain flow and structure.)

Python versions up to and including 3.2 came in two varieties: narrow
builds (commonly found on Windows) and wide builds (commonly found on
Linux). Narrow builds internally represented Unicode strings in
UTF-16, while wide builds used UTF-32. This is a problem, because it
means that moving a program from one build to the other actually
changes its behaviour:

Python 2.6.4 (r264:75706, Dec  7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u"\U00012345")
1

Python 2.7.4 (default, Apr  6 2013, 19:54:46) [MSC v.1500 32 bit
(Intel)] on win32
>>> len(u"\U00012345")
2

In fact, the narrow builds are flat-out buggy, because you can put
something in as a single character and have it stored as something
that simply isn't a single character. You can then pull it back out
as two separate characters and make a huge mess of things:

>>> s=u"\U00012345"
>>> s[0]
u'\ud808'
>>> s[1]
u'\udf45'

*Any* string indexing will be broken if there is even a single
character above U+FFFF ahead of it in the string.
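
(If you want to check which kind of build you're on, and see the
off-by-one effect for yourself, here's a rough sketch - the exact
reprs depend on your version, but the shape of the problem doesn't:)

import sys

s = u"\U00012345x"      # one astral character, then a plain 'x'
print(sys.maxunicode)   # 65535 on a narrow build, 1114111 on a wide one
print(len(s))           # 3 on a narrow build, 2 on a wide build
print(repr(s[1]))       # a lone surrogate on narrow builds, u'x' on wide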

Now, this problem is not unique to Python. Heaps of other languages
have the same issue, the same buggy behaviour, the same compromises.
What's special about Python is that it actually managed to come back
from that problem. (Google's V8 JavaScript engine, for instance, is
stuck with it, because the ECMAScript specification demands UTF-16. I
asked on an ECMAScript list and was told "can't change that, it'd
break code". So it's staying buggy.)

There are a number of languages that take the Texan RAM-guzzling
approach of storing all strings in UTF-32; Python (since version 3.3)
is among a *very* small number of languages that store strings in
multiple different ways according to their content. That's described
in PEP 393 [1], titled "Flexible String Representation". It details a
means whereby a Python string will be represented in, effectively,
UTF-32 with some of the leading zero bytes elided. Or if you prefer,
in either Latin-1, UCS-2, or UCS-4, whichever's the smallest it can
fit in. The difference between a string stored one-byte-per-character
and a string stored four-bytes-per-character is almost invisible to a
Python script; you can find out by checking the string's memory usage,
but otherwise you don't need to worry about it.

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32
>>> sys.getsizeof("asdfasdfasdfasd")
40
>>> sys.getsizeof("asdfasdfasdfasdf")
41

Adding another character adds just one more byte. (There's quite a bit of
overhead for small strings - GC headers and such - but it gets dwarfed
by the actual content after a while.)

>>> sys.getsizeof("\u1000sdfasdfasdfasd")
68
>>> sys.getsizeof("\u1000sdfasdfasdfasdf")
70

Two bytes to add another character.

>>> sys.getsizeof("\U00010001sdfasdfasdfasd")
100
>>> sys.getsizeof("\U00010001sdfasdfasdfasdf")
104

Four bytes. It uses only what it needs.
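
(To see that scaling without the fixed overhead getting in the way,
here's a rough sketch for 3.3+ that measures how much getsizeof()
grows per extra character - the exact totals vary by build, but the
per-character cost should come out close to 1, 2 and 4:)

import sys

def per_char(ch, n=1000):
    # Approximate storage cost, in bytes, of each extra copy of ch.
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / float(n)

print(per_char(u"a"))           # ~1.0: one byte per character
print(per_char(u"\u1000"))      # ~2.0: two bytes per character
print(per_char(u"\U00010001"))  # ~4.0: four bytes per character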

Strings in Python are immutable, so there's no need to worry about
upgrading or downgrading a string; there are a few optimizations
that can't be done, but they're fairly trivial. Look, I'll pull a jmf
and find a microbenchmark that makes 3.3 look worse:

2.7.4:
>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
[0.8175005482540385, 0.789617954237201, 0.8152240019332098]
>>> timeit.repeat('a=u"A"*100; a+=u"a"')
[0.8088905154146744, 0.8123691698246631, 0.8172558244134365]

3.3.0:
>>> timeit.repeat('a=u"A"*100; a+=u"\u1000"')
[0.9623714745976031, 0.970628669281723, 0.9696310564468149]
>>> timeit.repeat('a=u"A"*100; a+=u"a"')
[0.7017891938739922, 0.7024725209339522, 0.6989539173082449]

See? It's clearly worse on the newer Python! But actually, this is an
extremely unusual situation, and 3.3 outperforms 2.7 on the more
common case (where the two strings are of the same width).
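
(If you want to rerun that comparison on your own machine, here's
roughly how to do it as a script - a sketch only; absolute numbers
will differ with hardware and interpreter version:)

from __future__ import print_function  # so it runs unchanged on 2.7 and 3.3
import timeit

# Appending a BMP character to a one-byte-per-character string forces
# the result into a wider storage format; appending another one-byte
# character does not.
for label, stmt in [("widening append:  ", 'a = u"A" * 100; a += u"\\u1000"'),
                    ("same-width append:", 'a = u"A" * 100; a += u"a"')]:
    print(label, min(timeit.repeat(stmt, number=1000000)))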

Python's PEP 393 strings are following the same sort of model as the
native string type in a semantically-similar but
syntactically-different language, Pike. In Pike (also free software,
like Python), the string type can be indexed character by character,
and each character can be anything in the Unicode range; and just as
in Python 3.3, memory usage goes up by just one byte if every
character in the string fits inside 8 bits. So it's not as if this is
an untested notion; Pike has been running like this for years (I don't
know how long it's had this functionality, but it won't be more than
18 years as Unicode didn't have multiple planes until 1996), and
performance has been *just fine* for all that time. Pike tends to be
run on servers, so memory usage and computation speed translate
fairly directly into transactions per second. And there are some
sizeable commercial entities
using and developing Pike, so if the flexible string representation
had turned out to be a flop, someone would have put in the coding time
to rewrite it by now.

And yet, despite all these excellent reasons for moving to this way of
doing strings, jmf still sees his microbenchmarks as more important,
and so he jumps in on threads like this to whine about how Python 3.3
is somehow US-centric because it more efficiently handles the entire
Unicode range. I'd really like to take some highlights from Python and
Pike and start recommending that other languages take up the ideas,
but to be honest, I hesitate to inflict jmf on them all. ECMAScript
may have the right idea after all - stay with UTF-16 and avoid
answering jmf's stupid objections every week.

[1] http://www.python.org/dev/peps/pep-0393/

ChrisA


