PATCH: Speed up direct string concatenation by 20+%!

Thu Sep 28 22:07:23 EDT 2006

This is such a long posting that I've broken it out into sections.
Note that while developing this patch I discovered a Subtle Bug
in CPython, which I have discussed in its own section below.

____________
THE OVERVIEW

I don't remember where I picked it up, but I remember reading years
ago that the simple, obvious Python approach for string concatenation:
        x = "a" + "b"
is slow.  Obviously it's fine for the simple case above, but if you're
adding together twenty-five things:
        x =  a + b + c + d + ... + y
the performance is terrible, as the interpreter creates temporary
objects
for each intermediate result.  (a + b, then (a + b) + c,
then ((a + b) + c) + d, and so on.)

The suggested idiom for fast string concatenation was this:
        x = "".join([a, b, c, d, ... y])
This works fine.  But I don't like this approach, not only because it's
kind of ugly and inobvious, but because it violates the "one obvious
way
to do it" principle.

I mentioned all this to a friend of mine (Mike Thornburgh), and told
him matter-of-factly that I'd like to fix this, but the design of
Python
simply precluded it.  He said "no it doesn't!" and proceeded to
describe
in great detail an approach to permit fast string concatenation using
+.
I have implemented what he suggested--with only slight changes--and
present it to you below.  I only had to touch four files, and only did
major surgery on one.  The resulting Python passes the test suite as
well
as my unmodified locally-built Python 2.5.  (I didn't install the
external
source trees for stuff like SQLite, so it fails those tests.)

Note that times have changed; using Python 2.5, a + b + c isn't *that*
much slower than "".join([]).  But is is still slower.  Or, was--with
my
patch, a + b + c becomes *the fastest method* for string concatenation.
Hooray!

(It didn't pay off as *much* as I'd hoped, but it's still an
improvement.)

_________
THE PATCH

The core concept: adding two strings together no longer returns a pure
"string" object.  Instead, it returns a "string concatenation" object
which holds references to the two strings but does not actually
concatenate
them... yet.  The strings are concatenated only when someone requests
the
string's value, at which point it allocates all the space it needs and
renders the concatenated string all at once.

More to the point, if you add multiple strings together (a + b + c),
it *doesn't* compute the intermediate strings (a + b).

Upsides to this approach:

        * String concatenation using + is now the fastest way to
concatenate
          strings (that I know of).

        * Throw off the shackles of "".join([]), you don't need it
anymore.

        * Did I mention it was faster?

Downsides to this approach:

        * Changes how PyStringObjects are stored internally; ob_sval is
no
          longer a char[1], but a char *.  This makes each StringObject
          four bytes larger.

        * Adds another memory dereference in order to get the value of
          a string, which is a teensy-weensy slowdown.

        * Would force a recompile of all C modules that deal directly
          with string objects (which I imagine is most of them).

        * Also, *requires* that C modules use the PyString_AS_STRING()
          macro, rather than casting the object and grabbing ob_sval
          directly.  (I was pleased to see that the Python source
          was very good about using this macro; if all Python C
          modules are this well-behaved, this point is happily moot.)

        * On a related note, the file Mac/Modules/MacOS.c implies
          that there are Mac-specific Python scripts that peer
          directly into string objects.  These would have to be
          changed to understand the new semantics.

        * String concatenation objects are 36 bytes larger than
          string objects, and this space will often go unreclaimed
          after the string is rendered.

        * When rendered, string concatenation objects storing long
          strings will allocate a second buffer from the heap to
          store the string.  So this adds some minor allocation
          overhead (though this is offset by the speed gain from
          the approach overall).

        * Will definitely need some heavy review before it could
          go in, in particular I worry I got the semantics surrounding
          "interned" strings wrong.

Internally, a "string concatenation" object is the same as a "string"
object with two extra fields: an array of references to string objects
(and string concatenation objects), and a count.  Also, ob_sstate now
stores a third flag, indicating whether or not the string is a string
concatenation object.  External users don't need to worry about these
details, they just use PyString_AS_STRING() and all is well with their
world.

I've already added some optimizations to the approach:

        * If you're adding a string to an existing string concatenation
          object whose reference count is 1, who isn't full, and who
          hasn't been rendered yet, it adds the string directly to the
          existing concatenation object (rather than creating a new
one).
          This works whether the existing object is on the right or the
          left side.

        * The general case is a + b + c ..., where you're only adding
          strings to the *end*.  This will build a degenerate tree
where
          the first slot points to the child.  The tree for
concatenating
          the first twenty-two letters of the alphabet would look like
this:

                [0 1 2 3 4 5 6 7]
                 | | | | | | | |
                 | v v v v v v v
                 | p q r s t u v
                 |
                 v
                [0 1 2 3 4 5 6 7]
                 | | | | | | | |
                 | v v v v v v v
                 | i j k l m n o
                 |
                 v
                [0 1 2 3 4 5 6 7]
                 | | | | | | | |
                 v v v v v v v v
                 a b c d e f g h

          The rendering function is by necessity recursive.  However,
          it is optimized for this shape; when the tree looks like
this,
          it will be rendered purely iteratively--no recursion.

        * If, after rendering, the resulting string will be small
enough
          to fit inside the extra space normally used by the
concatenation
          object, the final string will be copied on top of this
otherwise
          now-wasted space.

______________
THE BENCHMARKS

Benchmark 1:
        def add(a, b, c, ... t): return a + b + c + ... + t
        for i in range(10000000): add("aaa", "bbb", "ccc", ..., "ttt")
Benchmark 2:
        for i in range(10000000): x = "aaa" + "bbb" + "ccc" + ... +
"ttt"
Benchmark 3:
        for i in range(10000000): x = "".join(["aaa", "bbb", "ccc",
..., "ttt"])
Benchmark 4:
        for i in range(10000000):
                x = "aaa"
                x += "bbb"
                x += "ccc"
                ...
                x += "ttt"
Benchmark 5:
        for i in range(10000000):
                array = []
                array.append("aaa")
                array.append("bbb")
                array.append("ccc")
                ...
                array.append("ttt")
                x = "".join(array)
Benchmark 6:
        for i in range(10000000):
                x = cStringIO.StringIO()
                x.write("aaa")
                x.write("bbb")
                x.write("ccc")
                ...
                x.write("ttt")

Benchmark | 2.5 release | 2.5 locally built | 2.5 modified | %
improvement
        1 |       32.1s |             32.8s |        26.7s |
22.8%
        2 |       18.7s |             18.7s |        15.1s |
23.8%
        3 |       16.1s |             16.2s |        16.6s |
-1.1%
        4 |       64.6s |             64.0s |        57.1s |
12.1%
        5 |       74.1s |             79.4s |        76.7s |
3.5%
        6 |      126.0s |  too slow, didn't bother!

______________
THE SUBTLE BUG

While developing this patch, I discovered a subtle bug in Python.
It happened only on one gruesome test in the middle of test_descr.py.
The problem was: '' == '' was returning False.

Starting on line 1116 of stringobject.c we have this:

	if (op == Py_EQ) {
		/* Supporting Py_NE here as well does not save
		   much time, since Py_NE is rarely used.  */
		if (a->ob_size == b->ob_size
		    && (a->ob_sval[0] == b->ob_sval[0]
			&& memcmp(a->ob_sval, b->ob_sval,
				  a->ob_size) == 0)) {
			result = Py_True;
		} else {
			result = Py_False;
		}
		goto out;
	}

The bug is with the "a->ob_sval[0] == b->ob_sval[0]".  This is
apparently
a speed optimization; check that the first byte is the same before
proceeding with the memcmp(). But it does this even when *both* strings
are *zero length*!  I doubt the Python source tree has an official
position on what the first byte of a zero-length string should be, but
I assert that it should be undefined.  In practice, it's seemingly
always
set to the same value in the released Python, but with my changes it's
now possible for them to be different.

I think it's sinful to compare the first bytes of zero-length strings.
Therefore I suggest this bugfix:

		if (a->ob_size == b->ob_size
/* >> change >> */  && ( (!a->ob_size || a->ob_sval[0] ==
b->ob_sval[0])
			&& memcmp(a->ob_sval, b->ob_sval,

I've already incorporated this fix in my code.

__________
THE FUTURE

If this patch is accepted, it paves the way for another cheap
optimization: string slices could be done without copying.
Add another field into the PyString structure: a reference
to another PyString *.  If non-NULL, it means the local ob_sval
actually points into the ob_sval owned by that other PyString *.
It'd be another four bytes, but we'd only need to incur that
when creating a slice; we could set a bit in ob_sstate indicating
"this is a slice", and if so we'd have to copy-on-write ob_sval,
and free the reference when the object was destroyed.

______________
THE SUBMISSION

I don't know the protocol from this point out; should I email the patch
somewhere?  Put it up on a web page?  Post some diff output?  (Or just
forget about it entirely?)

Also, for what it's worth: I fixed the .dsp so pythoncore builds under
VC6 again.  I would be happy to submit that separately.  (What can I
say, old habits die hard.  I can't stand the newer Microsoft IDEs.)

Cheers,

/larry/

p.s. No, I haven't looked at a Unicode port yet.  If there's interest
in
this patch at all, I might take a stab at it.