Efficient string concatenation methods

Raymond Hettinger python at rcn.com
Sun May 2 21:52:27 EDT 2004


ocrow at skymind.com (Oliver Crow) wrote in message news:<b3a34c4c.0405011643.70c163c3 at posting.google.com>...
> As a realtive python newb, but an old hack in general, I've been
> interested in the impact of having string objects (and other
> primitives) be immutable.  It seems to me that string concatenation is
> a rather common operation, and in Python having immutable strings
> results in a performance gotcha for anyone not aware of the impact of
> doing lots of concatenation in the obvious way.
> 
> I found several sources with advice for how to do concatenation in a
> pythonic way (e.g. ref#1), but I hadn't seen any measurements or
> comparisons. So, I put together a little test case and ran it through
> for six different methods.  Here's the results, and my conclusions:
> 
> http://www.skymind.com/~ocrow/python_string/
> 
> I'd be happy to hear if anyone else has done similar tests and if
> there are any other good candidate methods that I missed.

Nice work.  Beautifully presented.

A few thoughts:

* The relative speed of list.append() and list comprehensions versus
other approaches will improve dramatically in Py2.4.  IOW, the
rankings are version specific.  Also, append works faster if you
precompute the attribute lookup before the loop:  build =
str_list.append.

* The underlying issue has little to do with the immutability of
strings.  The underlying C code is fearless and would mutate the heck
out of it if that is what it took without being visible through the
API.  The real issue is how much space to allocate for the final
string.  When iterating, as you do, there is little forewarning about
how many more entries to expect or the length of each.  If too little
space is allocated, then room for a larger string needs to be created
and the previous result recopied (giving the O(n**2) behavior you're
trying to get rid of).

* For arrays, the advantage of mutability does not play out through
the fromstring() method (which suffers the previously mentioned
allocation issues).  The way to capitalize is to make a long array all
at once and then make slice assignments into it to add the strings. 
This is essentially what str.join() does under the hood (it loops over
the input once and adds up the string lengths, allocates all necessary
space, and re-loops making slice assignments) -- this is how it gets
its O(n) behavior.  For arrays, I suspect the overhead of tracking the
slice pointer would overwhelm the benefits for small values of n.  An
interesting comparison is to time "ans=[]; build=ans.append; for i in
xrange(n): build(i)"  versus "ans=[None]*n: for i in xrange(n):
ans[i]=i".


Raymond Hettinger



More information about the Python-list mailing list