Speed of string += string

Tim Peters tim_one at email.msn.com
Sat Apr 12 21:22:13 EDT 2003


[Roy Smith]
> One of the very few things I like about C++ over Python is that the STL
> is very careful to document the complexity of the various operations.
> AFAICT, no such documentation exists for Python.

That's so.  It will be repaired as soon as someone volunteers to write the
docs <wink>.

> True, it's documented that strings are immutable, and from that you
> should be able to infer that string addition requires lots of creating
> and destroying of objects, which sounds like it should work out to
> being quadratic, but that's not quite the same as having the
> algorithmic complexity spelled out for you.

Not the same at all.  For example, string[i:j] in Python takes time
proportional to j-i, but that doesn't *follow* from immutability (there are
more complicated implementation schemes than Python uses that would allow
faster slicing -- in return for losing elsewhere).
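
To make the subject line concrete, here's a rough timing sketch (the
absolute numbers mean nothing; the growth pattern is the point) contrasting
repeated += -- which has to build a brand new string on every iteration, at
least on the CPythons released to date -- with the usual idiom of
collecting the pieces in a list and joining once:

    import time

    def concat_plus(n):
        # each iteration builds a new string and copies everything
        # accumulated so far, so total work grows roughly as n**2
        s = ""
        for i in range(n):
            s = s + "x"
        return s

    def concat_join(n):
        # collect the pieces and join once at the end -- roughly linear in n
        pieces = []
        for i in range(n):
            pieces.append("x")
        return "".join(pieces)

    for n in (20000, 40000, 80000):
        t0 = time.time()
        concat_plus(n)
        t1 = time.time()
        concat_join(n)
        t2 = time.time()
        print("n=%6d  +=: %.3fs  join: %.3fs" % (n, t1 - t0, t2 - t1))

Double n and the += column should roughly quadruple, while the join column
should merely double.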

> As another example, I can guess that Python dictionaries work like STL
> maps on the inside, but that would be just a guess.  Is inserting an
> element logN?

In all versions of CPython released to date, expected time O(1), worst O(N),
and no amortized bound guaranteed better than O(N).  If you supply a
particularly poor __hash__ implementation, or have spectacularly unlucky key
distributions, it can be expected-case O(N).  Note that these don't count
the time for hashing the key.  If, e.g., the key is a string s, and s isn't
interned, and s hasn't been hashed before, then hash(s) takes O(len(s)) time
by itself; but if s has been hashed before, hash(s) takes constant time (its
hash code is cached in the string object).  So the full story can be quite
involved.
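
To illustrate the "particularly poor __hash__" case, here's a toy sketch
(BadKey is made up just for the demonstration):  every key hashes to the
same value, so every insert has to crawl past all the keys already in the
dict, and building the dict degrades from roughly linear overall to roughly
quadratic overall:

    import time

    class BadKey:
        # a spectacularly poor __hash__:  every instance collides with
        # every other, so dict operations degrade to expected O(N) apiece
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            return 42
        def __eq__(self, other):
            return self.value == other.value

    def build(keys):
        d = {}
        for k in keys:
            d[k] = None
        return d

    for n in (1000, 2000, 4000):
        bad = [BadKey(i) for i in range(n)]
        good = list(range(n))   # ints hash to distinct values
        t0 = time.time()
        build(bad)
        t1 = time.time()
        build(good)
        t2 = time.time()
        print("n=%5d  bad __hash__: %.4fs  good hash: %.4fs" %
              (n, t1 - t0, t2 - t1))

Double n and the bad-hash column should roughly quadruple while the
good-hash column merely doubles.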

> Is keys() linear?

Yes, and that's both best and worst case.

> Is del on an element logN?  Is has_key() logN?

Both the same as insert.
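
If you'd rather measure than take my word for it, here's a rough sketch
(again the absolute numbers are meaningless; watch how the columns behave
as the dict grows):  membership tests ("in", the same machinery as
has_key()) and an insert+del pair stay roughly flat, while keys() -- which
builds a list of all the keys in the Pythons of today -- grows with the
dict:

    import time

    def per_op(func, reps):
        # average seconds per call of func over reps calls
        t0 = time.time()
        for i in range(reps):
            func()
        return (time.time() - t0) / reps

    for n in (10000, 100000, 1000000):
        d = {}
        for i in range(n):
            d[i] = None

        def lookup():
            return 1 in d            # same cost as d.has_key(1)

        def insert_and_del():
            d[-1] = None             # insert a key that isn't there ...
            del d[-1]                # ... and remove it again

        def all_keys():
            return d.keys()          # builds a fresh list of len(d) keys

        print("n=%7d  in: %.2e  ins+del: %.2e  keys(): %.2e" %
              (n, per_op(lookup, 100000), per_op(insert_and_del, 100000),
               per_op(all_keys, 100)))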

> Those would be my guesses.  But they're just guesses.

Now you know.  The *language* doesn't guarantee any of this, though, and
probably never will.

Guido worked on an earlier language (ABC) that used implementation
techniques with good worst-case bounds, chiefly balanced binary trees.  But
the overhead of those schemes was a major drag on performance for most apps
most of the time.  Python's implementation is generally more pragmatic,
aiming for simpler schemes with excellent normal-case performance, not
worrying much about worst-case performance.

For example, until Python 2.3, the worst-case performance of list.sort() was
quadratic (although in 1.5.2 that was extremely difficult to provoke even
with a deep understanding of the 1.5.2 samplesort's implementation; it was
easy to provoke in pre-1.5.2 quicksort variants).

Python 2.3 has a new sort with worst-case N log N behavior, but that's not
why it was adopted:  the 2.3 sort proved much faster than the 2.2 sort in
many common cases where the list already has significant pre-existing
order, and proved no slower than the 2.2 sort when the input data is
randomly ordered.  Worst-case analysis played no significant role in
deciding to switch to the new sort implementation, and I expect we would
have switched even if the new sort's worst-case performance were cubic (but
exceedingly rare).
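
Here's a rough sketch of the kind of cases that drove the decision (the
absolute times mean nothing; the point is how much a sort gains when the
input already has significant order):

    import random
    import time

    def time_sort(data):
        # sort a fresh copy so every trial sees the same input
        copy = list(data)
        t0 = time.time()
        copy.sort()
        return time.time() - t0

    n = 100000
    rand = [random.random() for i in range(n)]

    ordered = list(rand)
    ordered.sort()

    # significant pre-existing order:  sorted except for a few random
    # elements tacked on at the end
    nearly = ordered[:-10] + [random.random() for i in range(10)]

    print("random:          %.4fs" % time_sort(rand))
    print("already sorted:  %.4fs" % time_sort(ordered))
    print("nearly sorted:   %.4fs" % time_sort(nearly))

Under the 2.3 sort the last two lines should come in far under the first;
comparing against a 2.2 run is left as an exercise <wink>.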
