best way to replace first word in string?

Steven D'Aprano steve at REMOVETHIScyber.com.au
Sat Oct 22 07:05:43 EDT 2005


On Thu, 20 Oct 2005 10:25:27 -0700, Micah Elliott wrote:

> I thought that string concatenation was rather
> expensive, so its being faster than %-formatting surprised me a bit:

Think about what string concatenation actually does: 

s = "hello " + "world"

In pseudo-code, it does something like this:

- Count chars in "hello" (six chars).
- Count chars in "world" (five chars).
- Allocate eleven bytes.
- Copy six chars from "hello " and five from "world" into the newly
allocated bit of memory.

(This should not be thought of as the exact process that Python uses, but
simply illustrating the general procedure.)

Now think of what str-formatting would do:

s = "hello %s" % "world"

In pseudo-code, it might do something like this:

- Allocate a chunk of bytes, hopefully not too big or too small.
- Repeat until done:
    - Copy chars from the original string into the new string, 
    until it hits a %s placeholder.
    - Grab the next string from the args, and copy chars from 
    that into the new string. If the new string is too small, 
    reallocate memory to make it bigger, potentially moving 
    chunks of bytes around.

The string formatting pseudo-code is a lot more complicated and has to do
more work than just blindly copying bytes. It has to analyse the bytes it
is copying, looking for placeholders.

So string concatenation is more efficient, right? No. The thing is, a
*single* string concatenation is almost certainly more efficient than a
single string concatenation. But now look what happens when you repeat it:

s = "h" + "e" + "l" + "l" + "o" + " " + "w" + "o" + "r" + "l" + "d"

This ends up doing something like this:

- Allocate two bytes, copying "h" and "e" into them.
- Allocate three bytes, copying "he" and "l" into them.
- Allocate four bytes, copying "hel" and "l" into them.
... 
- Allocate eleven bytes, copying "hello worl" and "d" into them.

The problem is that string concatenation doesn't scale efficiently. String
formatting, on the other hand, does more work to get started, but scales
better.

See, for example, this test code:

py> def tester(n):
...     s1 = ""
...     s2 = "%s" * n
...     bytes = tuple([chr(i % 256) for i in range(n)])
...     t1 = time.time()
...     for i in range(n):
...             s1 = s1 + chr(i % 256)
...     t1 = time.time() - t1
...     t2 = time.time()
...     s2 = s2 % bytes
...     t2 = time.time() - t2
...     assert s1 == s2
...     print t1, t2
...
py> x = 100000
py> tester(x)
3.24212408066 0.01252317428
py> tester(x)
2.58376598358 0.01238489151
py> tester(x)
2.76262307167 0.01474809646

The string formatting is two orders of magnitude faster than the
concatenation. The speed difference becomes even more obvious when you
increase the number of strings being concatenated:

py> tester(x*10)
2888.56399703 0.13130998611

Almost fifty minutes, versus less than a quarter of a second.


-- 
Steven.




More information about the Python-list mailing list