[Tutor] working with strings in python3

Steven D'Aprano steve at pearwood.info
Tue Apr 19 19:50:56 CEST 2011


Rance Hall wrote:
> On Mon, Apr 18, 2011 at 9:50 PM, Marc Tompkins <marc.tompkins at gmail.com> wrote:
>> On Mon, Apr 18, 2011 at 6:53 PM, Rance Hall <ranceh at gmail.com> wrote:
>>
>>> I'm going to go ahead and use this format even though it is deprecated
>>> and then later when we upgrade it I can fix it.
>>>
>> And there you have your answer.
>>
>>> A list might make sense, but printing a message one word at a time
>>> doesn't seem to me like much of a time saver.
>>
>> Did you try my example code?  It doesn't "print a message one word at a
>> time"; any time you print " ".join(message), you get the whole thing.  Put a
>> \n between the quotes, and you get the whole thing on separate lines.
>>
> 
> I think you misunderstood me, I simply meant that the print "
> ".join(message) has to parse through each word in order to get any
> output, I didn't mean to suggest that you got output one word at a
> time.  Sorry for the confusion.

Well, yes, but you have to walk over each word at some point. The join 
idiom merely puts that off until just before you need the complete 
string, instead of walking over them over and over and over again. 
That's why the join idiom is usually better: it walks over each string 
once, while repeated concatenation has the potential to walk over each 
one dozens, hundreds or thousands of times (depending on how many 
strings you have to concatenate). To be precise: if there are N strings 
to add, the join idiom does work proportional to N, while the repeated 
concatenation idiom does work proportional to N*N.
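The two idioms side by side, as a minimal sketch (the word list and sizes here are invented for illustration, not taken from the thread):

```python
# Repeated concatenation vs. the join idiom, building the same string.
words = ["the", "quick", "brown", "fox"] * 2500  # 10,000 short strings

# Repeated concatenation: each += may copy everything built so far,
# so the total work grows roughly as N*N in the worst case.
result_concat = ""
for word in words:
    result_concat += word + " "
result_concat = result_concat.rstrip()

# The join idiom: collect the pieces, then walk over them once.
result_join = " ".join(words)

assert result_concat == result_join
```

Both produce the identical string; the difference is only in how many times each character is copied along the way.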

This is potentially *so* catastrophic for performance that recent 
versions of CPython actually go out of their way to protect you from it 
(other Python implementations, like Jython, IronPython and PyPy, might 
not). But with a little bit of extra work, we can shoot ourselves in 
the foot and see how bad *repeated* string concatenation can be:


>>> from timeit import Timer
>>>
>>> class Magic:
...     def __add__(self, other):
...         return other
...
>>> m = Magic()
>>> strings = ['a']*10000
>>>
>>> t1 = Timer('"".join(strings)', 'from __main__ import strings')
>>> t2 = Timer('sum(strings, m)', 'from __main__ import strings, m')
>>>
>>> t1.timeit(1000)  # one thousand timing iterations
1.0727810859680176
>>> t2.timeit(1000)
19.48655891418457
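(In case the Magic class looks mysterious: sum() deliberately refuses a 
str start value, precisely to steer people away from this quadratic 
pattern, so the timing demo needs a small trick to sidestep that check. 
A sketch of what is going on:)

```python
# sum() raises TypeError if you hand it a str start value, with a hint
# pointing you at ''.join() instead.
try:
    sum(["a", "b"], "")
except TypeError:
    blocked = True  # CPython refuses to sum strings directly

# Magic.__add__ simply returns the other operand, so Magic() + "a" is "a".
# Passing a Magic instance as sum()'s start value slips past the str
# check while leaving the repeated str + str concatenation intact.
class Magic:
    def __add__(self, other):
        return other

assert Magic() + "a" == "a"
assert sum(["a", "b", "c"], Magic()) == "abc"
```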


In Real Life, the performance hit can be substantial. Some time ago 
(perhaps a year?) there was a bug report that copying files over the 
network was *really* slow in Python. From memory, the bug report was 
that to download a smallish file took Internet Explorer about 0.1 
second, the wget utility about the same, and the Python urllib module 
about TEN MINUTES. To cut a long story short, it turned out that the 
module in question was doing repeated string concatenation. Most users 
never noticed the problem because Python now has a special optimization 
that detects repeated concatenation and does all sorts of funky magic 
to make it smarter and faster, but for this one user, there was some 
strange interaction between how Windows manages memory and the Python 
optimizer, the magic wasn't applied, and consequently the full 
inefficiency of the algorithm was revealed in all its horror.


Bottom line: unless you have actually timed your code and have hard 
measurements showing otherwise, you should always expect repeated string 
concatenation to be slow and the join idiom to be fast.



-- 
Steven
