[Tutor] working with strings in python3
Steven D'Aprano
steve at pearwood.info
Tue Apr 19 19:50:56 CEST 2011
Rance Hall wrote:
> On Mon, Apr 18, 2011 at 9:50 PM, Marc Tompkins <marc.tompkins at gmail.com> wrote:
>> On Mon, Apr 18, 2011 at 6:53 PM, Rance Hall <ranceh at gmail.com> wrote:
>>
>>> I'm going to go ahead and use this format even though it is deprecated
>>> and then later when we upgrade it I can fix it.
>>>
>> And there you have your answer.
>>
>>> A list might make sense, but printing a message one word at a time
>>> doesn't seem to me like much of a time saver.
>>
>> Did you try my example code? It doesn't "print a message one word at a
>> time"; any time you print " ".join(message), you get the whole thing. Put a
>> \n between the quotes, and you get the whole thing on separate lines.
>>
>
> I think you misunderstood me, I simply meant that the print "
> ".join(message) has to parse through each word in order to get any
> output, I didn't mean to suggest that you got output one word at a
> time. Sorry for the confusion.
Well, yes, but you have to walk over each word at some point. The join
idiom merely puts that off until just before you need the complete
string, instead of walking over them over and over and over again.
That's why the join idiom is usually better: it walks over each string
once, while repeated concatenation has the potential to walk over each
one dozens, hundreds or thousands of times (depending on how many
strings you have to concatenate). To be precise: if there are N strings
to add, the join idiom does work proportional to N, while the repeated
concatenation idiom does work proportional to N*N.
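To make that N versus N*N claim concrete, here is a rough sketch (not from the original post) that counts character copies under the naive model, where every `+` builds a brand-new string by copying both operands; function names are mine:

```python
def copies_with_concat(pieces):
    # Naive model: each '+' copies the accumulated result plus the new piece
    total = 0
    result = ""
    for p in pieces:
        total += len(result) + len(p)
        result += p
    return total

def copies_with_join(pieces):
    # str.join walks the pieces once, copying each character a single time
    return sum(len(p) for p in pieces)

pieces = ["a"] * 1000
print(copies_with_concat(pieces))  # 500500 -- roughly N*N/2
print(copies_with_join(pieces))    # 1000 -- exactly N
```

(In practice CPython's concatenation optimization can avoid much of that copying, which is exactly the safety net discussed below; the count above is what the algorithm costs when no such magic applies.)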
This is potentially *so* catastrophic for performance that recent
versions of CPython actually go out of their way to protect you from it
(other Python implementations, like Jython, IronPython and PyPy, might
not). But with a little bit of extra work, we can shoot ourselves in the
foot and see how bad *repeated* string concatenation can be:
>>> from timeit import Timer
>>>
>>> class Magic:
...     def __add__(self, other):
...         return other
...
>>> m = Magic()
>>> strings = ['a']*10000
>>>
>>> t1 = Timer('"".join(strings)', 'from __main__ import strings')
>>> t2 = Timer('sum(strings, m)', 'from __main__ import strings, m')
>>>
>>> t1.timeit(1000) # one thousand timing iterations
1.0727810859680176
>>> t2.timeit(1000)
19.48655891418457
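(An aside the timing above glosses over: `sum()` explicitly refuses string operands, precisely to stop people doing this by accident. Seeding it with a `Magic` instance sidesteps that check, because `Magic() + s` simply returns `s`, after which ordinary repeated concatenation takes over. A small sketch of both behaviours:)

```python
class Magic:
    def __add__(self, other):
        return other  # Magic() + s just returns s unchanged

strings = ["a", "b", "c"]

# Plain sum() rejects strings outright:
try:
    sum(strings, "")
except TypeError as exc:
    print(exc)  # sum() can't sum strings [use ''.join(seq) instead]

# Seeding with Magic() gets past the check, then repeated
# concatenation does the rest: (("a" + "b") + "c") -> "abc"
print(sum(strings, Magic()))  # abc
```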
In Real Life, the performance hit can be substantial. Some time ago
(perhaps a year?) there was a bug report that copying files over the
network was *really* slow in Python. From memory, the bug report was that
to download a smallish file took Internet Explorer about 0.1 second, the
wget utility about the same, and the Python urllib module about TEN
MINUTES. To cut a long story short, it turned out that the module in
question was doing repeated string concatenation. Most users never
noticed the problem because Python now has a special optimization that
detects repeated concatenation and does all sorts of funky magic to make
it smarter and faster, but for this one user, there was some strange
interaction between how Windows manages memory and the Python optimizer,
the magic wasn't applied, and consequently the full inefficiency of the
algorithm was revealed in all its horror.
Bottom line: unless you have actually timed your code and have hard
measurements showing otherwise, you should always expect repeated string
concatenation to be slow and the join idiom to be fast.
--
Steven