modifying small chunks from long string

Steven D'Aprano steve at REMOVEMEcyber.com.au
Mon Nov 14 03:04:09 EST 2005


MackS wrote:

> This program is meant to process relatively long strings (10-20 MB) by
> selectively modifying small chunks one at a time. 

...

> Can I get over this performance problem without reimplementing the
> whole thing using a barebones list object? 

Is that a problem? Are you using string methods?

Some possibilities are:

array.array('c')
UserString.MutableString
cStringIO

depending on exactly what you are trying to do, but if 
it is just straight forward replacement, (i.e. you 
never actually change the length of the string) I'd 
guess the list idiom would be simplest.

If you have enough memory, just keep two copies of the 
string, the original and the modified:

size =  1024*1024*20  # 20 MB
original = "A" * size
copy = [None] * size
for i in range(size):
     copy[i] = original[i].lower()
copy = ''.join(copy)

This takes 530 seconds (8 minutes) on my 
not-especially-powerful system. Ouch. How does that 
compare to your code?

If you can operate on moderately sized chunks of text 
at a time rather than one character at a time, you'll 
see some performance increase.

Also the above code not only has a 20MB string, plus a 
20MB list, it also carries around a list of 20+ million 
int objects. If I was paying more attention before I 
hit the enter key, I would have used xrange instead of 
range and saved a lot of memory.

It also has to assemble the new 20MB string before 
garbage-collecting the character list. It is possible 
that the above test code has well in excess of 100MB of 
data at its peak.

With those problems in mind, I try this:

from __future__ import division

size =  1024*1024*20  # 20 MB
original = "A" * size
copy = []
blocksize = 1024
for i in xrange(size//blocksize):
    copy.append( \
    original[i*blocksize:(i+1)*blocksize].lower() )
copy = ''.join(copy)

This makes a BIG difference: from 8 minutes down to 0.5 
seconds. That's a speed-up by a factor of about one 
thousand.

I was so amazed at the speed increase I re-wrote the 
code with a couple of simple tests, then ran it again. 
Same result. Unless I've made some stupid mistake, I 
think this is the way to go.



-- 
Steven.




More information about the Python-list mailing list