doing hundreds of re.subs efficiently on large strings

Tue Mar 25 20:02:01 EST 2003

On Tue, 25 Mar 2003 21:46:04 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:

>I have a simple application that converts a file in one format to a file 
>in another format using string.replace and re.sub. There are about 150 
>replacements occurring on files that are on the order of a few KB.  Right 
>now, I represent the file contents as a string, and naively use 
>string.replace and re.sub to replace all the substrings that I need to 
>replace. I knew that I would have to come up with a faster, more memory 
>friendly solution eventually, but I just wanted to implement the basic 
>idea as simply as possible at first.
>
>Now I am at a loss as to how to proceed. I need a stringbuffer 
>equivalent that supports at least re.sub. Two alternatives that I have 
>seen for StringBuffer are using lists and then joining into a string at 
>the end, or using an array. Neither of these seems to be of much use to 
>me, since I am doing more than just appending to the end of the string. 
>Is there another pre-existing alternative that I'm overlooking? Or has 
>anybody come up with a good solution for this issue?
>
Is your substitution recursive? I.e., does each substitution operate on
the finished result of a previous substitution?  As in

 >>> 'abc'.replace('b','x').replace('xc','z')
 'az'
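
To make the distinction concrete, here is a small sketch (Python 3 syntax;
the names table and pattern are made up for illustration) comparing chained
replacement with a single pass over the original text. The single pass here
uses re.sub with a function as the replacement argument, just as a compact
way to show the behaviour:

```python
import re

text = 'abc'

# Chained: the second replace sees the output of the first,
# so the 'x' produced from 'b' combines with 'c' to match 'xc'.
chained = text.replace('b', 'x').replace('xc', 'z')

# Single pass: every match is found in the original text only,
# so 'xc' never occurs and only 'b' is replaced.
table = {'b': 'x', 'xc': 'z'}
pattern = re.compile('|'.join(re.escape(k) for k in table))
single_pass = pattern.sub(lambda m: table[m.group(0)], text)

print(chained)      # az
print(single_pass)  # axc
```

If your replacements depend on each other like the chained case, a
single-pass approach will give different results.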

If not, you could split the string on the "befores", walk through the
resulting list substituting the corresponding "afters", and join the
result, e.g.,

An example source:

 >>> s = """\
 ... before1, before2 plain stuff
 ... and before3 and before4, and
 ... some more plain stuff.
 ... """
 >>> print s
 before1, before2 plain stuff
 and before3 and before4, and
 some more plain stuff.

Regex to split out befores:
 >>> import re
 >>> rxo = re.compile(r'(before1|before2|before3|before4)')

The parens make split() keep the matches, which land at the odd indices:
 >>> rxo.split(s)
 ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']

A dict to look up substitutions:
 >>> subdict = dict([('before'+x, 'after'+x) for x in '1234'])
 >>> subdict
 {'before4': 'after4', 'before1': 'after1', 'before2': 'after2', 'before3': 'after3'}
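
With about 150 replacements you probably would not want to type the
alternation by hand. A possible sketch (Python 3 syntax; build_splitter is
a hypothetical helper name, and it assumes the "befores" are literal strings
rather than regex patterns) that derives the regex from the dict keys:

```python
import re

def build_splitter(subdict):
    # Try longer keys first: alternation takes the first branch that
    # matches, so 'a' listed before 'ab' would shadow 'ab'.
    keys = sorted(subdict, key=len, reverse=True)
    # re.escape protects keys that contain regex metacharacters.
    # The outer parens make split() keep the matched keys.
    return re.compile('(' + '|'.join(re.escape(k) for k in keys) + ')')

subdict = {'before' + x: 'after' + x for x in '1234'}
rxo = build_splitter(subdict)
parts = rxo.split('before1, before2 plain')
print(parts)  # ['', 'before1', ', ', 'before2', ' plain']
```

If some of your "befores" are genuine regex patterns, the dict lookup on
the matched text no longer works directly and this would need a different
design.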

As above, but rebind s to the resulting list, so we can do substitutions on the odd elements:
 >>> s = rxo.split(s)
 >>> s
 ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']

Do the substitution:
 >>> for i in xrange(1,len(s),2): s[i] = subdict[s[i]]
 ...
 >>> s
 ['', 'after1', ', ', 'after2', ' plain stuff\nand ', 'after3', ' and ', 'after4', ', and\nsome more plain stuff.\n']

Join into single string:
 >>> ''.join(s)
 'after1, after2 plain stuff\nand after3 and after4, and\nsome more plain stuff.\n'

Print it to see in original format:
 >>> print ''.join(s)
 after1, after2 plain stuff
 and after3 and after4, and
 some more plain stuff.

You could easily wrap this in a function, of course.
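
One way such a wrapper might look, as a minimal sketch in Python 3 syntax
(multi_replace is a hypothetical name; it assumes non-recursive
substitutions and literal-string "befores"):

```python
import re

def multi_replace(text, subdict):
    """Replace each key of subdict found in text with its value, in one pass."""
    if not subdict:
        return text  # nothing to replace; avoid compiling an empty pattern
    # Longest keys first so a shorter key cannot shadow a longer one.
    keys = sorted(subdict, key=len, reverse=True)
    rxo = re.compile('(' + '|'.join(re.escape(k) for k in keys) + ')')
    parts = rxo.split(text)
    # The matched keys sit at the odd indices; swap in their replacements.
    for i in range(1, len(parts), 2):
        parts[i] = subdict[parts[i]]
    return ''.join(parts)

s = """\
before1, before2 plain stuff
and before3 and before4, and
some more plain stuff.
"""
subdict = {'before' + x: 'after' + x for x in '1234'}
result = multi_replace(s, subdict)
print(result)
```

If the same set of replacements is applied to many files, compiling the
regex once outside the function and passing it in would avoid rebuilding
it on every call.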

Regards,
Bengt Richter