doing hundreds of re.subs efficiently on large strings

Wed Mar 26 23:45:22 EST 2003

Bengt Richter wrote:
> On Tue, 25 Mar 2003 21:46:04 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:
> 
> 
>>I have a simple application that converts a file in one format to a file 
>>in another format using string.replace and re.sub. There are about 150 
>>replacements occuring on files that are on the order of a few KB.  Right 
>>now, I represent the file contents as a string, and naively use 
>>string.replace and re.sub to replace all the substrings that I need to 
>>replace. I knew that I would have to come up with a faster, more memory 
>>friendly solution eventually, but I just wanted to implement the basic 
>>idea as simply as possible at first.
>>
>>Now I am at a loss as to how to proceed. I need a stringbuffer 
>>equivalent that supports at least re.sub. Two alternatives that I have 
>>seen for StringBuffer are using lists and then joining into a string at 
>>the end, or using an array. Neither of these seems to be of much use to 
>>me, since I am doing more than just appending to the end of the string. 
>>Is there another pre-existing alternative that I'm overlooking? Or has 
>>anybody come up with a good solution for this issue?
>>
> 
> Is your substitution recursive? I.e., does each substitution operate on
> the finished result of a previous substitution?  As in
> 
>  >>> 'abc'.replace('b','x').replace('xc','z')
>  'az'

Yes, this is the case. The result of one replace is the input for the 
next. Is it better to cascade them like this? Perhaps I am 
misunderstanding what happens with the cascading calls, but I thought 
that it would be the same as doing it the obvious way except that the 
intermediate strings are unable to be referenced when they are cascaded 
togehter.

Thanks for the really nice solution below. I will adapt my code and 
handle the regular expression substitutions as you have suggested.

cheers,

nihilo

> If not, you could split on the befores and then walk through the list
> and substitute corresponding afters and join the result, e.g.,
> 
> An example source:
> 
>  >>> s = """\
>  ... before1, before2 plain stuff
>  ... and before3 and before4, and
>  ... some more plain stuff.
>  ... """
>  >>> print s
>  before1, before2 plain stuff
>  and before3 and before4, and
>  some more plain stuff.
> 
> Regex to split out befores:
>  >>> import re
>  >>> rxo = re.compile(r'(before1|before2|before3|before4)')
> 
> The parens retain the matches as the odd indexed items:
>  >>> rxo.split(s)
>  ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
> 
> A dict to look up substitutions:
>  >>> subdict = dict([('before'+x, 'after'+x) for x in '1234'])
>  >>> subdict
>  {'before4': 'after4', 'before1': 'after1', 'before2': 'after2', 'before3': 'after3'}
> 
> As above, but bind to s, so we can do substitutions on the odd elements:
>  >>> s = rxo.split(s)
>  >>> s
>  ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
> 
> Do the substitution:
>  >>> for i in xrange(1,len(s),2): s[i] = subdict[s[i]]
>  ...
>  >>> s
>  ['', 'after1', ', ', 'after2', ' plain stuff\nand ', 'after3', ' and ', 'after4', ', and\nsome more plain stuff.\n']
> 
> Join into single string:
>  >>> ''.join(s)
>  'after1, after2 plain stuff\nand after3 and after4, and\nsome more plain stuff.\n'
> 
> Print it to see in original format:
>  >>> print ''.join(s)
>  after1, after2 plain stuff
>  and after3 and after4, and
>  some more plain stuff.
> 
> You could easily wrap this in a function, of course.
> 
> Regards,
> Bengt Richter