doing hundreds of re.subs efficiently on large strings

Bengt Richter bokr at oz.net
Sat Mar 29 19:12:51 EST 2003


On Thu, 27 Mar 2003 21:45:16 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:

>Bengt Richter wrote:
>> On Tue, 25 Mar 2003 21:46:04 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:
><snip/>
>> If not, you could split on the befores and then walk through the list
>> and substitute corresponding afters and join the result, e.g.,
>> 
>> An example source:
>> 
>>  >>> s = """\
>>  ... before1, before2 plain stuff
>>  ... and before3 and before4, and
>>  ... some more plain stuff.
>>  ... """
>>  >>> print s
>>  before1, before2 plain stuff
>>  and before3 and before4, and
>>  some more plain stuff.
>> 
>> Regex to split out befores:
>>  >>> import re
>>  >>> rxo = re.compile(r'(before1|before2|before3|before4)')
>> 
>> The parens retain the matches as the odd indexed items:
>>  >>> rxo.split(s)
>>  ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
>> 
>> A dict to look up substitutions:
>>  >>> subdict = dict([('before'+x, 'after'+x) for x in '1234'])
>>  >>> subdict
>>  {'before4': 'after4', 'before1': 'after1', 'before2': 'after2', 'before3': 'after3'}
>> 
>> As above, but bind to s, so we can do substitutions on the odd elements:
>>  >>> s = rxo.split(s)
>>  >>> s
>>  ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
>> 
>> Do the substitution:
>>  >>> for i in xrange(1,len(s),2): s[i] = subdict[s[i]]
>>  ...
>>  >>> s
>>  ['', 'after1', ', ', 'after2', ' plain stuff\nand ', 'after3', ' and ', 'after4', ', and\nsome more plain stuff.\n']
>> 
>> Join into single string:
>>  >>> ''.join(s)
>>  'after1, after2 plain stuff\nand after3 and after4, and\nsome more plain stuff.\n'
>> 
>> Print it to see in original format:
>>  >>> print ''.join(s)
>>  after1, after2 plain stuff
>>  and after3 and after4, and
>>  some more plain stuff.
>> 
>> You could easily wrap this in a function, of course.
>> 
>> Regards,
>> Bengt Richter
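(Aside: the quoted steps wrap naturally into a helper function. Here is a
minimal sketch in current Python — the name multi_replace is illustrative,
and sorting the keys longest-first is an extra safeguard not in the
original walkthrough:)

```python
import re

def multi_replace(text, mapping):
    # One alternation of all the "befores"; re.escape guards against
    # regex metacharacters in the keys.  Sorting longest-first keeps a
    # key that is a prefix of another key from shadowing it.
    keys = sorted(mapping, key=len, reverse=True)
    rxo = re.compile('(' + '|'.join(map(re.escape, keys)) + ')')
    parts = rxo.split(text)
    # The capturing group puts every match at the odd indices.
    for i in range(1, len(parts), 2):
        parts[i] = mapping[parts[i]]
    return ''.join(parts)
```

For example, multi_replace('before1, before2 plain stuff',
{'before1': 'after1', 'before2': 'after2'}) gives
'after1, after2 plain stuff'.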
>
>I finally got around to actually testing the time for each of these 
>approaches. Method 1 was a bunch of string.replaces, and method 2 
>compiled everything into one big regex, split the string, and used a 
>dictionary to look up the substitution for each of the odd-numbered 
>items in the list (the matched terms). The results were very 
>surprising: string.replace averaged .25 seconds, while the method 
>outlined above averaged .43 seconds!
>
>The bottleneck in my code turned out to be the regular expressions, 
>which take almost a second and a half in total. Since they all contain 
>groups, I think I'm stuck with what I have at present, though I'm 
>wondering whether it is worthwhile to precompile all the expressions and 
>use cPickle to load them from a file.
>
>Anyway, thanks again for the help. Your solution is more elegant than a 
>hundred and fifty replaces, and better memory-wise too, even if it is a 
>bit slower.
>
Thanks for testing, that is interesting. It would be good to see the
test code you ran, so we could see the reasons for .25 vs .43 (and
I could revise my reality-model as necessary ;-)
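For what it's worth, a comparison along those lines can be made
reproducible with the timeit module. The following is only a sketch with
made-up sample data standing in for the real benchmark, which wasn't
posted:

```python
import re
import timeit

# Hypothetical substitution table and document (stand-ins for the
# actual 150-term data).
subdict = {'before_' + c: 'after_' + c for c in 'abcdefgh'}
text = (' plain stuff '.join(subdict) + '\n') * 200

def by_replace():
    # Method 1: one str.replace call per pair.
    s = text
    for old, new in subdict.items():
        s = s.replace(old, new)
    return s

rxo = re.compile('(' + '|'.join(map(re.escape, subdict)) + ')')

def by_split():
    # Method 2: split on one big regex, substitute the odd items.
    parts = rxo.split(text)
    for i in range(1, len(parts), 2):
        parts[i] = subdict[parts[i]]
    return ''.join(parts)

assert by_replace() == by_split()
print('replace:', timeit.timeit(by_replace, number=100))
print('split:  ', timeit.timeit(by_split, number=100))
```

Which method wins will depend on the number of terms and the size of the
input, so timings on the real data would still be the interesting part.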

Regards,
Bengt Richter



