doing hundreds of re.subs efficiently on large strings
Bengt Richter
bokr at oz.net
Sat Mar 29 19:12:51 EST 2003
On Thu, 27 Mar 2003 21:45:16 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:
>Bengt Richter wrote:
>> On Tue, 25 Mar 2003 21:46:04 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:
><snip/>
>> If not, you could split on the befores and then walk through the list
>> and substitute corresponding afters and join the result, e.g.,
>>
>> An example source:
>>
>> >>> s = """\
>> ... before1, before2 plain stuff
>> ... and before3 and before4, and
>> ... some more plain stuff.
>> ... """
>> >>> print s
>> before1, before2 plain stuff
>> and before3 and before4, and
>> some more plain stuff.
>>
>> Regex to split out befores:
>> >>> import re
>> >>> rxo = re.compile(r'(before1|before2|before3|before4)')
>>
>> The parens retain the matches as the odd indexed items:
>> >>> rxo.split(s)
>> ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
>>
>> A dict to look up substitutions:
>> >>> subdict = dict([('before'+x, 'after'+x) for x in '1234'])
>> >>> subdict
>> {'before4': 'after4', 'before1': 'after1', 'before2': 'after2', 'before3': 'after3'}
>>
>> As above, but bind to s, so we can do substitutions on the odd elements:
>> >>> s = rxo.split(s)
>> >>> s
>> ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
>>
>> Do the substitution:
>> >>> for i in xrange(1,len(s),2): s[i] = subdict[s[i]]
>> ...
>> >>> s
>> ['', 'after1', ', ', 'after2', ' plain stuff\nand ', 'after3', ' and ', 'after4', ', and\nsome more plain stuff.\n']
>>
>> Join into single string:
>> >>> ''.join(s)
>> 'after1, after2 plain stuff\nand after3 and after4, and\nsome more plain stuff.\n'
>>
>> Print it to see in original format:
>> >>> print ''.join(s)
>> after1, after2 plain stuff
>> and after3 and after4, and
>> some more plain stuff.
>>
>> You could easily wrap this in a function, of course.
>>
>> Regards,
>> Bengt Richter
>
>I finally got around to actually testing the time for each of these
>approaches. Method 1 was with a bunch of string.replaces, and method 2
>was compiling everything into one big regex, splitting the string, and
>using a dictionary for lookup of the term to substitute for all the
>odd-numbered items in the list (the matched terms). The results were
>very surprising. string.replace averaged .25 seconds, while the method
>outlined above averaged .43 seconds!
>
>The bottleneck in my code turned out to be the regular expressions,
>which take almost a second and a half in total. Since they all contain
>groups, I think I'm stuck with what I have at present, though I'm
>wondering whether it is worthwhile to precompile all the expressions and
>use cPickle to load them from a file.
>
>Anyway, thanks again for the help. Your solution is more elegant than a
>hundred and fifty replaces, and better memory-wise too, even if it is a
>bit slower.
>
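[The split/lookup/join technique quoted above can also be expressed as a single re.sub with a callable replacement, which avoids building and rejoining the intermediate list. A minimal sketch in present-day Python 3; the names `subdict`, `pattern`, and `replace_all` are illustrative, not from the original post:]

```python
import re

# Mapping of literal search terms to replacements (example data from the post).
subdict = {'before%d' % i: 'after%d' % i for i in range(1, 5)}

# Build one alternation from the keys. re.escape guards against regex
# metacharacters in the keys; sorting longest-first keeps a shorter key
# from matching where a longer one should.
pattern = re.compile('|'.join(
    re.escape(k) for k in sorted(subdict, key=len, reverse=True)))

def replace_all(text):
    # re.sub calls the lambda once per match; m.group(0) is the matched
    # key, and the dict lookup supplies its replacement.
    return pattern.sub(lambda m: subdict[m.group(0)], text)

s = ("before1, before2 plain stuff\n"
     "and before3 and before4, and\n"
     "some more plain stuff.\n")
print(replace_all(s))
```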
Thanks for testing. It would be interesting to see the test code you ran,
so we could see the reasons for .25 vs .43 (and I could revise my
reality-model as necessary ;-)
Regards,
Bengt Richter
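[Wrapped in a function as the quoted message suggests, the whole technique fits in a few lines. A sketch in present-day Python 3; the name `multi_replace` is my own, not from the post:]

```python
import re

def multi_replace(text, subdict):
    """Replace every key of subdict found in text with its value,
    using the split / look up / join technique from the post."""
    # Parenthesized alternation, so re.split keeps the matched keys.
    rxo = re.compile('(%s)' % '|'.join(re.escape(k) for k in subdict))
    parts = rxo.split(text)
    # The captured matches land at the odd indices of the split list.
    for i in range(1, len(parts), 2):
        parts[i] = subdict[parts[i]]
    return ''.join(parts)

subs = dict(('before' + x, 'after' + x) for x in '1234')
print(multi_replace("before1 and before2", subs))
```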