doing hundreds of re.subs efficiently on large strings
Bengt Richter
bokr at oz.net
Tue Mar 25 20:02:01 EST 2003
On Tue, 25 Mar 2003 21:46:04 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:
>I have a simple application that converts a file in one format to a file
>in another format using string.replace and re.sub. There are about 150
>replacements occuring on files that are on the order of a few KB. Right
>now, I represent the file contents as a string, and naively use
>string.replace and re.sub to replace all the substrings that I need to
>replace. I knew that I would have to come up with a faster, more memory
>friendly solution eventually, but I just wanted to implement the basic
>idea as simply as possible at first.
>
>Now I am at a loss as to how to proceed. I need a stringbuffer
>equivalent that supports at least re.sub. Two alternatives that I have
>seen for StringBuffer are using lists and then joining into a string at
>the end, or using an array. Neither of these seems to be of much use to
>me, since I am doing more than just appending to the end of the string.
>Is there another pre-existing alternative that I'm overlooking? Or has
>anybody come up with a good solution for this issue?
>
Is your substitution recursive? I.e., does each substitution operate on
the finished result of a previous substitution? As in
>>> 'abc'.replace('b','x').replace('xc','z')
'az'
If not, you could split on the befores and then walk through the list
and substitute corresponding afters and join the result, e.g.,
An example source:
>>> s = """\
... before1, before2 plain stuff
... and before3 and before4, and
... some more plain stuff.
... """
>>> print s
before1, before2 plain stuff
and before3 and before4, and
some more plain stuff.
Regex to split out befores:
>>> import re
>>> rxo = re.compile(r'(before1|before2|before3|before4)')
The parens retain the matches as the odd indexed items:
>>> rxo.split(s)
['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
A dict to look up substitutions:
>>> subdict = dict([('before'+x, 'after'+x) for x in '1234'])
>>> subdict
{'before4': 'after4', 'before1': 'after1', 'before2': 'after2', 'before3': 'after3'}
As above, but bind to s, so we can do substitutions on the odd elements:
>>> s = rxo.split(s)
>>> s
['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
Do the substitution:
>>> for i in xrange(1,len(s),2): s[i] = subdict[s[i]]
...
>>> s
['', 'after1', ', ', 'after2', ' plain stuff\nand ', 'after3', ' and ', 'after4', ', and\nsome more plain stuff.\n']
Join into single string:
>>> ''.join(s)
'after1, after2 plain stuff\nand after3 and after4, and\nsome more plain stuff.\n'
Print it to see in original format:
>>> print ''.join(s)
after1, after2 plain stuff
and after3 and after4, and
some more plain stuff.
You could easily wrap this in a function, of course.
Regards,
Bengt Richter
More information about the Python-list
mailing list