doing hundreds of re.subs efficiently on large strings

Anders J. Munch andersjm at dancontrol.dk
Wed Mar 26 06:13:52 EST 2003


"nihilo" <exnihilo at NOmyrealCAPSbox.com> wrote:
> I have a simple application that converts a file in one format to a file 
> in another format using string.replace and re.sub. There are about 150 
> replacements occurring on files that are on the order of a few KB.  Right 
> now, I represent the file contents as a string, and naively use 
> string.replace and re.sub to replace all the substrings that I need to 
> replace. I knew that I would have to come up with a faster, more memory 
> friendly solution eventually, but I just wanted to implement the basic 
> idea as simply as possible at first.
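
(For reference, I imagine the naive version looks roughly like the code
below.  The rule lists are invented stand-ins, not your actual
conversions.)

import re

# Invented stand-ins for the ~150 real replacements.
plain_rules = [('&', '&amp;'), ('<', '&lt;')]            # str.replace pairs
regex_rules = [(r'\r\n?', '\n'), (r'[ \t]+\n', '\n')]    # re.sub pairs

def convert(text):
    # Each replacement scans the whole string, so ~150 rules means
    # ~150 full passes over the file contents.
    for old, new in plain_rules:
        text = text.replace(old, new)
    for pattern, repl in regex_rules:
        text = re.sub(pattern, repl, text)
    return text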

Hmm, 150 replacements on a few-KB text, that shouldn't take long.  Why
do you need it to be faster and more memory-friendly?  Are there
realtime constraints?  Do you have a lot of these files that you need
to process in batch?

<plug>
If so, it might be worth taking a look at my Msub
http://home.inbound.dk/~andersjm/provide/msub/
in which all 150 replacements can be done efficiently in parallel.
</plug>

Staying in Python, and assuming that there are no overlaps between
matches, here's a way to do a single-pass replace of all your
regexps.

Combine your 150 regexps into one big regexp.  (Say regexps A, B and C
combine into (A)|(B)|(C); you get the picture.)  Then use re.finditer
(assuming Python version >= 2.2) to locate the matching substrings one
by one, as well as the intermediate unmatched parts.  Which component
was matched can be identified by looking at the position of the first
non-None entry in groups().  Collect the output (unmatched
parts + replacement strings) in a list of string fragments which you
''.join into one big string at the end.
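
As a rough sketch of that recipe (untested against your data, with
three made-up pattern/replacement pairs standing in for the real 150,
and assuming the individual patterns contain no capturing groups of
their own, so that groups() has exactly one entry per rule):

import re

# Made-up pattern/replacement pairs; substitute the real 150.
rules = [
    (r'cat',               'feline'),
    (r'dog\w*',            'canine'),
    (r'\d{4}-\d{2}-\d{2}', '<date>'),
]

# Combine into one big regexp: (A)|(B)|(C)
big = re.compile('|'.join(['(%s)' % pat for pat, repl in rules]))

def replace_all(text):
    fragments = []    # unmatched parts + replacement strings
    pos = 0
    for m in big.finditer(text):
        fragments.append(text[pos:m.start()])    # unmatched part before the hit
        # The position of the first non-None entry in groups() tells
        # which of the alternatives matched.
        groups = m.groups()
        for i in range(len(groups)):
            if groups[i] is not None:
                break
        fragments.append(rules[i][1])            # the corresponding replacement
        pos = m.end()
    fragments.append(text[pos:])                 # trailing unmatched part
    return ''.join(fragments)

print(replace_all('the dog chased a cat on 2003-03-26'))
# -> the canine chased a feline on <date>

Building the output in a list and joining once at the end avoids
copying the whole string over and over.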

- Anders