doing hundreds of re.subs efficiently on large strings

nihilo exnihilo at NOmyrealCAPSbox.com
Thu Mar 27 00:50:16 EST 2003


Anders J. Munch wrote:
> "nihilo" <exnihilo at NOmyrealCAPSbox.com> wrote:
> 
>>I have a simple application that converts a file in one format to a file 
>>in another format using string.replace and re.sub. There are about 150 
>replacements occurring on files that are on the order of a few KB.  Right 
>>now, I represent the file contents as a string, and naively use 
>>string.replace and re.sub to replace all the substrings that I need to 
>>replace. I knew that I would have to come up with a faster, more memory 
>>friendly solution eventually, but I just wanted to implement the basic 
>>idea as simply as possible at first.
> 
> 
> Hmm, 150 replacements on a few-KB text, that shouldn't take long.  Why
> do you need it to be faster and more memory-friendly?  Are there
> realtime constraints?  Do you have a lot of these files that you need
> to process in batch?

Actually, I'm apparently really bad at estimating. I just checked a 
couple of files, and they are from 20 KB to 60 KB, and there are 
something like 200 replaces and perhaps 100 regular expression 
substitutions, some of which are pretty complex. There are realtime 
constraints, since the application basically downloads a file and 
processes it while a user is waiting. It already takes a second or so to 
download the file, and the processing takes from a couple to six or 
seven seconds, which is already too much for impatient users like me ;-)

> <plug>
> If so, it might be worth taking a look at my Msub
> http://home.inbound.dk/~andersjm/provide/msub/
> in which all 150 replacements can be done efficiently in parallel.
> </plug>

Thanks, I checked out your application. It looks great, but I'm only 
doing one file at a time, and I would like to stick with a purely Python 
solution.


> Staying in Python, and assuming that there are no overlaps between
> matches, here's a way to do single-pass replace of all your
> regexps:
> Combine your 150 regexps into one big regexp.  (Say regexps A,B and C
> combine into (A)|(B)|(C), you get the picture.)  Then use re.finditer
> (assuming Python version >=2.2) to locate the matching substrings one
> by one, as well as the intermediate unmatched parts.  

How do I use finditer to find the unmatched parts also (the stuff 
between the matches)? It looks to me like it just gives you access to 
the stuff that is matched, but maybe I'm missing something.
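If I understand correctly, something like this would recover the text 
between matches by tracking match.start() and match.end() -- is that the 
idea? (The pattern and input here are made up, just to illustrate.)

```python
import re

# Toy example: two alternatives combined into one pattern.
pattern = re.compile(r"(foo)|(bar)")
text = "a foo b bar c"

pieces = []
pos = 0  # end of the previous match
for m in pattern.finditer(text):
    pieces.append(text[pos:m.start()])  # unmatched stretch before this match
    pieces.append("<match>")            # the replacement would go here
    pos = m.end()
pieces.append(text[pos:])               # trailing unmatched text

result = "".join(pieces)
# result == "a <match> b <match> c"
```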

> Which component
> was matched can be identified by looking at the position of the first
> non-None entry in groups().  Collect the output (unmatched
> parts+replacement strings) in a list of string fragments which you
> ''.join into one big string at the end.
> 
> - Anders

I'm sorry, but I'm still not sure how you're suggesting the replacement 
should occur. After I have the iterator, what would I do with each of 
the match objects it yields?
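To make my confusion concrete, here is my best guess at what you mean, 
with made-up patterns and replacement strings -- using m.lastindex to 
find which alternative matched. Is this the idea?

```python
import re

# Made-up rules, just for illustration. Each pattern gets its own
# group so we can tell which alternative matched. Note: this only
# works if the patterns contain no capturing groups of their own.
rules = [
    (r"cat", "feline"),
    (r"dog", "canine"),
    (r"\d+", "NUMBER"),
]
combined = re.compile("|".join("(%s)" % p for p, _ in rules))
replacements = [r for _, r in rules]

def replace_all(text):
    pieces = []
    pos = 0
    for m in combined.finditer(text):
        pieces.append(text[pos:m.start()])      # unmatched part
        # m.lastindex is the 1-based number of the group that matched
        pieces.append(replacements[m.lastindex - 1])
        pos = m.end()
    pieces.append(text[pos:])                   # tail after last match
    return "".join(pieces)

print(replace_all("my cat saw 2 dogs"))
# -> my feline saw NUMBER canines
```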

Sorry for my confusion, but I'm pretty new to Python and regular 
expressions.

thanks,

nihilo
