doing hundreds of re.subs efficiently on large strings
nihilo
exnihilo at NOmyrealCAPSbox.com
Thu Mar 27 00:50:16 EST 2003
Anders J. Munch wrote:
> "nihilo" <exnihilo at NOmyrealCAPSbox.com> wrote:
>
>>I have a simple application that converts a file in one format to a file
>>in another format using string.replace and re.sub. There are about 150
>>replacements occurring on files that are on the order of a few KB. Right
>>now, I represent the file contents as a string, and naively use
>>string.replace and re.sub to replace all the substrings that I need to
>>replace. I knew that I would have to come up with a faster, more memory
>>friendly solution eventually, but I just wanted to implement the basic
>>idea as simply as possible at first.
>
>
> Hmm, 150 replacements on a few-KB text, that shouldn't take long. Why
> do you need it to be faster and more memory-friendly? Are there
> realtime constraints? Do you have a lot of these files that you need
> to process in batch?
Actually, I'm apparently really bad at estimating. I just checked a
couple of files, and they are from 20 KB to 60 KB, and there are
something like 200 replaces and perhaps 100 regular expression
substitutions, some of which are pretty complex. There are realtime
constraints, since the application basically downloads a file and
processes it while a user is waiting. It already takes a second or so to
download the file, and the processing takes from a couple to six or
seven seconds, which is already too much for impatient users like me ;-)
> <plug>
> If so, it might be worth taking a look at my Msub
> http://home.inbound.dk/~andersjm/provide/msub/
> in which all 150 replacements can be done efficiently in parallel.
> </plug>
Thanks, I checked out your application. It looks great, but I am only doing
one file at a time, and I would like to stick with a pure Python solution.
> Staying in Python, and assuming that there are no overlaps between
> matches, here's a way to do single-pass replace of all your
> regexps:
> Combine your 150 regexps into one big regexp. (Say regexps A,B and C
> combine into (A)|(B)|(C), you get the picture.) Then use re.finditer
> (assuming Python version >=2.2) to locate the matching substrings one
> by one, as well as the intermediate unmatched parts.
How do I use finditer to find the unmatched parts also (the stuff
between the matches)? It looks to me like it just gives you access to
the stuff that is matched, but maybe I'm missing something.
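
For reference, a minimal sketch of how the unmatched stretches can be recovered from finditer: each match object reports its own start() and end() offsets, so the text between the end of one match and the start of the next is the unmatched part. The function name and the sample pattern are made up for illustration, and the code is written for a current Python rather than 2.2.

```python
import re

def split_matched_and_unmatched(pattern, text):
    """Return (is_match, fragment) pairs that together cover all of text."""
    pieces = []
    pos = 0  # offset where the previous match ended
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            # Text between the previous match and this one is unmatched.
            pieces.append((False, text[pos:m.start()]))
        pieces.append((True, m.group(0)))
        pos = m.end()
    if pos < len(text):
        # Whatever follows the last match is unmatched too.
        pieces.append((False, text[pos:]))
    return pieces

pieces = split_matched_and_unmatched(r"\d+", "ab12cd345e")
```

Joining the fragments in order gives back the original string, which is an easy sanity check.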
> Which component
> was matched can be identified by looking at the position of the first
> non-None entry in groups(). Collect the output (unmatched
> parts+replacement strings) in a list of string fragments which you
> ''.join into one big string at the end.
>
> - Anders
I am sorry but I am still not sure how you are suggesting the
replacement occur. After I have the iterator, what would I do with each
of the match objects in the iterator?
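
One possible reading of the suggestion, as a rough sketch only: wrap each pattern in its own group, join them with |, walk the matches once, and append either the unmatched text or the replacement for whichever group fired. The rules below are hypothetical placeholders, not the real ones from my converter. The sketch assumes the individual patterns contain no capturing groups of their own (use (?:...) inside them), that replacements are literal strings with no backreferences, and that matches do not overlap. It uses m.lastindex instead of scanning groups() for the first non-None entry; under those assumptions the two give the same answer.

```python
import re

# Hypothetical (pattern, replacement) rules, for illustration only.
RULES = [
    (r"<b>", "*"),
    (r"</b>", "*"),
    (r"\s+\n", "\n"),
]

def multi_sub(rules, text):
    """Apply all rules in a single left-to-right pass over text."""
    # Rule i becomes capturing group i + 1 in the combined pattern.
    combined = re.compile("|".join("(%s)" % pat for pat, _ in rules))
    out = []
    pos = 0  # offset where the previous match ended
    for m in combined.finditer(text):
        out.append(text[pos:m.start()])       # unmatched text before the match
        out.append(rules[m.lastindex - 1][1])  # replacement for the group that matched
        pos = m.end()
    out.append(text[pos:])                     # trailing unmatched text
    return "".join(out)

result = multi_sub(RULES, "<b>hi</b>  \nx")
```

When two patterns can match at the same position, the one listed first wins, so the order of the rules matters.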
Sorry for my confusion, but I'm pretty new to Python and regular
expressions.
thanks,
nihilo