Regular expression help
Fredrik Lundh
fredrik at pythonware.com
Thu Jul 17 02:44:50 EDT 2003
David Lees wrote:
> I forget how to find multiple instances of stuff between tags using
> regular expressions. Specifically I want to find all the text between a
> series of begin/end pairs in a multiline file.
>
> I tried:
> >>> p = 'begin(.*)end'
> >>> m = re.search(p,s,re.DOTALL)
>
> and got everything between the first begin and last end. I guess
> because of a greedy match. What I want to do is a list where each
> element is the text between another begin/end pair.

people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of
backtracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).
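
for reference, the non-greedy version people usually suggest looks
roughly like this. it's only a sketch: the sample string is made up
for illustration and isn't from the original question.

```python
import re

# hypothetical sample input, made up for illustration
sample = "begin one end junk begin two\nlines end"

# ".*?" is the non-greedy form of ".*"; re.DOTALL lets "." match
# newlines, so a chunk may span several lines
chunks = re.findall(r"begin(.*?)end", sample, re.DOTALL)
```

findall returns one list element per begin/end pair, which is the
result asked for; the concern above is about what it costs on large
inputs, not about the result.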

a better approach is to do two searches: first search for a "begin",
and once you've found that, look for an "end":

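    import re

    # "text" is the string to scan; "process" is whatever you want
    # to do with each chunk

    pos = 0

    START = re.compile("begin")
    END = re.compile("end")

    while True:
        m = START.search(text, pos)
        if not m:
            break  # no more "begin" markers
        start = m.end()  # chunk starts right after "begin"
        m = END.search(text, start)
        if not m:
            break  # "begin" without a matching "end"
        end = m.start()  # chunk stops right before "end"
        process(text[start:end])
        pos = m.end()  # move forward
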
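to get the list the question asks for, the same loop can be wrapped
in a generator. this is a minimal sketch: the name "between" and the
sample string are made up here, not part of the code above.

```python
import re

START = re.compile("begin")
END = re.compile("end")

def between(text):
    # hypothetical helper: yield each chunk of text found between
    # a "begin" and the next "end" that follows it
    pos = 0
    while True:
        m = START.search(text, pos)
        if not m:
            break
        start = m.end()
        m = END.search(text, start)
        if not m:
            break
        yield text[start:m.start()]
        pos = m.end()  # continue after this "end"

# hypothetical sample input, made up for illustration
sample = "begin one end junk begin two\nlines end"
chunks = list(between(sample))
```

since each search starts at an explicit position, no DOTALL flag is
needed; chunks that span several lines are just slices of the string.
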
at this point, it's also obvious that you don't really have to use
regular expressions:
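
    pos = 0

    while True:
        start = text.find("begin", pos)
        if start < 0:
            break  # no more "begin" markers
        start += 5  # skip over "begin" (5 characters)
        end = text.find("end", start)
        if end < 0:
            break  # "begin" without a matching "end"
        process(text[start:end])
        pos = end + 3  # move forward, past the "end"
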
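the hardcoded 5 only works for the literal "begin". a sketch of the
same idea with the marker lengths computed, collecting the chunks in
a list; the function name "between_tags" and its parameters are
assumptions for illustration, not part of the code above.

```python
def between_tags(text, begin="begin", end="end"):
    # hypothetical helper: return the text between each begin/end pair
    result = []
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start < 0:
            break
        start += len(begin)  # skip over the begin marker
        stop = text.find(end, start)
        if stop < 0:
            break
        result.append(text[start:stop])
        pos = stop + len(end)  # move forward, past the end marker
    return result

chunks = between_tags("begin a end begin b end")
```
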
</F>
<!-- (the eff-bot guide to) the python standard library (redux):
http://effbot.org/zone/librarybook-index.htm
-->