Regular expression help

Fredrik Lundh fredrik at pythonware.com
Thu Jul 17 02:44:50 EDT 2003


David Lees wrote:

> I forget how to find multiple instances of stuff between tags using
> regular expressions.  Specifically I want to find all the text between a
> series of begin/end pairs in a multiline file.
>
> I tried:
>  >>> p = 'begin(.*)end'
>  >>> m = re.search(p,s,re.DOTALL)
>
> and got everything between the first begin and last end.  I guess
> because of a greedy match.  What I want to do is a list where each
> element is the text between another begin/end pair.

people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of
backtracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).
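
for reference, here's a small sketch of the difference between the two
pattern styles on a toy string (the sample text is made up for
illustration).  the non-greedy form does return the right pieces on a
small input like this; the concern above is about how much work the
engine has to do on large ones.

```python
import re

# made-up sample text, just for illustration
s = "begin one end junk begin two\nlines end"

# greedy: runs from the first "begin" all the way to the last "end"
greedy = re.findall("begin(.*)end", s, re.DOTALL)

# non-greedy: stops at the first "end" after each "begin"
lazy = re.findall("begin(.*?)end", s, re.DOTALL)

print(greedy)
print(lazy)
```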

a better approach is to do two searches: first search for a "begin",
and once you've found that, look for an "end":

    import re

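    # text is the string to scan; process() stands for whatever
    # you want to do with each chunk
    pos = 0

    START = re.compile("begin")
    END = re.compile("end")

    while True:
        m = START.search(text, pos)
        if not m:
            break # no more begin markers
        start = m.end() # chunk starts right after "begin"
        m = END.search(text, start)
        if not m:
            break # begin without a matching end
        end = m.start() # chunk stops right before "end"
        process(text[start:end])
        pos = m.end() # move forward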

at this point, it's also obvious that you don't really have to use
regular expressions:

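    pos = 0

    while True:
        start = text.find("begin", pos)
        if start < 0:
            break # no more begin markers
        start += len("begin") # skip past the marker itself
        end = text.find("end", start)
        if end < 0:
            break # begin without a matching end
        process(text[start:end])
        pos = end + len("end") # move forward, past the end marker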
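
since the original question asked for a list, the same loop can be
wrapped in a generator.  this is only a sketch: iter_between and the
sample text are hypothetical names made up for this example, and it
assumes the markers are plain strings that don't nest.

```python
def iter_between(text, begin="begin", end="end"):
    """Yield each chunk of text found between a begin/end pair."""
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start < 0:
            return  # no more begin markers
        start += len(begin)
        stop = text.find(end, start)
        if stop < 0:
            return  # begin without a matching end; ignore it
        yield text[start:stop]
        pos = stop + len(end)  # continue after the end marker


# made-up sample text; the last "begin" has no matching "end"
sample = "begin one end junk begin two\nlines end begin unterminated"
chunks = list(iter_between(sample))
print(chunks)
```

the unterminated trailing block is silently dropped here; raising an
exception at that point would be an equally reasonable choice.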

</F>

<!-- (the eff-bot guide to) the python standard library (redux):
http://effbot.org/zone/librarybook-index.htm
-->
