Regular expression help

Thu Jul 17 17:15:00 EDT 2003

On Thu, 17 Jul 2003 08:44:50 +0200, "Fredrik Lundh" <fredrik at pythonware.com> wrote:

>David Lees wrote:
>
>> I forget how to find multiple instances of stuff between tags using
>> regular expressions.  Specifically I want to find all the text between a
>> series of begin/end pairs in a multiline file.
>>
>> I tried:
>>  >>> p = 'begin(.*)end'
>>  >>> m = re.search(p,s,re.DOTALL)
>>
>> and got everything between the first begin and last end.  I guess
>> because of a greedy match.  What I want to do is a list where each
>> element is the text between another begin/end pair.
>
>people will tell you to use non-greedy matches, but that's often a
>bad idea in cases like this: the RE engine has to store lots of
Would you say so for this case? Or only for cases like it, and if so, how much like it?

>backtracking information, and your program will consume a lot more
>memory than it has to (and may run out of stack and/or memory).
For the above case, wouldn't the regex compile to a state machine with
just a few states: one that scans .* until it sees an e, falls back to .*
if the next character is not n, otherwise checks for d in the same way,
and then either falls back to .* again or finishes? With a short
terminating match like this, wouldn't that be relatively cheap?
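
For reference, the non-greedy version under discussion would look something
like the sketch below. It is only an illustration run on my small test
string (the same one used in the session further down), and it says nothing
about how much backtracking state the engine keeps on large inputs.

```python
import re

text = """begin s1 end
sdfsdf
begin s2 end
"""

# The non-greedy .*? stops at the first "end" after each "begin";
# re.DOTALL lets "." match newlines, so a pair may span several lines.
pairs = re.findall(r'begin(.*?)end', text, re.DOTALL)
print(pairs)  # [' s1 ', ' s2 ']
```

re.findall returns the captured group for each non-overlapping match, so the
result is already the list the original question asked for.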

>at this point, it's also obvious that you don't really have to use
>regular expressions:
>
>    pos = 0
>
>    while 1:
>        start = text.find("begin", pos)
>        if start < 0:
>            break
>        start += 5
>        end = text.find("end", start)
>        if end < 0:
>            break
>        process(text[start:end])
>        pos = end # move forward
>
></F>

Or breaking your loop with an exception instead of tests:

 >>> text = """begin s1 end
 ... sdfsdf
 ... begin s2 end
 ... """

 >>> def process(s): print('processing(%r)' % s)
 ...
 >>> try:
 ...     end = 0 # end of previous search
 ...     while 1:
 ...         start = text.index("begin", end) + 5
 ...         end = text.index("end", start)
 ...         process(text[start:end])
 ... except ValueError:
 ...     pass
 ...
 processing(' s1 ')
 processing(' s2 ')
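
The same index-based loop can also be packaged as a generator, so the caller
decides what to do with each chunk. This is just a sketch: the name between
and its begin/end parameters are my own, and unlike the loop above it resumes
the search just past the closing "end" rather than at it.

```python
def between(text, begin="begin", end="end"):
    """Yield each piece of text found between a begin/end pair."""
    pos = 0  # where the next search starts
    while True:
        try:
            start = text.index(begin, pos) + len(begin)
            stop = text.index(end, start)
        except ValueError:
            # No further begin, or a begin without a matching end.
            return
        yield text[start:stop]
        pos = stop + len(end)  # resume after the closing marker


chunks = list(between("begin s1 end\nsdfsdf\nbegin s2 end\n"))
print(chunks)  # [' s1 ', ' s2 ']
```

An unterminated trailing begin is silently dropped here, which matches the
behaviour of the try/except loop above.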

Or if you're guaranteed that every begin has an end, you could also write:

 >>> for begxxx in text.split('begin')[1:]:
 ...     process(begxxx.split('end')[0])
 ...
 processing(' s1 ')
 processing(' s2 ')
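
To show why that guarantee matters, here is a small sketch with a made-up
input whose last begin has no end. The plain split version hands back the
unterminated tail as if it were a complete pair; a variant using
str.partition (my substitution, not part of the one-liner above) can skip it.

```python
text = "begin s1 end junk begin s2"

# Plain split: the trailing ' s2' has no closing "end" but is still returned.
chunks = [part.split('end')[0] for part in text.split('begin')[1:]]
print(chunks)  # [' s1 ', ' s2']

# Guarded variant: keep a chunk only if an "end" was actually found.
safe = []
for part in text.split('begin')[1:]:
    body, sep, _rest = part.partition('end')
    if sep:  # sep is '' when "end" does not occur in part
        safe.append(body)
print(safe)  # [' s1 ']
```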

Regards,
Bengt Richter