Regular expression help
David Lees
abcdebl2nonspammy at verizon.net
Fri Jul 18 01:20:37 EDT 2003
Bengt Richter wrote:
> On Fri, 18 Jul 2003 04:31:32 GMT, David Lees <abcdebl2nonspammy at verizon.net> wrote:
>
>
>>Andrew Bennetts wrote:
>>
>>>On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
>>>
>>>
>>>>I forget how to find multiple instances of stuff between tags using
>>>>regular expressions. Specifically I want to find all the text between a
>>>
>>> ^^^^^^^^
>>>
>>>How about re.findall?
>>>
>>>E.g.:
>>>
>>> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
>>> [' foo ', ' bar ']
>>>
>>>-Andrew.
>>>
>>>
>>
>>Actually this fails with the multi-line type of file I was asking about.
>>
>>
>> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
>>
>>[' bar ']
>>
>
> It works if you include the DOTALL flag (?s) at the beginning, which makes
> . also match \n: (BTW, (?si) would make it case-insensitive).
>
> >>> import re
> >>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
> [' foo\nmumble ', ' bar ']
>
> Regards,
> Bengt Richter
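For reference, the inline (?s) flag and the re.DOTALL argument are two spellings of the same option. A minimal sketch, written for Python 3 and using the same test string as above (the variable names are only for illustration):

```python
import re

text = 'BEGIN foo\nmumble END BEGIN bar END'

# Without DOTALL, "." stops at the newline, so the first block is missed.
without_flag = re.findall(r'BEGIN(.*?)END', text)

# The inline flag (?s) and the re.DOTALL argument select the same behaviour.
inline_flag = re.findall(r'(?s)BEGIN(.*?)END', text)
flag_argument = re.findall(r'BEGIN(.*?)END', text, re.DOTALL)

print(without_flag)    # [' bar ']
print(inline_flag)     # [' foo\nmumble ', ' bar ']
print(flag_argument)   # [' foo\nmumble ', ' bar ']
```

The re.DOTALL form is handy when the pattern is compiled once with re.compile and reused.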
I just benchmarked both of Fredrik's suggestions along with Bengt's,
using the same input file. The results, in seconds as returned by
time.clock() (looping 200 times over the 400k file), are:
Fredrik, regex = 1.74003930667
Fredrik, no regex = 0.434207978947
Bengt, regex = 1.45420158149
Interesting how much faster the non-regex approach is: roughly 3 to 4
times faster than either regex version.
Thanks again.
David Lees
The code (which I have not carefully checked) is:
import re, time

def timeBengt(s,N):
    # One compiled, non-greedy pattern; DOTALL lets . match newlines.
    p = 'begin msc(.*?)end msc'
    rx = re.compile(p,re.DOTALL)
    t0 = time.clock()
    for i in xrange(N):
        x = rx.findall(s)
    t1 = time.clock()
    return t1-t0
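
def timeFredrik1(text,N):
    t0 = time.clock()
    for i in xrange(N):
        pos = 0
        # Note: these patterns are just "begin" and "end", not the
        # "begin msc" / "end msc" used in the other two functions.
        START = re.compile("begin")
        END = re.compile("end")
        while 1:
            m = START.search(text, pos)
            if not m:
                break
            start = m.end()
            m = END.search(text, start)
            if not m:
                break
            end = m.start()
            pass # text[start:end] would be processed here
            pos = m.end() # move forward
    t1 = time.clock()
    return t1-t0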

def timeFredrik(text,N):
    t0 = time.clock()
    for i in xrange(N):
        pos = 0
        while 1:
            start = text.find("begin msc", pos)
            if start < 0:
                break
            start += 9 # len("begin msc")
            end = text.find("end msc", start)
            if end < 0:
                break
            pass # text[start:end] would be processed here
            pos = end # move forward
    t1 = time.clock()
    return t1-t0

fh = open('scu.cfg','rb')
s = fh.read()
fh.close()

N = 200
print 'Fredrik, regex = ',timeFredrik1(s,N)
print 'Fredrik, no regex = ',timeFredrik(s,N)
print 'Bengt, regex = ',timeBengt(s,N)
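
The script above is Python 2 (time.clock and xrange no longer exist in current Python 3), and its loops only locate the blocks without keeping them. Below is a rough Python 3 sketch of the same comparison that also collects the text between the tags and checks that both approaches agree. The function names, the made-up sample text, and the repeat count are my own assumptions for illustration, not part of the original benchmark, and no timings are claimed for it.

```python
import re
import time

START_TAG = "begin msc"
END_TAG = "end msc"
BLOCK_RX = re.compile(r"begin msc(.*?)end msc", re.DOTALL)


def blocks_regex(text):
    """Return every block body using one non-greedy DOTALL pattern."""
    return BLOCK_RX.findall(text)


def blocks_find(text):
    """Return every block body using str.find, no regular expressions."""
    found = []
    pos = 0
    while True:
        start = text.find(START_TAG, pos)
        if start < 0:
            break
        start += len(START_TAG)
        end = text.find(END_TAG, start)
        if end < 0:
            break
        found.append(text[start:end])
        pos = end + len(END_TAG)  # resume after the closing tag
    return found


def time_calls(func, text, repeats):
    """Time `repeats` calls of func(text) with time.perf_counter."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        func(text)
    return time.perf_counter() - t0


# Made-up sample input; the original scu.cfg is not available.
sample = "header\n" + "begin msc\nline one\nline two\nend msc\nfiller\n" * 1000

# Both approaches should extract exactly the same blocks.
assert blocks_regex(sample) == blocks_find(sample)

for name, func in (("regex", blocks_regex), ("str.find", blocks_find)):
    print(name, time_calls(func, sample, 20))
```

Using len(START_TAG) instead of the literal 9 keeps the offset correct if the tag text changes.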