Problem with SRE's regular expressions
Nick Mathewson
QnickQm at alum.mit.edu
Mon Mar 4 18:06:55 EST 2002
In article <3C83EA12.5030406 at free.fr>, Christophe Delord wrote:
> Thanks for your explanation!
>
> In my real program, I had a more complex regexp. I need to match text
> enclosed in << ... >>
> I first have used such an expression : <<([^>]|>[^>])*>>
> but I prefered <<.*?>> because it is more readable.
>
> I'll listen to the advice and use my old regexp to see how faster it is
> this way ;-)
>
> Thanks,
> Christophe.
#
# Actually, it looks like you can do at least sixty times better than
# the re you have above.
#
import sys, time, re
print sys.version
def timefn(fn):
t = time.time()
fn()
t = time.time() - t
print "%40s %f" % (fn.__name__, t)
m#Very Long String
vls = "<<" + " <that's still a very very big string!> "*100 + ">>"
def simple_version():
' This is the version from your post '
r = re.compile(r'<<([^>]|>[^>])*>>')
for i in xrange(10000): r.match(vls)
def non_grouping_match():
' This one uses non-grouping matches '
r = re.compile(r'<<(?:[^>]|>[^>])*>>')
for i in xrange(10000): r.match(vls)
def match_with_big_groups():
''' This one uses [^>]+ inside the group, in order to make each group
as large as possible. '''
r = re.compile(r'<<([^>]+|>[^>])*>>')
for i in xrange(10000): r.match(vls)
def non_grouping_match_with_big_groups():
' This version uses both of the above tricks. '
r = re.compile(r'<<(?:[^>]+|>[^>])*>>')
for i in xrange(10000): r.match(vls)
def nongreedy_match():
" And for comparison, here's the one with the nongreedy match. "
r = re.compile(r'<<.*?>>')
for i in xrange(10000): r.match(vls)
timefn(simple_version)
timefn(non_grouping_match)
timefn(match_with_big_groups)
timefn(non_grouping_match_with_big_groups)
timefn(nongreedy_match)
#
# Platform:
# 2.2+ (#30, Jan 29 2002, 21:20:45)
# [GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)]
#
# OUTPUT
# simple_version 37.967640
# non_grouping_match 34.108361
# match_with_big_groups 0.818009
# non_grouping_match_with_big_groups 0.634849
# nongreedy_match 16.892410
So it looks like the big win is to use [^>]+ inside your parenthesis.
It's a slight improvement to use (?: ) instead of ( ), but not much.
Bonus observation 1: In this case, the overhead from matching a
parenthesized expression over and over is more than than the
backtracking overhead from '.*?' .
Bonus observation 2: I purposefully put some '>' characters into the
test string. When I took them out, I found that the 60X performance
improvement turned into a 332X performance. Thus, the winning pattern
above is fastest of all when the parenthesized expression matches only
once.
It-looks-like-I'm-going-to-have-to-go-rewrite-all-the-scanners-I'm-
maintaining-ly Y'rs
--
Nick Mathewson <Q nick Q m at alum dot mit dot edu>
Remove Q's to respond. No spam.
More information about the Python-list
mailing list