Problem with SRE's regular expressions

Mon Mar 4 18:06:55 EST 2002

In article <3C83EA12.5030406 at free.fr>, Christophe Delord wrote:
> Thanks for your explanation!
> 
> In my real program, I had a more complex regexp. I need to match text 
> enclosed in << ... >>
> I first have used such an expression : <<([^>]|>[^>])*>>
> but I prefered <<.*?>> because it is more readable.
> 
> I'll listen to the advice and use my old regexp to see how faster it is 
> this way ;-)
> 
> Thanks,
> Christophe.

#
# Actually, it looks like you can do at least sixty times better than
# the re you have above.
#

import sys, time, re

print sys.version

def timefn(fn):
    t = time.time()
    fn()
    t = time.time() - t
    print "%40s %f" % (fn.__name__, t)

m#Very Long String 
vls = "<<" + " <that's still a very very big string!> "*100 + ">>"

def simple_version():
    ' This is the version from your post '
    r = re.compile(r'<<([^>]|>[^>])*>>')
    for i in xrange(10000): r.match(vls)

def non_grouping_match():
    ' This one uses non-grouping matches '
    r = re.compile(r'<<(?:[^>]|>[^>])*>>')
    for i in xrange(10000): r.match(vls)

def match_with_big_groups():
    ''' This one uses [^>]+ inside the group, in order to make each group
        as large as possible. '''
    r = re.compile(r'<<([^>]+|>[^>])*>>')
    for i in xrange(10000): r.match(vls)

def non_grouping_match_with_big_groups():
    ' This version uses both of the above tricks. '
    r = re.compile(r'<<(?:[^>]+|>[^>])*>>')
    for i in xrange(10000): r.match(vls)

def nongreedy_match():
    " And for comparison, here's the one with the nongreedy match. "
    r = re.compile(r'<<.*?>>')
    for i in xrange(10000): r.match(vls)

timefn(simple_version)
timefn(non_grouping_match)
timefn(match_with_big_groups)
timefn(non_grouping_match_with_big_groups)
timefn(nongreedy_match)

#
# Platform:
#    2.2+ (#30, Jan 29 2002, 21:20:45) 
#    [GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)]
#
# OUTPUT
#                          simple_version 37.967640
#                      non_grouping_match 34.108361
#                   match_with_big_groups 0.818009
#      non_grouping_match_with_big_groups 0.634849
#                         nongreedy_match 16.892410

So it looks like the big win is to use [^>]+ inside your parenthesis.
It's a slight improvement to use (?: ) instead of ( ), but not much.

Bonus observation 1: In this case, the overhead from matching a
parenthesized expression over and over is more than than the
backtracking overhead from '.*?' .

Bonus observation 2: I purposefully put some '>' characters into the
test string.  When I took them out, I found that the 60X performance
improvement turned into a 332X performance.  Thus, the winning pattern
above is fastest of all when the parenthesized expression matches only
once.

It-looks-like-I'm-going-to-have-to-go-rewrite-all-the-scanners-I'm-
  maintaining-ly Y'rs

-- 
 Nick Mathewson    <Q nick Q m at alum dot mit dot edu>
                      Remove Q's to respond.  No spam.