Problem with SRE's regular expressions

Tue Mar 5 13:13:18 EST 2002

Thanks for your benchmark.

In fact with <<([^>]|>[^>])*>> I have the same recursion limit problem 
and I use this expression : '<<((?:[^>]+|>[^>]+)*)>>'
It's very close to your solution (just one more '+')

I'm think of another solution : <<(>?[^>]+)*>>

def match_with_big_groups_2():
     r = re.compile(r'<<(>?[^>]+)*>>')
     for i in xrange(10000): r.match(vls)

def non_grouping_match_with_big_groups_2():
     r = re.compile(r'<<(?:>?[^>]+)*>>')
     for i in xrange(10000): r.match(vls)

With these two expressions I can run a little faster.

On my computer I get :

2.2 (#1, Feb 12 2002, 22:46:37)
[GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)]
                           simple_version 62.705866
                       non_grouping_match 60.567762
                    match_with_big_groups 1.558770
       non_grouping_match_with_big_groups 1.232597
                          nongreedy_match 27.859934
                  match_with_big_groups_2 0.933109
     non_grouping_match_with_big_groups_2 0.779298

non_grouping_match_with_big_groups is 50X faster than simple_version

non_grouping_match_with_big_groups_2 is 80X faster than simple_version

I'll-rewrite-my-scanners-too-ly Y'rs,
Christophe.

Nick Mathewson wrote:

> In article <3C83EA12.5030406 at free.fr>, Christophe Delord wrote:
> 
>>Thanks for your explanation!
>>
>>In my real program, I had a more complex regexp. I need to match text 
>>enclosed in << ... >>
>>I first have used such an expression : <<([^>]|>[^>])*>>
>>but I prefered <<.*?>> because it is more readable.
>>
>>I'll listen to the advice and use my old regexp to see how faster it is 
>>this way ;-)
>>
>>Thanks,
>>Christophe.
>>
> 
> #
> # Actually, it looks like you can do at least sixty times better than
> # the re you have above.
> #
> 
> import sys, time, re
> 
> print sys.version
> 
> def timefn(fn):
>     t = time.time()
>     fn()
>     t = time.time() - t
>     print "%40s %f" % (fn.__name__, t)
> 
> m#Very Long String 
> vls = "<<" + " <that's still a very very big string!> "*100 + ">>"
> 
> def simple_version():
>     ' This is the version from your post '
>     r = re.compile(r'<<([^>]|>[^>])*>>')
>     for i in xrange(10000): r.match(vls)
> 
> def non_grouping_match():
>     ' This one uses non-grouping matches '
>     r = re.compile(r'<<(?:[^>]|>[^>])*>>')
>     for i in xrange(10000): r.match(vls)
> 
> def match_with_big_groups():
>     ''' This one uses [^>]+ inside the group, in order to make each group
>         as large as possible. '''
>     r = re.compile(r'<<([^>]+|>[^>])*>>')
>     for i in xrange(10000): r.match(vls)
> 
> def non_grouping_match_with_big_groups():
>     ' This version uses both of the above tricks. '
>     r = re.compile(r'<<(?:[^>]+|>[^>])*>>')
>     for i in xrange(10000): r.match(vls)
> 
> def nongreedy_match():
>     " And for comparison, here's the one with the nongreedy match. "
>     r = re.compile(r'<<.*?>>')
>     for i in xrange(10000): r.match(vls)
> 
> 
> timefn(simple_version)
> timefn(non_grouping_match)
> timefn(match_with_big_groups)
> timefn(non_grouping_match_with_big_groups)
> timefn(nongreedy_match)
> 
> #
> # Platform:
> #    2.2+ (#30, Jan 29 2002, 21:20:45) 
> #    [GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)]
> #
> # OUTPUT
> #                          simple_version 37.967640
> #                      non_grouping_match 34.108361
> #                   match_with_big_groups 0.818009
> #      non_grouping_match_with_big_groups 0.634849
> #                         nongreedy_match 16.892410
> 
> So it looks like the big win is to use [^>]+ inside your parenthesis.
> It's a slight improvement to use (?: ) instead of ( ), but not much.
> 
> Bonus observation 1: In this case, the overhead from matching a
> parenthesized expression over and over is more than than the
> backtracking overhead from '.*?' .
> 
> Bonus observation 2: I purposefully put some '>' characters into the
> test string.  When I took them out, I found that the 60X performance
> improvement turned into a 332X performance.  Thus, the winning pattern
> above is fastest of all when the parenthesized expression matches only
> once.
> 
> It-looks-like-I'm-going-to-have-to-go-rewrite-all-the-scanners-I'm-
>   maintaining-ly Y'rs
> 
> 

-- 
Christophe Delord
http://christophe.delord.free.fr/