Problem with SRE's regular expressions
Christophe Delord
christophe.delord at free.fr
Tue Mar 5 13:13:18 EST 2002
Thanks for your benchmark.
In fact with <<([^>]|>[^>])*>> I have the same recursion limit problem
and I use this expression : '<<((?:[^>]+|>[^>]+)*)>>'
It's very close to your solution (just one more '+')
I'm think of another solution : <<(>?[^>]+)*>>
def match_with_big_groups_2():
r = re.compile(r'<<(>?[^>]+)*>>')
for i in xrange(10000): r.match(vls)
def non_grouping_match_with_big_groups_2():
r = re.compile(r'<<(?:>?[^>]+)*>>')
for i in xrange(10000): r.match(vls)
With these two expressions I can run a little faster.
On my computer I get :
2.2 (#1, Feb 12 2002, 22:46:37)
[GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)]
simple_version 62.705866
non_grouping_match 60.567762
match_with_big_groups 1.558770
non_grouping_match_with_big_groups 1.232597
nongreedy_match 27.859934
match_with_big_groups_2 0.933109
non_grouping_match_with_big_groups_2 0.779298
non_grouping_match_with_big_groups is 50X faster than simple_version
non_grouping_match_with_big_groups_2 is 80X faster than simple_version
I'll-rewrite-my-scanners-too-ly Y'rs,
Christophe.
Nick Mathewson wrote:
> In article <3C83EA12.5030406 at free.fr>, Christophe Delord wrote:
>
>>Thanks for your explanation!
>>
>>In my real program, I had a more complex regexp. I need to match text
>>enclosed in << ... >>
>>I first have used such an expression : <<([^>]|>[^>])*>>
>>but I prefered <<.*?>> because it is more readable.
>>
>>I'll listen to the advice and use my old regexp to see how faster it is
>>this way ;-)
>>
>>Thanks,
>>Christophe.
>>
>
> #
> # Actually, it looks like you can do at least sixty times better than
> # the re you have above.
> #
>
> import sys, time, re
>
> print sys.version
>
> def timefn(fn):
> t = time.time()
> fn()
> t = time.time() - t
> print "%40s %f" % (fn.__name__, t)
>
> m#Very Long String
> vls = "<<" + " <that's still a very very big string!> "*100 + ">>"
>
> def simple_version():
> ' This is the version from your post '
> r = re.compile(r'<<([^>]|>[^>])*>>')
> for i in xrange(10000): r.match(vls)
>
> def non_grouping_match():
> ' This one uses non-grouping matches '
> r = re.compile(r'<<(?:[^>]|>[^>])*>>')
> for i in xrange(10000): r.match(vls)
>
> def match_with_big_groups():
> ''' This one uses [^>]+ inside the group, in order to make each group
> as large as possible. '''
> r = re.compile(r'<<([^>]+|>[^>])*>>')
> for i in xrange(10000): r.match(vls)
>
> def non_grouping_match_with_big_groups():
> ' This version uses both of the above tricks. '
> r = re.compile(r'<<(?:[^>]+|>[^>])*>>')
> for i in xrange(10000): r.match(vls)
>
> def nongreedy_match():
> " And for comparison, here's the one with the nongreedy match. "
> r = re.compile(r'<<.*?>>')
> for i in xrange(10000): r.match(vls)
>
>
> timefn(simple_version)
> timefn(non_grouping_match)
> timefn(match_with_big_groups)
> timefn(non_grouping_match_with_big_groups)
> timefn(nongreedy_match)
>
> #
> # Platform:
> # 2.2+ (#30, Jan 29 2002, 21:20:45)
> # [GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)]
> #
> # OUTPUT
> # simple_version 37.967640
> # non_grouping_match 34.108361
> # match_with_big_groups 0.818009
> # non_grouping_match_with_big_groups 0.634849
> # nongreedy_match 16.892410
>
> So it looks like the big win is to use [^>]+ inside your parenthesis.
> It's a slight improvement to use (?: ) instead of ( ), but not much.
>
> Bonus observation 1: In this case, the overhead from matching a
> parenthesized expression over and over is more than than the
> backtracking overhead from '.*?' .
>
> Bonus observation 2: I purposefully put some '>' characters into the
> test string. When I took them out, I found that the 60X performance
> improvement turned into a 332X performance. Thus, the winning pattern
> above is fastest of all when the parenthesized expression matches only
> once.
>
> It-looks-like-I'm-going-to-have-to-go-rewrite-all-the-scanners-I'm-
> maintaining-ly Y'rs
>
>
--
Christophe Delord
http://christophe.delord.free.fr/
More information about the Python-list
mailing list