index for regex.search() beyond which the RE engine will not go.

Sat Aug 20 05:40:33 EDT 2016

On Friday, August 19, 2016 at 10:09:19 PM UTC+8, Steve D'Aprano wrote:
> On Fri, 19 Aug 2016 09:14 pm, iMath wrote:
> 
> > 
> > for
> > regex.search(string[, pos[, endpos]])
> > The optional parameter endpos is the index into the string beyond which
> > the RE engine will not go. This led me to believe that the RE engine
> > will keep searching up to the endpos position even after it has returned
> > the match object. Is this right?
> 
> No.
> 
> Once the RE engine finds a match, it stops. You can test this for yourself
> with a small timing test, using the "timeit" module.
> 
> from timeit import Timer
> huge_string = 'aaabc' + 'a'*1000000 + 'dea'
> re1 = r'ab.a'
> re2 = r'ad.a'
> 
> # set up some code to time.
> setup = 'import re; from __main__ import huge_string, re1, re2'
> t1 = Timer('re.search(re1, huge_string)', setup)
> t2 = Timer('re.search(re2, huge_string)', setup)
> 
> # Now run the timers.
> best = min(t1.repeat(number=1000))/1000
> print("Time to locate regex at the start of huge string:", best)
> best = min(t2.repeat(number=1000))/1000
> print("Time to locate regex at the end of the huge string:", best)
> 
> 
> 
> When I run that on my computer, it prints:
> 
> Time to locate regex at the start of huge string: 4.9710273742675785e-06
> Time to locate regex at the end of the huge string: 0.0038938069343566893
> 
> 
> So it takes about 5 microseconds to find the regex at the beginning of the
> string. To find the regex at the end of the string takes about 3894
> microseconds, roughly 780 times longer.
> 
> 
> The "endpos" parameter tells the RE engine to stop at that position if the
> regex isn't found before it. It won't go beyond that point.
> 
> -- 
> Steve
> “Cheer up,” they said, “things could be worse.” So I cheered up, and sure
> enough, things got worse.

Thanks for clarifying
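
For anyone reading this later, the behaviour described above can also be checked without timing anything. This is only a minimal sketch, assuming CPython's standard re module; the sample string and the two patterns are made up for illustration and are not from the original question.

```python
import re

# Hypothetical sample data: one match near the start, one at the very end.
text = "aaabca" + "x" * 20 + "abza"
pattern = re.compile(r"ab.a")

# search() returns the leftmost match and stops there.
first = pattern.search(text)
print(first.span())

# With endpos=5 only text[0:5] == "aaabc" is visible, so the match
# that needs the character at index 5 can no longer be found.
cut_off = pattern.search(text, pos=0, endpos=5)
print(cut_off)

# Starting at pos=3 skips the first match, so the later one is returned.
second = pattern.search(text, pos=3)
print(second.span())

# The string behaves as if it ended at endpos, so "$" can match there.
end_anchor = re.compile(r"c$")
anchored = end_anchor.search(text, pos=0, endpos=5)
print(anchored.span())
```

So endpos only limits how far a search may look. It does not make the engine keep scanning after a match has already been returned.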