Why is re.search() so much faster than re.sub() when there are no matches?

Wed May 16 06:15:08 EDT 2001

Thanks for that Tim.

The quotes around "line" were just a typo I made while putting together a
simple example. I never timed my real app, but it seemed an order of
magnitude or more slower.

Are the 100,000 Python function calls slow mainly because it is
re-interpreting the same code 100,000 times?

Regards,

Colin

"Tim Peters" <tim.one at home.com> wrote in message
news:mailman.989992043.17376.python-list at python.org...
> [News]
> > I don't understand why re.sub() is so slow if no substitutions are done:
> > The first loop in Active Python build 203 on Windows 2000 takes
> > 1.26 seconds and the second loop takes 49.3 seconds.
>
> Bizarre.  0.74 vs 7.2 seconds for me (Win98SE -- the king of
> high-performance operating systems <wink>).
>
> > That's a huge difference.
>
> Well, most of it's your doing, but the rest isn't.  Read on.
>
> > I would have thought that sub() must do a regular expression search()
> > internally to see if there is anything to substitute,
>
> Yes.
>
> > and don't see why I can make it 39 times faster
>
> Me neither.
>
> > by explicitly doing the search first instead of letting re.sub() do it.
>
> But that's not what you did below:
>
> > import re
> > line = "fsfsaf sf saf sdafsfsadf sadfdsafsadfdsafsf fdsf sf sd f s f " \
> >    "sf saf safsfffff sdfsadf  f  sadf sa"
> > pattern = re.compile(r"\bword\b")
> > for i in range(1,100000):
> >     if pattern.search("line"):
>
> Note that you're not searching the variable line here, you're searching
> the 4-character string "line".  So of course this loop is going to run
> much faster than the next one (it's searching a much smaller string).
>
> >         line = pattern.sub("new word", line)
> > for i in range(1,100000):
> >     line = pattern.sub("new word", line)
>
> The rest of it has a deeper explanation:  pattern.search is implemented in
> C, but pattern.sub is still implemented in Python.  Once you repair your
> first loop to do the same thing as the second, the difference remaining is
> due to the overhead of executing 100,000 Python-level functions.
>
>
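
The repaired comparison described in the quoted reply could be sketched
roughly as follows: both loops now operate on the variable line, so any
remaining gap comes from the cost of calling sub() itself. This is only an
illustration written for a present-day Python; the helper names guarded,
unguarded and timed are made up here and do not appear in the original
post, and the timings will differ by interpreter version and machine.

```python
import re
import time

LINE = ("fsfsaf sf saf sdafsfsadf sadfdsafsadfdsafsf fdsf sf sd f s f "
        "sf saf safsfffff sdfsadf  f  sadf sa")
PATTERN = re.compile(r"\bword\b")
N = 100000  # same order of magnitude as the loops in the post


def guarded(line, n=N):
    """Search the variable `line` (not the literal "line") before substituting."""
    for _ in range(n):
        if PATTERN.search(line):
            line = PATTERN.sub("new word", line)
    return line


def unguarded(line, n=N):
    """Call sub() unconditionally, as in the second loop of the post."""
    for _ in range(n):
        line = PATTERN.sub("new word", line)
    return line


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for func in (guarded, unguarded):
        _, elapsed = timed(func, LINE)
        print(f"{func.__name__}: {elapsed:.3f} s")
```

Since the sample string never contains "word", both functions return it
unchanged, and the comparison measures only the no-match case the thread
is about.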