regexp speed concerns

Andrew Dalke adalke at mindspring.com
Thu Jan 9 23:15:51 EST 2003


Jonathan Craft wrote:
> 1).  Should such a slowdown be expected?  The expression that I'm
> searching for is pretty simple (whitespace or start of line, followed
> by 1 of 3 possible strings), but I was suprised to see my python code
> slow down so visibly.  I've been unable to find a clear answer
> anywhere on which lang had faster regexp capabilities, which leads me
> to believe that the result varies on the situation.
> 2).  Given my assumption that either language can be faster than the
> other given the application, are there any 'rules of thumb' that I can
> follow to make sure I'm making the python expressions behave as
> quickly as possible?
> 3).  Should I be using "match" instead of "search"?  I know that the
> algorithms used under the hood differ for each, but I'm not sure which
> one is the best to use under these conditions.

There are ways to speed it up.  See below.  One thing to bear in
mind is that Perl coders, when starting to use Python, use regexps
a lot more than most Python coders do.

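Regarding match: yes, match would be faster, since it only tries
the pattern at the start of the string, but that also means it would
miss a hit later in the line.  Search, which is what you use, retries
the pattern starting from every character until it succeeds or runs
out of characters.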
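
A minimal sketch of the match/search difference.  The sample line is
made up, and the pattern is my assumption of what was intended: inside
[...] a '^' is a literal caret, so start-of-line is spelled here as an
alternation instead.

```python
import re

# Hypothetical sample line, not taken from the original post.
line = "step 4 done  FAIL: disk full"

# Assumed intent: start of line or whitespace, then one of three strings.
pattern = re.compile(r"(?:^|\s)(ERR|err|FAIL)")

# match() only tries the pattern at index 0, so it misses a later hit.
at_start = pattern.match(line)

# search() retries at each position until it succeeds or runs out of text.
anywhere = pattern.search(line)

print(at_start)
print(anywhere.group(1))
```

So match is only a drop-in replacement when the text you want is known
to sit at the very start of the line.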

> First Python Attempt:
> ---------------------
> import re
> errSO = re.compile('[\s^](ERR|err|FAIL)')
> for line in fh:
>   if not finishFound:
>     match = errSO.search(line)
>     if match != None: finishFound = 1

Hmm... Missing a few variables.

Variable lookups in module scope are slower than in function
scope.  You don't need a None lookup, just do the implicit
check for truth.  Also, it looks like 'finishFound' is a flag
to stop processing, so you can break at that point; you don't
need to read the rest of the file.


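import re, sys

def main(fh):
    # Inside [...] a '^' is a literal caret, not start-of-line,
    # so spell "start of line or whitespace" as an alternation.
    errSO = re.compile(r'(?:^|\s)(ERR|err|FAIL)')
    finishFound = 0
    for line in fh:
        # A match object is true and None is false; no "!= None" needed.
        if errSO.search(line):
            finishFound = 1
            break
    ...

if __name__ == "__main__":
    main(sys.stdin)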
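
The module-scope versus function-scope point can be checked with a
rough timing sketch like the one below.  The loop, the iteration count
and the helper name time_source are all made up for illustration, and
the numbers will vary by machine and Python version, so treat it as a
way to measure rather than a result.

```python
import time

# The same loop twice: once as bare module-level code, once in a function.
MODULE_LEVEL = (
    "total = 0\n"
    "for i in range(200000):\n"
    "    total += i\n"
)
FUNCTION_LEVEL = (
    "def run():\n"
    "    total = 0\n"
    "    for i in range(200000):\n"
    "        total += i\n"
    "run()\n"
)

def time_source(src):
    # Compile in "exec" mode so top-level names are looked up in a dict,
    # as they are in a real module; names inside run() are fast locals.
    code = compile(src, "<sketch>", "exec")
    start = time.perf_counter()
    exec(code, {})
    return time.perf_counter() - start

t_module = time_source(MODULE_LEVEL)
t_function = time_source(FUNCTION_LEVEL)
print("module scope:   %.4f s" % t_module)
print("function scope: %.4f s" % t_function)
```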


You can also use a non-regex approach, as in

import sys

def main(fh):
    finishFound = 0
    looking_for = ("ERR", "err", "FAIL")
    for line in fh:
        # Exact match on the first whitespace-separated word only.
        words = line.split()
        if words and words[0] in looking_for:
            finishFound = 1
            break
    ...

if __name__ == "__main__":
    main(sys.stdin)
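
If the hit can be any word on the line, or a longer word such as
"ERROR", a variant along these lines may be closer to what the regex
was after.  This is only a sketch: the function name has_error_word is
made up, and the regex is my assumption of the intended pattern, kept
here for comparison.

```python
import re

PREFIXES = ("ERR", "err", "FAIL")

# Assumed intent of the original pattern, used only to compare results.
pattern = re.compile(r"(?:^|\s)(?:ERR|err|FAIL)")

def has_error_word(line):
    # str.startswith accepts a tuple, so one call tests all three prefixes.
    return any(word.startswith(PREFIXES) for word in line.split())

# Hypothetical sample lines for a quick comparison of the two approaches.
samples = ["ok  ERROR: bad input", "stderr is quiet", "FAIL", "all good"]
for sample in samples:
    print(repr(sample), has_error_word(sample), bool(pattern.search(sample)))
```

For ordinary text the two should agree, since a word from split() begins
either at the start of the line or right after whitespace.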

					Andrew
					dalke at dalkescientific.com




