regexp speed concerns

Fri Jan 10 17:44:09 EST 2003

Andrew Dalke <adalke at mindspring.com> wrote in message news:<avlhs0$cg5$1 at slb9.atl.mindspring.net>...
> You can also use a non-regex, as in
> 
> import sys
> 
> def main(fh):
>      errSO = re.compile('[\s^](ERR|err|FAIL)')
>      finishFound = 0
>      looking_for = ("ERR", "err", "FAIL")
>      for line in fh:
>          words = line.split()
>          if words and words[0] in looking_for:
>              finishFound = 1
>              break
>       ...

Note that this code implements a restatement of the OP's
"requirements" -- it accepts only an exact match on the first word of
a line, whereas his regexp also matches words such as error, errata,
FAILED, FAILURE, and things like ERR987 and FAIL-1234.
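
To make the difference concrete, here is a small sketch that runs both
tests over a few made-up lines. The sample lines and the helper names
regex_hit and first_word_hit are illustrative assumptions, not part of
the original code; the pattern is the OP's, written as a raw string.

```python
import re

# The OP's pattern as a raw string. Inside [...] a ^ that is not the
# first character is a literal caret, so the keyword must be preceded
# by whitespace or by "^".
err_re = re.compile(r'[\s^](ERR|err|FAIL)')
looking_for = ("ERR", "err", "FAIL")

def regex_hit(line):
    """True if the OP's regexp matches anywhere in the line."""
    return err_re.search(line) is not None

def first_word_hit(line):
    """True if the first whitespace-separated word is an exact keyword."""
    words = line.split()
    return bool(words) and words[0] in looking_for

# Hypothetical sample lines, chosen only to show where the two disagree.
samples = [" ERR disk full", " error in module", " ERR987 raised", "job FAILED"]
for line in samples:
    print(repr(line), regex_hit(line), first_word_hit(line))
```

Only the first sample satisfies both tests; the other three are matched
by the regexp but rejected by the first-word comparison.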

That said, the code can possibly be sped up by (a) using a dictionary
for the membership test and (b) stopping the split after the first
word, viz:

     looking_for = {"ERR":1, "err":1, "FAIL":1}
     for line in fh:
         words = line.split(None, 1)
         if words and words[0] in looking_for:
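
Put together as a runnable function, the two suggestions might look
like the sketch below. The function name finish_found and the
io.StringIO sample input are assumptions made for illustration; the
dictionary is kept as in the snippet above, though in current Python a
set would do the same job.

```python
import io

# Only the keys matter; membership is a hash lookup rather than a scan.
looking_for = {"ERR": 1, "err": 1, "FAIL": 1}

def finish_found(fh):
    """Return 1 if any line's first word is a keyword, else 0."""
    for line in fh:
        # maxsplit=1: stop splitting once the first word is isolated.
        words = line.split(None, 1)
        if words and words[0] in looking_for:
            return 1
    return 0

# Hypothetical input standing in for the real file.
sample = io.StringIO("all good\n   ERR disk full\nnever reached\n")
print(finish_found(sample))
```

Whether this is measurably faster will depend on line length and on
how early a match occurs, so it is worth timing on real input.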

As an aside, beware the subtle undocumented special treatment of the
default delimiter case:

>>> "---err---".split("-")
['', '', '', 'err', '', '', '']

>>> "   err   ".split(" ")
['', '', '', 'err', '', '', '']

>>> "   err   ".split("   ")
['', 'err', '']

>>> "   err   ".split()
['err']

Consistency with the non-default cases would produce ['', 'err', ''].
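
For comparison, re.split with an explicit whitespace pattern gives the
"consistent" result mentioned above: it treats each run of whitespace
as one delimiter but still reports the empty fields at both ends. This
is only a sketch to illustrate the difference, not a suggested
replacement for split().

```python
import re

s = "   err   "

# Default split: leading and trailing whitespace yield no empty strings.
default_result = s.split()

# Explicit pattern: a run of whitespace is one delimiter, and the empty
# strings before the first and after the last delimiter are kept.
explicit_result = re.split(r"\s+", s)

print(default_result)    # ['err']
print(explicit_result)   # ['', 'err', '']
```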