regexp speed concerns

Fri Jan 10 17:06:17 EST 2003

quioxl at yahoo.com (Jonathan Craft) wrote in message news:<e15e4bd1.0301091611.411bafc2 at posting.google.com>...

> Just for fun, I converted the script from perl to python as a learning
> experience.  One of the first things I noticed was about a 3x increase
> in the average time for a parse to take place between my original perl
> script and my first attempt at a python replacement.  I managed to get
> that down to about a 2x multiplier when I replaced "re" calls with the
> "regex" equivalent (got that idea from a post somewhere, can't
> remember).

The regex module is deep in Norwegian blue parrot territory, six feet
deep. Current Python (version 2.2.x) deprecates it at runtime, and
doesn't supply the documentation any more. You evidently have an
antique Python (1.5.2??). I suggest that you upgrade ASAP. You should
find that the re module is faster.

> 1).  Should such a slowdown be expected?  The expression that I'm
> searching for is pretty simple (whitespace or start of line, followed
> by 1 of 3 possible strings), but I was surprised to see my python code
> slow down so visibly.

Concern yourself with correctness first, speed much later if at all.
The subexpression that you are using below, i.e. [\s^], does not match
at start of line. The square brackets enclose a class (set) of
characters, so it matches a whitespace character or a literal ^
character (definitely in Python, and in Perl too unless the
implementation is broken). Your regexp won't match if your error text
is at the start of the line. Did you test this case?
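
To see the difference for yourself, here is a minimal sketch. It is
written for a current Python 3 rather than the 2.2 discussed above, and
the sample lines and variable names are made up for illustration:

```python
import re

# The pattern as posted: ^ inside [...] is a literal caret, not an anchor.
char_class = re.compile(r'[\s^](ERR|err|FAIL)')
# Alternation keeps ^ outside the class, so it anchors at start of line.
alternation = re.compile(r'(\s|^)(ERR|err|FAIL)')

samples = [
    'ERR at start of line',     # only the alternation finds this one
    'disk ERR in the middle',   # both find this one
    '^ERR after a caret',       # only the character class finds this one
]
for line in samples:
    print(repr(line), bool(char_class.search(line)), bool(alternation.search(line)))
```

The first sample is the case to worry about: the posted pattern misses
an error that starts the line.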

> 3).  Should I be using "match" instead of "search"?  I know that the
> algorithms used under the hood differ for each, but I'm not sure which
> one is the best to use under these conditions.

How do you know what they do under the hood??? In any case this is far
less important than (a) a precise statement of your problem and (b)
reading the documentation so that you know what they do *above* the
hood.

Briefly, match() reports whether the pattern matches at the start
position, whereas search() ... well, would you believe it, it searches
for a match anywhere.
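
A quick sketch of the difference, using a made-up log line and a bare
'ERR' pattern purely for illustration:

```python
import re

pattern = re.compile(r'ERR')
line = 'job 7: ERR disk full'

# match() only tries at the start position (offset 0 by default).
print(pattern.match(line))             # None

# search() scans forward until it finds a match somewhere in the line.
print(pattern.search(line).start())    # 7

# A compiled pattern's match() also takes an explicit start position.
print(pattern.match(line, 7) is not None)   # True
```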

If your error text occurs anywhere in the line, you should be using
search() with '(\s|^)(ERR|err|FAIL)'

If your error text occurs only at the start of the line, possibly
preceded by exactly one (unlikely) whitespace character, use match()
with '[\s]? etc

If your error text occurs only at the start of the line, preceded by
zero or more whitespace characters, use match() with '[\s]* etc
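
The three cases above might be sketched as follows. This assumes that
"etc" stands for the same (ERR|err|FAIL) alternation as in the first
pattern; the names and sample lines are hypothetical:

```python
import re

ERROR_WORDS = r'(ERR|err|FAIL)'   # alternation taken from the original post

anywhere = re.compile(r'(\s|^)' + ERROR_WORDS)         # use with search()
at_most_one_space = re.compile(r'\s?' + ERROR_WORDS)   # use with match()
any_leading_space = re.compile(r'\s*' + ERROR_WORDS)   # use with match()

def classify(line):
    """Return which of the three approaches report an error on this line."""
    return (
        bool(anywhere.search(line)),
        bool(at_most_one_space.match(line)),
        bool(any_leading_space.match(line)),
    )

print(classify('ERR: no such file'))       # (True, True, True)
print(classify('   FAIL: timeout'))        # (True, False, True)
print(classify('step 3 err: bad input'))   # (True, False, False)
```

Which one is right depends entirely on where the error text is allowed
to appear, which is why the precise problem statement matters.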

Maybe your error text could be preceded by punctuation e.g. "Whoopsie
#413 (FAILED)" in which case you could use \b (word boundary, which
here marks the beginning of a word) ...
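
A small sketch of the \b idea, with hypothetical sample strings. Note
that it is looser than the whitespace version: it accepts any non-word
character, or the start of the string, before the error text.

```python
import re

word_start = re.compile(r'\b(ERR|err|FAIL)')
space_or_start = re.compile(r'(\s|^)(ERR|err|FAIL)')

text = 'Whoopsie #413 (FAILED)'
print(bool(word_start.search(text)))        # True: boundary between ( and F
print(bool(space_or_start.search(text)))    # False: ( is not whitespace

# \b still refuses to match in the middle of a word.
print(bool(word_start.search('stderr closed')))   # False
```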

Are you sure of your assumption that neither 'fail' nor 'Fail' occurs?
Are you willing to bet that some programmer/user/pointy-haired boss
won't make that assumption invalid in the future?
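
If you decide not to make that bet, one option is the re.IGNORECASE
flag. The sketch below is only a suggestion; it also accepts mixed-case
spellings such as 'Err', which may or may not be what you want:

```python
import re

# re.IGNORECASE accepts ERR, err, Err, FAIL, fail, Fail, and so on.
any_case = re.compile(r'(\s|^)(err|fail)', re.IGNORECASE)

for line in ['Fail: disk full', 'test fail', 'FAIL', 'all tests passed']:
    print(repr(line), bool(any_case.search(line)))
```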

> First Python Attempt:
> ---------------------
> import re
> errSO = re.compile('[\s^](ERR|err|FAIL)')

Doesn't match start of line, matches a literal ^ character; see above.

Hope this helps,
John