regexp speed concerns

Fri Jan 10 12:20:56 EST 2003

"Jonathan Craft" <quioxl at yahoo.com> wrote ...
> Greetings,
>
> First off, I'm not looking to start a general perl vs. python
> discussion here, I'm just someone who has used perl in the past in my
> normal work and decided to try python out for fun.  Needless to say,
> I'm new to python and don't really have any in-depth experience with
> either language.  I have a slight feeling that my questions below may
> be misguided or irrelevant.
>
Ah. Looking for a fight, eh? :-)

> I had an original perl script that, in addition to several other
> functions, would parse simulation logfiles for specific errors that
> indicate failures.
>
> Just for fun, I converted the script from perl to python as a learning
> experience.  One of the first things I noticed was about a 3x increase
> in the average time for a parse to take place between my original perl
> script and my first attempt at a python replacement.  I managed to get
> that down to about a 2x multiplier when I replaced "re" calls with the
> "regex" equivalent (got that idea from a post somewhere, can't
> remember).  I parse files that are anywhere from a few hundred to a
> few million lines of text, and up to 300 differet files.  As I'm
> parsing them one line at a time via a for/in loop, I doubt file size
> really matters.  If you think I should look at improving the way I
> handle file I/O instead of regular expressions, let me know.
>
Andrew Dalke has givein you competent advice in that respect, so I'll leave
this topic alone except to observe that there are usually ways to improve
speed.

> I have a few questions:
> 1).  Should such a slowdown be expected?  The expression that I'm
> searching for is pretty simple (whitespace or start of line, followed
> by 1 of 3 possible strings), but I was suprised to see my python code
> slow down so visibly.  I've been unable to find a clear answer
> anywhere on which lang had faster regexp capabilities, which leads me
> to believe that the result varies on the situation.

Tim Peters puts it well when he says that Perl has "optimized the snot out
of regular expressions". While Python's implementation is no slouch, I am
not surprised to see a speed discrepancy of that nature (though sometimes it
goes the other way). However, it's difficult to compare *just* the regexp
features for the two languages unless you operate on data in memory rather
than reading it from a file.

> 2).  Given my assumption that either language can be faster than the
> other given the application, are there any 'rules of thumb' that I can
> follow to make sure I'm making the python expressions behave as
> quickly as possible?

The three important things to do are measure, measure and measure.

> 3).  Should I be using "match" instead of "search"?  I know that the
> algorithms used under the hood differ for each, but I'm not sure which
> one is the best to use under these conditions.
>
See my remarks under 2).

> The code below is paraphrased...the actual code much bulkier, but the
> key lines are captured below.  The "First Python Attempt" may not be
> syntatically correct, as it was replaced by later attempts and I don't
> have it handy anymore.
>
> Original Perl:
> --------------
> for $line in (<fh>) {
>   if (!$errFound) {
>     if ($line =~ m|[\s^](ERR|err|FAIL)|) {
>       $errFound = 1;
>     }
>   }
> }
>
> First Python Attempt:
> ---------------------
> import re
> errSO = re.compile('[\s^](ERR|err|FAIL)')
> for line in fh:
>   if not finishFound:
>     match = errSO.search(line)
>     if match != None: finishFound = 1
>
> Latest Python Attempt:
> ----------------------
> import regex
> errSO = regex.compile('\( \|^\)\(ERR\|err\|FAIL\)')
> for line in fh:
>   if not finishFound:
>     if doneSO.search(line) >= 0: finishFound = 1
>   if not errFound:
>     if errSO.search(line) >= 0: errFound = 1
>
Since you are using doneSo.search I presume this is supposed to find "ERR"
or "err" or "FAIL" either at the beginning of a line or following a space? I
can't see why match would be any better here: I'd be surprised if search
weren't much faster that an explicit pattern that skipped the unwanted
characters.

Given that Python's logical operations short-circuit, it might be marginally
quicker to write

    if not finishFound and doneSO.search(line):
        finishFound = 1

and the like. Bad practice to elide the guarded line like you did, though:
the style sharks will get you! One stylistic question: why "finishFound" and
"doneSO" rather than "finishFound" and "finishSO"?

regards
--
Steve Holden                                  http://www.holdenweb.com/
Python Web Programming                 http://pydish.holdenweb.com/pwp/
Bring your musical instrument to PyCon!    http://www.python.org/pycon/