Faster Regular Expressions

Fredrik Lundh effbot at telia.com
Fri Mar 10 02:56:16 EST 2000


nkipp at vt.edu wrote:

> I post here because these results should go in the searchable
> archive.

no, they shouldn't ;-)

despite your good intentions, your benchmark is flawed.  read on.

> I ran the following regular expression speed test (P166 workstation
> running Linux).  The results are below.  Function "fastMatch" can be
> six times (6x) faster.

You cannot use the profiler to calculate relative performance
in this way.  Since it's an instrumenting profiler, it slows down
Python code (like the glue code in 'pattern.match').  But it
doesn't slow down C functions (like 'pattern.code.match') at
all!

A more correct benchmark gives a difference of just over 2x.

    re: 100
    re-fast:  43

(times are normalized -- original 're' is 100)
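A present-day sketch of how such a wall-clock comparison can be done. It uses the `timeit` module (which postdates this post) and a made-up pattern and subject string, since the original benchmark code isn't reproduced here; unlike an instrumenting profiler, `timeit` taxes Python glue code and C engine calls alike:

```python
import re
import timeit

# Hypothetical pattern and subject, standing in for the
# (unshown) pattern from the original benchmark.
PATTERN = r"\d+-\d+"
SUBJECT = "1234-5678"

def bench(func, number=100_000):
    """Time func by wall clock.  timeit measures real elapsed
    time, so Python-level and C-level work are weighted equally --
    the bias an instrumenting profiler introduces disappears."""
    return timeit.timeit(func, number=number)

compiled = re.compile(PATTERN)
elapsed = bench(lambda: compiled.match(SUBJECT))
```

Normalizing is then just dividing each measured time by the baseline and multiplying by 100.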

> It seems to me that Tatu Ylonen (apparent author of regexp.c) did
> his job well and that the re/mo wrapper in re.py slows everything
> down.

But you're not testing Tatu Ylonen's regexp package (used by the
'regex' module) -- you're short-circuiting the 're' interface layer
built on top of Philip Hazel's PCRE library.

The 'regex' module provides fewer features, is much slower on complex
patterns, and isn't thread safe.  But it has lower calling overhead,
since it doesn't go through an extra Python layer.
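To see how much a pure-Python layer costs per call, here is a small present-day sketch (again using the later `timeit` module, with a made-up pattern): a trivial Python wrapper function stands in for the kind of glue code the old 're.py' put between the caller and the C engine.

```python
import re
import timeit

pat = re.compile(r"a+b")  # hypothetical tiny pattern

def wrapped_match(pattern, string):
    # A pure-Python wrapper layer, standing in for the glue
    # code that 1.5-era re.py executed on every match call.
    return pattern.match(string)

N = 200_000
t_direct = timeit.timeit(lambda: pat.match("aaab"), number=N)
t_wrapped = timeit.timeit(lambda: wrapped_match(pat, "aaab"), number=N)
```

For matches this short, the fixed per-call cost of the extra Python frame is a significant fraction of the total, which is why stripping the layer looks like such a big win on micro-benchmarks.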

After tweaking your pattern to work with 'regex', I get the following
result:

    regex: 40

Slightly faster than your 're' hack, and this one doesn't use any
undocumented features.

And note that those features are not only undocumented; they will
also be gone in 1.6.  The new engine implements the full 're' syntax
and the same interface, stays fast even on relatively complex patterns
and large strings, and has a very tight Python interface:

    sre: 26

That's just under 4x.  Not too bad, imho.

For additional benchmarks, see

    http://www.deja.com/=dnc/getdoc.xp?AN=588925502

</F>
