Why is regex so slow?

Antoine Pitrou solipsis at pitrou.net
Tue Jun 18 16:05:25 EDT 2013


Roy Smith <roy <at> panix.com> writes:
> 
> Every line which contains 'ENQ' also matches the full regex (61425
> lines match, out of 2.1 million total).  I don't understand why the
> first way is so much slower.

One invokes a fast, special-purpose substring-searching routine
(str.__contains__, which is what the 'in' operator calls on strings); the
other invokes a generic matching engine able to process complex patterns.
It's hardly a surprise that the specialized routine is faster. That's like
saying "I don't understand why my CPU is slower than my GPU at calculating
3D structures".
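
To make the gap concrete, here's a rough timing sketch (the data and the
pattern below are made up, since your post doesn't show them; treat it as
an illustration, not a benchmark):

    import re
    import timeit

    line = "host=abc seq=42 payload=.... ENQ more fields here " * 5
    pattern = re.compile(r"ENQ.*=\d+")   # stand-in for a more complex regex

    t_substr = timeit.timeit(lambda: "ENQ" in line, number=200000)
    t_regex = timeit.timeit(lambda: pattern.search(line), number=200000)

    print("str.__contains__:", t_substr)
    print("pattern.search:  ", t_regex)

On a typical CPython build the substring test comes out well ahead, even
though both only make one pass over the line.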

That said, there may be ways to improve the regex engine to handle such
special cases more quickly. But there will still be some overhead, simply
because you are invoking a powerful generic engine rather than a lean and
mean specialized routine.

(To be fair, on CPython there's also the fact that operators are faster
than method calls, so that adds a bit of overhead as well.)
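
A quick way to see that operator-vs-method-call difference (the numbers
will vary by interpreter version and machine; this is just a sketch):

    import timeit

    setup = "line = 'an ordinary log line without the marker'"

    # The same substring test, spelled three ways.
    print(timeit.timeit("'ENQ' in line", setup=setup, number=1000000))
    print(timeit.timeit("line.__contains__('ENQ')", setup=setup, number=1000000))
    print(timeit.timeit("line.find('ENQ') != -1", setup=setup, number=1000000))

The operator form skips the attribute lookup and method-call machinery,
so it usually wins.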

> Once the regex is compiled, you should have a state machine pattern
> matcher.  It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ".  And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well.

You should read up on the O(...) notation again. It describes asymptotic
complexity; it tells you nothing about the actual running time at any
particular data point. So you can have two O(n) routines, one of which is
always twice as fast as the other.
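
A toy illustration of that point (both functions below are O(n), but one
does roughly twice the work per character, so the ratio stays near 2 at
every input size; the helpers are made up for the example):

    import timeit

    def one_pass(s):
        n = 0
        for _ in s:          # single pass over the data: O(n)
            n += 1
        return n

    def two_passes(s):
        n = 0
        for _ in s:          # two passes: still O(n), ~2x the constant
            n += 1
        for _ in s:
            n += 1
        return n // 2

    for size in (10000, 100000, 1000000):
        s = "x" * size
        t1 = timeit.timeit(lambda: one_pass(s), number=10)
        t2 = timeit.timeit(lambda: two_passes(s), number=10)
        print(size, round(t2 / t1, 2))

Both scale linearly with the input, yet one is consistently about twice
as slow.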

Regards

Antoine.




