Why is regex so slow?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue Jun 18 21:51:26 EDT 2013
On Tue, 18 Jun 2013 12:45:29 -0400, Roy Smith wrote:
> I've got a 170 MB file I want to search for lines that look like:
>
> [2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history -
> ENQUEUEING: /listen/the-station-one
>
> This code runs in 1.3 seconds:
>
> ------------------------------
> import re
>
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
> for line in open('error.log'):
>     m = pattern.search(line)
>     if m:
>         count += 1
>
> print count
> ------------------------------
>
> If I add a pre-filter before the regex, it runs in 0.78 seconds (about
> twice the speed!)
You say that as if it were surprising. It's not. Regex searches have
higher overhead than bog standard dumb `substring in string` tests, so
reducing the number of high-overhead regex search calls with a
low-overhead `in` test is often a nett win.
Not always; it depends on how many hits and misses you have, and on the
relative overhead of each, but as a general rule I would always try
pre-filtering as the obvious way to optimize a regex.
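For the record, here is a sketch of what such a pre-filter looks like. The
sample lines are made up stand-ins for Roy's error.log (which we don't have),
and it's written with Python 3's print(), unlike the Python 2 code quoted
above:

```python
import re

# Hypothetical sample lines standing in for error.log.
lines = [
    "[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - "
    "ENQUEUEING: /listen/the-station-one",
    "[2010-10-20 16:47:51.000001 -04:00] INFO (6): something else entirely",
]

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in lines:
    # Cheap substring test first; only run the regex on lines
    # that could possibly match.
    if 'ENQ' in line:
        m = pattern.search(line)
        if m:
            count += 1

print(count)
```

The cheap `in` test rejects most lines outright, so the expensive
pattern.search() only runs on the few candidates that survive.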
Even if the regex engine is just as efficient at doing simple character
matching as `in`, and it probably isn't, your regex tries to match all
ten characters of "ENQUEUEING" while the `in` test only has to match
three, "ENQ".
[...]
> Once the regex is compiled, you should have a state machine pattern
> matcher. It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ".
Yes, but O(N) doesn't tell you how long it takes to run, only how it
*scales*.
import time

class MyDumbString(str):
    def __contains__(self, substr):
        time.sleep(1000)
        return super(MyDumbString, self).__contains__(substr)
MyDumbString `in` is also O(N), but it is a wee bit less efficient than
built-in version...
Regexes do a lot of work, because they are far more complex than dumb
string __contains__. It should not surprise you that the overhead is
greater, even when matching a plain-ol' substring with no metacharacters.
Especially since Python does not traditionally use regexes for
everything, like Perl does, and so has not spent the effort to complicate
the implementation in order to squeeze every last microsecond out of it.
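You can see that constant-factor overhead directly with timeit. This is a
quick sketch, not a rigorous benchmark; the test string and iteration count
are arbitrary:

```python
import re
import timeit

# A longish line that does not contain the target substring.
line = "nothing to see here, move along " * 10
pattern = re.compile('ENQUEUEING')

# Time the dumb substring test against the compiled regex search.
t_in = timeit.timeit(lambda: 'ENQ' in line, number=100000)
t_re = timeit.timeit(lambda: pattern.search(line), number=100000)

print(t_in, t_re)
```

Both scale linearly with the length of the line, but the `in` operator is a
single bytecode operation while pattern.search() goes through a method call
and the regex machinery, so its constant factor is noticeably larger.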
> And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well. Why is doing
> twice the work also twice the speed?
It's not doing twice the work. It's doing less work, overall, by spending
a moment to trivially filter out the lines which cannot possibly match,
before spending a few moments to do a more careful match. If the number
of non-matching lines is high, as it is in your data, then the cheaper
pre-filter pays for itself and then some.
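A back-of-the-envelope cost model makes the trade-off concrete. The per-line
costs below are made-up numbers purely for illustration:

```python
# Hypothetical per-line costs, in arbitrary time units.
cost_in = 0.1   # cheap 'ENQ' in line test
cost_re = 1.0   # expensive pattern.search(line) call

n_lines = 1_000_000
n_candidates = 10_000  # lines that pass the pre-filter

# Regex on every line, versus regex only on pre-filtered candidates.
regex_only = n_lines * cost_re
prefiltered = n_lines * cost_in + n_candidates * cost_re

print(regex_only, prefiltered)
```

With mostly non-matching lines, paying the small `in` cost on every line is
far cheaper than paying the large regex cost on every line; the pre-filter
only loses if nearly every line passes it, in which case you pay both costs.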
--
Steven