Why is regex so slow?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue Jun 18 21:51:26 EDT 2013
On Tue, 18 Jun 2013 12:45:29 -0400, Roy Smith wrote:
> I've got a 170 MB file I want to search for lines that look like:
>
> [2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history -
> ENQUEUEING: /listen/the-station-one
>
> This code runs in 1.3 seconds:
>
> ------------------------------
> import re
>
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
> for line in open('error.log'):
>     m = pattern.search(line)
>     if m:
>         count += 1
>
> print count
> ------------------------------
>
> If I add a pre-filter before the regex, it runs in 0.78 seconds (about
> twice the speed!)
You say that as if it were surprising. It's not. Regex searches have
higher overhead than bog standard dumb `substring in string` tests, so
reducing the number of high-overhead regex search calls with a
low-overhead `in` test is often a nett win.
Not always; it depends on how many hits and misses you have, and on the
relative overhead of each, but as a general rule I would always try
pre-filtering as the obvious way to optimize a regex.
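For the record, here is a sketch of what such a pre-filter looks like. The
sample lines are made up stand-ins for Roy's error.log (which we don't have),
and it's written with Python 3's print(), unlike the Python 2 code quoted
above:

```python
import re

# Hypothetical sample lines standing in for error.log.
lines = [
    "[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - "
    "ENQUEUEING: /listen/the-station-one",
    "[2010-10-20 16:47:51.000001 -04:00] INFO (6): something else entirely",
]

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in lines:
    # Cheap substring test first; only run the regex on lines
    # that could possibly match.
    if 'ENQ' in line:
        m = pattern.search(line)
        if m:
            count += 1

print(count)
```

The cheap `in` test rejects most lines outright, so the expensive
pattern.search() only runs on the few candidates that survive.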
Even if the regex engine is just as efficient at doing simple character
matching as `in`, and it probably isn't, your regex tries to match all
ten characters of "ENQUEUEING" while the `in` test only has to match
three, "ENQ".
[...]
> Once the regex is compiled, you should have a state machine pattern
> matcher. It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ".
Yes, but O(N) doesn't tell you how long it takes to run, only how it
*scales*.
import time

class MyDumbString(str):
    def __contains__(self, substr):
        time.sleep(1000)
        return super(MyDumbString, self).__contains__(substr)
MyDumbString `in` is also O(N), but it is a wee bit less efficient than
built-in version...
Regexes do a lot of work, because they are far more complex than dumb
string __contains__. It should not surprise you that the overhead is
greater, even when matching a plain-ol' substring with no metacharacters.
Especially since Python does not traditionally use regexes for
everything, like Perl does, and so has not spent the effort to complicate
the implementation in order to squeeze every last microsecond out of it.
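You can see that constant-factor overhead directly with timeit. This is a
quick sketch, not a rigorous benchmark; the test string and iteration count
are arbitrary:

```python
import re
import timeit

# A longish line that does not contain the target substring.
line = "nothing to see here, move along " * 10
pattern = re.compile('ENQUEUEING')

# Time the dumb substring test against the compiled regex search.
t_in = timeit.timeit(lambda: 'ENQ' in line, number=100000)
t_re = timeit.timeit(lambda: pattern.search(line), number=100000)

print(t_in, t_re)
```

Both scale linearly with the length of the line, but the `in` operator is a
single bytecode operation while pattern.search() goes through a method call
and the regex machinery, so its constant factor is noticeably larger.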
> And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well. Why is doing
> twice the work also twice the speed?
It's not doing twice the work. It's doing less work, overall, by spending
a moment to trivially filter out the lines which cannot possibly match,
before spending a few moments to do a more careful match. If the number
of non-matching lines is high, as it is in your data, then the cheaper
pre-filter pays for itself and then some.
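A back-of-the-envelope cost model makes the trade-off concrete. The per-line
costs below are made-up numbers purely for illustration:

```python
# Hypothetical per-line costs, in arbitrary time units.
cost_in = 0.1   # cheap 'ENQ' in line test
cost_re = 1.0   # expensive pattern.search(line) call

n_lines = 1_000_000
n_candidates = 10_000  # lines that pass the pre-filter

# Regex on every line, versus regex only on pre-filtered candidates.
regex_only = n_lines * cost_re
prefiltered = n_lines * cost_in + n_candidates * cost_re

print(regex_only, prefiltered)
```

With mostly non-matching lines, paying the small `in` cost on every line is
far cheaper than paying the large regex cost on every line; the pre-filter
only loses if nearly every line passes it, in which case you pay both costs.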
--
Steven