Why is regex so slow?

MRAB python at mrabarnett.plus.com
Tue Jun 18 13:31:51 EDT 2013


On 18/06/2013 17:45, Roy Smith wrote:
> I've got a 170 MB file I want to search for lines that look like:
>
> [2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
>
> This code runs in 1.3 seconds:
>
> ------------------------------
> import re
>
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
>
> for line in open('error.log'):
>      m = pattern.search(line)
>      if m:
>          count += 1
>
> print count
> ------------------------------
>
> If I add a pre-filter before the regex, it runs in 0.78 seconds (about
> twice the speed!)
>
> ------------------------------
> import re
>
> pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
> count = 0
>
> for line in open('error.log'):
>      if 'ENQ' not in line:
>          continue
>      m = pattern.search(line)
>      if m:
>          count += 1
>
> print count
> ------------------------------
>
> Every line which contains 'ENQ' also matches the full regex (61425
> lines match, out of 2.1 million total).  I don't understand why the
> first way is so much slower.
>
> Once the regex is compiled, you should have a state machine pattern
> matcher.  It should be O(n) in the length of the input to figure out
> that it doesn't match as far as "ENQ".  And that's exactly how long it
> should take for "if 'ENQ' not in line" to run as well.  Why is doing
> twice the work also twice the speed?
>
> I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
>
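The effect can be reproduced without the 170 MB log. The sketch below uses synthetic lines (the counts and timings are illustrative, not Roy's data). One plausible explanation for the gap is that `'ENQ' not in line` runs CPython's tuned C substring-search routine with almost no per-line overhead, while `pattern.search` has to drive the regex engine's matching loop across each line, which costs more per character even when the pattern fails early:

```python
import re
import timeit

# Synthetic stand-ins for the log (hypothetical data, not Roy's file):
# many non-matching lines plus one that contains the target text.
lines = (
    ['[2010-10-20 16:47:50.339229 -04:00] INFO (6): '
     'some.other.module - unrelated message'] * 999
    + ['[2010-10-20 16:47:50.339229 -04:00] INFO (6): '
       'songza.amie.history - ENQUEUEING: /listen/the-station-one']
)

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')

def regex_only():
    # Roy's first version: run the regex against every line.
    return sum(1 for line in lines if pattern.search(line))

def prefiltered():
    # Roy's second version: a cheap substring test rejects most
    # lines before the regex engine is ever invoked.
    return sum(1 for line in lines if 'ENQ' in line and pattern.search(line))

# Both approaches agree on the count, as in Roy's log.
assert regex_only() == prefiltered() == 1

print(timeit.timeit(regex_only, number=100))
print(timeit.timeit(prefiltered, number=100))
```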
I'd be interested in how the 'regex' module 
(http://pypi.python.org/pypi/regex) compares. :-)
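A minimal sketch of how one might try that comparison, falling back to the stdlib `re` if the third-party module isn't installed. The sample line is taken from Roy's post; settling the question for real would mean re-running his loop over `error.log` with each engine:

```python
import timeit

try:
    import regex as re_impl  # third-party: pip install regex
except ImportError:
    import re as re_impl     # fall back to the stdlib engine

pattern = re_impl.compile(r'ENQUEUEING: /listen/(.*)')

line = ('[2010-10-20 16:47:50.339229 -04:00] INFO (6): '
        'songza.amie.history - ENQUEUEING: /listen/the-station-one')

m = pattern.search(line)
assert m is not None and m.group(1) == 'the-station-one'

# Per-call cost of search() on one matching line with whichever
# engine was imported above.
print(timeit.timeit(lambda: pattern.search(line), number=10000))
```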
