Why is regex so slow?

Duncan Booth duncan.booth at invalid.invalid
Wed Jun 19 09:21:43 EDT 2013


Roy Smith <roy at panix.com> wrote:

> Except that the complexity in regexes is compiling the pattern down to
> a FSM.  Once you've got the FSM built, the inner loop should be pretty
> quick. In C, the inner loop for executing a FSM should be something
> like: 
> 
> for(char* p = input; p; ++p) {
>     next_state = current_state[*p];
>     if (next_state == MATCH) {
>         break;
>    }
> }
> 
> which should compile down to a couple of machine instructions which
> run entirely in the instruction pipeline cache.  But I'm probably
> simplifying it more than I should :-) 

I'd just like to point out that your simple loop is looking at every 
character of the input string. The simple "'ENQ' not in line" test can look 
at the third character of the string and if it's none of 'E', 'N' or 'Q' 
skip to checking the 6th and then the 9th. It doesn't have to touch the 
intervening characters at all.

Or as the source puts it: it's a mix between Boyer-Moore and Horspool, with 
a few more bells and whistles on the top.

Also the regex library has to do a whole lot more than just figuring out if 
it got a match, so you have massively over-simplified it.

-- 
Duncan Booth http://kupuguy.blogspot.com



More information about the Python-list mailing list