Why is regex so slow?

Roy Smith roy at panix.com
Wed Jun 19 15:55:16 EDT 2013


On Wednesday, June 19, 2013 9:21:43 AM UTC-4, Duncan Booth wrote:

> I'd just like to point out that your simple loop is looking at every 
> character of the input string. The simple "'ENQ' not in line" test can look 
> at the third character of the string and if it's none of 'E', 'N' or 'Q' 
> skip to checking the 6th and then the 9th. It doesn't have to touch the 
> intervening characters at all.

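To make that skipping concrete, here's a rough standalone sketch of the idea (a simplified Horspool-style loop, not the actual fastsearch.h code; the function name and signature are invented).  For a 3-character needle like "ENQ" it looks at the 3rd character of the haystack first and, if that character can't occur anywhere in the needle, jumps straight to the 6th, then the 9th:

    #include <stddef.h>
    #include <string.h>

    /* Return the offset of pat in text, or -1 if absent. */
    static ptrdiff_t
    skip_search(const char *text, size_t n, const char *pat, size_t m)
    {
        size_t i = 0;

        if (m == 0 || m > n)
            return -1;

        while (i <= n - m) {
            /* look first at the text character aligned with the
               *last* pattern character */
            char c = text[i + m - 1];

            if (c == pat[m - 1] && memcmp(text + i, pat, m - 1) == 0)
                return (ptrdiff_t)i;   /* full match at offset i */

            if (memchr(pat, c, m) == NULL)
                i += m;                /* c occurs nowhere in pat: skip m chars */
            else
                i += 1;                /* conservative one-character step */
        }
        return -1;
    }

(The real fastsearch code uses the bloom mask it builds, quoted below, rather than memchr, but the effect is similar: big jumps whenever a text character can't be part of the needle.)
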
It's been a while since I looked at Boyer-Moore in detail.  Looking at Objects/stringlib/fastsearch.h from the 2.7.4 source, it occurs to me that:

        /* create compressed boyer-moore delta 1 table */

        /* process pattern[:-1] */
        for (i = 0; i < mlast; i++) {
            STRINGLIB_BLOOM_ADD(mask, p[i]);
            if (p[i] == p[mlast])
                skip = mlast - i - 1;
        }
        /* process pattern[-1] outside the loop */
        STRINGLIB_BLOOM_ADD(mask, p[mlast]);

is essentially (well, sort of) the same as the compile() step of a regex.  For the (presumably) common use case of searching many strings for the same substring (which is what we're doing here), it seems like it would be a win to cache the mask and reuse it when the search string's id is the same as the previous search string's id.  The only overhead added on a cache miss would be a single pointer comparison.  Has anybody looked at doing that?
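
I'm imagining something along these lines (a hypothetical standalone sketch of the caching idea, not real CPython code; all the names here are invented).  Keying on the pattern's identity rather than its value keeps the check down to a pointer comparison, at the cost of missing when an equal-but-distinct string object comes in:

    #include <stddef.h>
    #include <limits.h>

    /* same idea as STRINGLIB_BLOOM_ADD: set one bit per character value */
    #define BLOOM_ADD(mask, ch) \
        ((mask) |= 1UL << ((unsigned char)(ch) % (sizeof(unsigned long) * CHAR_BIT)))

    typedef struct {
        const char   *pattern;  /* identity of the last pattern seen */
        size_t        len;
        unsigned long mask;     /* cached bloom mask */
        size_t        skip;     /* cached delta-1 skip value */
    } delta1_cache;

    static delta1_cache cache = { NULL, 0, 0, 0 };

    /* assumes m >= 1, as the quoted excerpt does */
    static void
    get_delta1(const char *p, size_t m, unsigned long *mask, size_t *skip)
    {
        size_t i, mlast = m - 1;

        /* In CPython the key would be the pattern object's identity, so a
           single pointer compare suffices; here we only have a char*, so
           the length is remembered as well. */
        if (cache.pattern == p && cache.len == m) {
            *mask = cache.mask;         /* cache hit */
            *skip = cache.skip;
            return;
        }

        /* cache miss: build the table the same way the quoted loop does
           (the initial skip value isn't in the excerpt above; mlast is a
           plausible default) */
        *mask = 0;
        *skip = mlast;
        for (i = 0; i < mlast; i++) {
            BLOOM_ADD(*mask, p[i]);
            if (p[i] == p[mlast])
                *skip = mlast - i - 1;
        }
        BLOOM_ADD(*mask, p[mlast]);

        cache.pattern = p;              /* remember for the next call */
        cache.len = m;
        cache.mask = *mask;
        cache.skip = *skip;
    }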


