Why is regex so slow?
Roy Smith
roy at panix.com
Wed Jun 19 15:55:16 EDT 2013
On Wednesday, June 19, 2013 9:21:43 AM UTC-4, Duncan Booth wrote:
> I'd just like to point out that your simple loop is looking at every
> character of the input string. The simple "'ENQ' not in line" test can look
> at the third character of the string and if it's none of 'E', 'N' or 'Q'
> skip to checking the 6th and then the 9th. It doesn't have to touch the
> intervening characters at all.
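The skipping behaviour Duncan describes can be sketched as a Horspool-style search. This is an illustrative assumption, not CPython's actual `fastsearch` code (which uses a bloom-filter variant), but it shows the same idea: a mismatch shifts the window by a distance derived from the window's last character, so intervening characters are never examined.

```python
def horspool_find(haystack, needle):
    """Sketch of Boyer-Moore-Horspool substring search (hypothetical
    helper, not CPython's implementation)."""
    m = len(needle)
    if m == 0:
        return 0
    # Shift table: for each character in needle[:-1], the distance from
    # its last occurrence to the end of the needle.
    shift = {c: m - i - 1 for i, c in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= len(haystack):
        if haystack[pos:pos + m] == needle:
            return pos
        # On a mismatch, look only at the last character of the current
        # window; if it never occurs in the needle, skip the full length m.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1
```

Searching a line for 'ENQ', any window whose last character is none of 'E', 'N', or 'Q' lets the search jump three characters at once, which is exactly the skipping described above.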
It's been a while since I looked at Boyer-Moore in detail. Looking at Objects/stringlib/fastsearch.h from the 2.7.4 source, it occurs to me that:
    /* create compressed boyer-moore delta 1 table */

    /* process pattern[:-1] */
    for (i = 0; i < mlast; i++) {
        STRINGLIB_BLOOM_ADD(mask, p[i]);
        if (p[i] == p[mlast])
            skip = mlast - i - 1;
    }
    /* process pattern[-1] outside the loop */
    STRINGLIB_BLOOM_ADD(mask, p[mlast]);
is essentially (well, sort of) the same as the compile() step of a regex. For the (presumably) common use case of searching many strings for the same substring (which is what we're doing here), it seems like it would be a win to cache the mask and reuse it if the search string's object id is the same as the previous search string's id. The overhead on a cache miss would be a single pointer comparison. Has anybody looked at doing that?