Cancel or timeout a long running regular expression

Fri Sep 16 18:01:27 EDT 2011

On 9/16/2011 9:57 AM, Nobody wrote:
>>I wonder if there are any
>> heuristics for detecting exponential time re's.
>
> Exponential growth results from non-determinism, i.e. if there are
> multiple transitions for a given character from a given state.
>
> Common patterns include:
>
> 	...(a...)?a...
> 	...(a...)*a...
> 	...(a...)+a...
>
> with a choice between matching the "a" at the start of the bracketed
> pattern or the "a" following it.
>
> 	(xxxa...|xxxb...|xxxc...)
>
> with a choice between branches which cannot be resolved until more data
> has been read.
>
> For re.search (as opposed to re.match):
>
> 	axxxa...
>
> When axxx has been read, a following "a" gives a choice between
> continuing the existing match with the second "a", or aborting and
> matching against the first "a". For patterns which contain many copies of
> the initial character, each copy creates another branch.
>
> Also, using back-references in a regexp [sic] can be a significant
> performance killer, as it rules out the use of a DFA (which, IIRC, Python
> doesn't do anyhow) and can require brute-forcing many combinations. A
> particularly bad case is:
>
> 	(a*)\1
>
> matching against "aaaaaaaa...".
>
> If the performance issue is with re.match/re.search rather than with
> re.compile, one option is to use ctypes to access libc's regexp functions.
> Those are likely to provide better searching throughput at the expense of
> potentially increased compilation time.

Now, can you write that as a heuristic *algorithm* def 
dangerous_re(some_re):?

-- 
Terry Jan Reedy