Pathological regular expression

Sat Apr 11 19:46:20 EDT 2009

On Apr 12, 3:40 am, Steven D'Aprano <st... at REMOVE-THIS-
cybersource.com.au> wrote:
> On Sat, 11 Apr 2009 08:40:03 -0700, John Machin wrote:
> >> To my mind, this is a bug in the RE engine. Is there any reason to not
> >> treat it as a bug?
>
> > IMHO it's not a bug -- s/hang/takes a long time to compute/
>
> > Just look at it: 2 + operators and 3 * operators ... It's one of those
> > "come back after lunch" REs.
>
> Well, it's been running now for about two and a half hours, that's a
> rather long lunch. And despite MRAB's assertion, it *cannot* be
> interrupted by ctrl-C. That means that to all intents and purposes, the
> interpreter has locked up for the duration of the calculation, which may
> be days or weeks for all I know.

If you don't know, experiment!

import re, time
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
for end in ("# yadda", ""):
    for start in ("", '"#"'):
        for size in xrange(25):
            line = start + "x" * size + end
            print line
            t0 = time.clock()
            result = _re_comments_nc.sub(r"\1", line)
            t1 = time.clock()
            print "%s (%.6f secs)" % (result, t1 - t0)

Observations:
1. Runs at light speed in first two cases (genuine comemnt to be
removed)
2. Takes approx O(2**N) in the 3rd case (no # in sight)
3. Takes even longer to produce a *WRONG* result in the 4th case (# in
a string literal)
4. The RE assumes that only " can quote a string literal; what about '
''' and """?

Cluesticks:
1. if "#" not in line:
    dont_need_no_stinking_regexes()
2. try timing on short pieces of input
3. Test the RE, get it correct