Why TERRIBLE performance of regular expressions in re module ?
Tim Peters
tim_one at email.msn.com
Tue Oct 5 01:06:04 EDT 1999
[rturpin at my-deja.com]
> While debugging a CGI script, I came across the following
> strange behavior: a regular expression that seems to
> require exponential time to recognize a string.
> ...
> It doesn't take a genius to recognize this progression
> and decide not to proceed. What puzzles me is this:
> regular expressions have a well-recognized EFFICIENT
> recognition algorithm: they are turned into state machines
> that perform LINEARLY on the length of the input string.
Python doesn't use that approach; neither does Perl, for that matter. See
Jeffrey Friedl's book "Mastering Regular Expressions" (O'Reilly) for
details. One hangup is that Python/Perl "regular expressions" aren't
regular -- tricks like backreferences and capturing parens go beyond what a
DFA can do. Another hangup is that a linear-time DFA recognizer can take
exponential time to build. A third hangup is that a DFA recognizer isn't
necessarily the fastest for a large class of regexps, although Perl is much
better than Python at exploiting the possibilities here (e.g., Perl can do a
fast Boyer-Moore search for a literal embedded substring that needs to
match, skipping-- at high speed --over vast stretches of input that can't
possibly match overall due to not containing a match for the embedded
literal).
> I don't think I am doing anything strange. (Though I don't
> rule it out).
You probably have nested quantifiers ("+" and "*" thingies) that suffer
exponential ambiguity. Post the regexp and someone may speed it up. See
Friedl's book for a long and detailed course in what's needed.
> Has anyone seen this behavior,
Thousands.
> or does anyone have an explanation for it?
One or two <wink>.
yet-another-happy-regexp-user-ly y'rs - tim
More information about the Python-list
mailing list