How to get the "longest possible" match with Python's RE module?

Tim Peters tim.peters at gmail.com
Tue Sep 12 23:09:00 EDT 2006


[Bryan Olson]
>>> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
>>> book

[Tim Peters]
>> Strongly disagree: [...]

[Bryan]
> I know I'm disagreeing with a lot of smart people in panning
> the book.

That's allowed :-)

>>> What Python uses is search-and-backtrack. Unfortunately such
>>> engines don't have much theory behind them, and it's hard to
>>> reason generally about what they do.

>> Yup X 3, and the last is precisely why Friedl's book is valuable for
>> people using "NFA" implementations:  Friedl does a good job of
>> explaining when and why you're likely to trigger atrocious runtime
>> performance, and gives practical general techniques for avoiding those
>> problems.  None of that has anything to do with CompSci regexps
>> either, but is vital knowledge for people hoping to make happy
>> non-trivial use of Python/Perl/etc regexps.

> I read the first edition, minus some engine-specific stuff.
> General techniques are what I want, and I didn't see them in the
> book. He had things to do and not do, but it didn't add up to
> building (ir-)regular expressions with reliably tractable
> behavior.

My guess then is that the book moved at too slow a pace for you, and
you got bored.  The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

    normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint.  Under most "NFA" implementations, such a regexp is
immune to most bad behaviors, whether it's finding a very long match or
searching a very long string that happens not to contain any matches.
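
Here's a toy illustration of that shape (a quick example of my own
here, not one from the book):  a double-quoted CSV field, where an
embedded quote is written as a doubled quote.  `normal` is "any
character other than a double quote" and `special` is "a doubled
quote", so the two can never start at the same character:

    import re

    # Unrolled shape:  normal* (?: special normal* )*
    # normal  = any character except a double quote
    # special = a doubled quote ("")
    csv_field = re.compile(r'" [^"]* (?: "" [^"]* )* "', re.VERBOSE)

    m = csv_field.match('"say ""hi"" to everyone",next')
    print(m.group())   # -> "say ""hi"" to everyone"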

Without knowing this, under many implementations it's very easy to end
up writing a regexp that matches quickly when a short match exists,
but "blows up" (stack fault, or some checked internal limit hit) if
only a very long match exists; and/or takes exponential time to report
failure when no match exists.
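
A minimal illustration of the exponential case (again just a quick
example of mine):  nested quantifiers whose pieces can carve up the
same text in more than one way, run against strings that can't
possibly match:

    import re, time

    # (a+)+b: the two +'s fight over the same run of a's, and when no
    # 'b' follows, the engine tries every way of splitting them up.
    bad = re.compile(r'(a+)+b')

    for n in (18, 20, 22, 24):
        s = 'a' * n                  # no 'b', so the match must fail
        start = time.time()
        bad.match(s)
        print(n, round(time.time() - start, 2), 'seconds')
    # Each extra 'a' roughly doubles the time needed to report failure.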

For example, a direct application is writing a robust regular
expression to match C string literals (also one form of Python string
literal:  double-quote character at each end, and backslash escapes,
including backslash-newline to span lines):

    " [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "

The "normal case" is that it's just a regular old character:  not a
double quote, backslash, or newline.  The "special case" is a
backslash followed by any character whatsoever.  The "normal" and
"special" cases obviously don't intersect, so the interior of this
regexp (i.e., ignoring the double quotes at each end) fits the
"unrolling" pattern to a tee.

As a result, you can use it with reasonably high confidence.  For
/why/ that's so, read the rest of his book ;-)  In effect, it gives
the search engine only one possible course of action at each character
in the target string.  So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part.  The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).
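
One quick way to see that (my own sketch, not from the book) is to
race the unrolled pattern against a naive version whose alternatives
overlap -- `[^"]` can also match a backslash -- on an unterminated
literal stuffed with escapes, so no match exists:

    import re, time

    unrolled = re.compile(r'" [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "',
                          re.VERBOSE)
    naive    = re.compile(r'" (?: [^"] | \\[\x00-\xff] )* "', re.VERBOSE)

    # Opening quote, lots of escapes, and no closing quote.
    target = '"' + r'\z' * 22

    for name, rx in ('unrolled', unrolled), ('naive', naive):
        start = time.time()
        rx.search(target)
        print(name, round(time.time() - start, 4), 'seconds')
    # The unrolled pattern gives up in linear time; the naive one has
    # to try every way of dividing the backslashes between its two
    # overlapping alternatives before it can admit defeat.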

The unrolling idea was new to me, and as soon as I learned it I
replaced lots of regexps in the Python library with rewritten versions
that demonstrably performed very much better -- even allowed closing
some bug reports concerning extreme cases (extremely long matches, and
unbearable runtime on long strings without any matches).

A different lesson to take from this is that nobody is likely to
stumble into effective "NFA" techniques based on luck or "common
sense".  It's a set of obscure & specialized learned skills.


