How to get the "longest possible" match with Python's RE module?

Bryan Olson fakeaddress at nowhere.org
Wed Sep 13 00:12:16 EDT 2006


Tim Peters wrote:
> [...]  The most valuable general technique he (eventually ;-)
> explained he called "unrolling", and consists of writing a regexp in
> the form:
> 
>    normal* (?: special normal* )*
> 
> where the sets of characters with which `normal` and `special` can
> start are disjoint.  Under most "NFA" implementations, such a regexp
> is immune to most bad behaviors, whether finding very long matches or
> searching a very long string that happens not to contain any matches.

[...]
> As a result, you can use it with reasonably high confidence.  For
> /why/ that's so, read the rest of his book ;-)  In effect, it gives
> the search engine only one possible course of action at each character
> in the target string.  So long as that's "normal", it has to stay in
> the "normal" part of the regexp, and as soon as it's "special" it has
> to leave the "normal" part.  The intent is to eliminate any
> possibility for backtracking, thus ensuring predictable runtime and
> ensuring that no form of internal backtracking stack needs to be used
> (let alone hit its limit).
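
For concreteness, here is a rough sketch of what such an unrolled
pattern looks like in Python, using the usual double-quoted-string
example (the pattern and test string below are just illustrative --
"normal" is [^"\\] and "special" is a backslash escape):

    import re

    # normal  = [^"\\]   (anything except quote or backslash)
    # special = \\.      (a backslash escape)
    # The two sets start with disjoint characters, so the engine never
    # has more than one way to consume the next character.
    unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

    text = r'he said "a \"quoted\" word" and left'
    print(unrolled.search(text).group())   # -> "a \"quoted\" word"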

So how general is that? The book just says "certain common
expressions". I think the real answer is that one can unroll
exactly those patterns that are true regular expressions; the
unrolling is essentially resolving the states of a DFA by hand.
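
To see what that buys you, compare the unrolled pattern with the
naive, ambiguous way of writing the same regular language (a quick
demo of my own; the numbers are only roughly indicative):

    import re

    # Both patterns describe the same regular language, but the first
    # lets a backslash be matched either by \\. or by [^"], so a failed
    # match forces the backtracker to try every combination of choices.
    ambiguous = re.compile(r'"(?:\\.|[^"])*"')
    unrolled  = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

    # An unterminated string full of escapes is the worst case; the
    # ambiguous form does on the order of 2**N units of work here,
    # while the unrolled form fails after a linear scan.
    N = 20
    victim = '"' + r'\x' * N        # note: no closing quote

    print(unrolled.search(victim))    # None, found quickly
    print(ambiguous.search(victim))   # also None, after heavy backtracking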


-- 
--Bryan
