How to get the "longest possible" match with Python's RE module?

Tim Peters tim.peters at gmail.com
Sat Sep 16 19:03:33 EDT 2006


[Tim Peters]
>> [...]  The most valuable general technique [Friedl] (eventually ;-)
>> explained he called "unrolling", and consists of writing a regexp in
>> the form:
>>
>>    normal* (?: special normal* )*
>>
>> where the sets of characters with which `normal` and `special` can
>> start are disjoint.  Under most "NFA" implementations, such a regexp is
>> immune to most bad behaviors, whether finding very long matches, or in
>> searching a very long string that happens not to contain any matches.

[Bryan Olson]
> So how general is that?

Between 6 and 7 ;-)

> The books just says "certain common expressions". I think the real
> answer is that one can unroll exactly the true regular expressions.

I'm sure he didn't have that much in mind; e.g., it's a real strain to
fit a fat alternation of fixed strings into that "pattern"
(cat|dog|wombat|weasel|...).
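For patterns that do fit the form, the classic instance (the one Friedl
builds the chapter around, if I recall right) is a double-quoted string
with backslash escapes.  A quick sketch in Python:

```python
import re

# Unrolled-loop pattern in the form  normal* (?: special normal* )* :
#   normal  = [^"\\]   any character that is not a quote or backslash
#   special = \\.      a backslash followed by any character
# The sets of characters `normal` and `special` can start with are
# disjoint, so the engine never has two ways to consume a character.
quoted = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

m = quoted.search(r'say "hello \"world\"" now')
print(m.group())  # "hello \"world\""
```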

> The unrolling is essentially resolving the states of a DFA by hand.

Oh yes, for those who've studied the theoretical underpinnings.  Most
people have not.  Much of what he gives as practical advice amounts to
ways to "trick" an "NFA" engine into doing what a "DFA" engine would
do, but explained from the POV of how a backtracking search engine
works.  And it makes /sense/ at that level too, divorced from any
knowledge of how a linear-time automaton might be constructed in
theory; e.g., you "want to" give the search engine only one choice
because you want to avoid the possibility of needing to backtrack
later, not because you're picturing a theoretical linear-time
automaton and striving to mimic its behavior.
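A small sketch of that one-choice idea (my own illustration, not from
Friedl), contrasting an overlapping alternation with its unrolled
equivalent:

```python
import re

# Overlapping alternatives: both \\. and [^"] can consume a backslash,
# so on a quoted-looking string with NO closing quote the engine has
# exponentially many ways to carve up the input before giving up.
naive = re.compile(r'"(?:\\.|[^"])*"')

# Unrolled form: `normal` ([^"\\]) and `special` (\\.) start with
# disjoint characters, so there is only ever one choice to try.
unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

subject = '"' + r'\x' * 22          # opening quote, never closed

# The unrolled pattern fails fast:
assert unrolled.search(subject) is None

# naive.search(subject) has to explore on the order of 2**22 paths
# before failing -- try it with a larger repeat count at your own risk.
```

On strings that do match, of course, the two patterns agree; the
difference only shows up in how much work a failure costs.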

That explanation may not work for you, but it's effective for people
who haven't studied the theory here.  Such explanations also extend
naturally to gimmicks like backreferences, which have no counterpart
in CompSci regexps.
