How to get the "longest possible" match with Python's RE module?

Licheng Fang fanglicheng at gmail.com
Tue Sep 12 05:08:50 EDT 2006


Bryan Olson wrote:
> Licheng Fang wrote:
> > Oh, please do have a look at the second link I've posted. There's a
> > table comparing the regexp engines. The engines you've tested probably
> > all use an NFA implementation.
>
> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
> book was the first time I saw this confusion about what NFA is;
> I don't know if he originated the mess or just propagated it.
>
> "Nondeterministic finite automata" is well defined in theory
> of computation. The set of languages accepted by NFA's is
> exactly the same as the set accepted by DFA's.
>
> What Python uses is search-and-backtrack. Unfortunately such
> engines don't have much theory behind them, and it's hard to
> reason generally about what they do.
>
>
> --
> --Bryan

Thanks for the valuable information. Indeed, when I read the pages, I
was a little confused about what it meant by 'NFA'.

But I faintly felt, there might be an indirect, and not not very exact
mapping from the search-and-backtrack strategy to NFAs in the sense of
computer science, e.g. a state machine with the capability to be in
several states at one time.

Say, when reading 'oneselfsufficient', the engine goes along the NFA
first to state 1, and then faces the choice between

        one         self, none        selfsufficient, none
(start)------->((1))------------>((2))----------------------->((3))

1) matching 'self',
2) going on to state 2 without matching anything, or
3) just give 'one' as the matching result because state 1 is already a
terminal state

In such situations it always chooses the greedy way. To match more, it
goes to the state 2, munching 'self'. And now it's left with only
'sufficient'. Here it's choices are:

1) matching nothing and going to state 3
2) just give 'oneself' as result because state 2 is also a terminal
state

Again it's greedy, going on to state 3, in hope of matching more. But
here the pattern comes to an end, represented by state 3 as a terminal
state. So the engine gives 'oneself' as result and forgets about its
previously unexplored possibilities, because it only performs backtrack
when a match cannot be found.

I think if the backtrack is carried out in an exaustive way, we may say
the engine trys every possibility on the NFA, though it's not an NFA
itself.




More information about the Python-list mailing list