`re' difficulty?

Tim Peters tim_one at email.msn.com
Tue Oct 19 03:31:51 EDT 1999


[François Pinard, with a regexp containing alternatives, one of which can
 match a proper prefix of what's matching by another]
> ...
> I presume the longest match does not apply here, as it was usual for
> me so far, whenever regular expressions are concerned.

Are you sure you used to hack Perl <0.7 wink>?  CPython and JPython both
strive to mimic Perl regexp behavior here.

> May I guess this is all implemented with backtracking, with the first
> matching alternative shadowing the remaining alternatives?

Right.

> Isn't that commiting Python to a behaviour prohibiting later
> optimisations?

Possibly.  OTOH, Python could "optimize" <wink> the regexp

    cat|catfish

to plain

    cat

since it's impossible for catfish to match without first matching cat.

> Or is the exact behaviour just undefined?  What's the story? :-)

Jeffrey Friedl's book "Mastering Regular Expressions" (O'Reilly) defines it
in excruciating detail.  In short, a match is found at the leftmost position
(in the target string) possible; after that, the leftmost (in the regexp)
matching alternative wins, even if an alternative to its right can match a
longer substring; after that, leftmost greed wins for greedy quantifiers,
even if a longer substring could be matched overall by being less greedy
(similarly for minimal quantifiers).  For example,

>>> re.match("ax*(xy)?", "axxxxxy").group(0)
'axxxxx'
>>>

despite that had x* consumed just xxxx the whole string would have matched.

the-right-way-the-wrong-way-and-the-perl-way-ly y'rs  - tim






More information about the Python-list mailing list