regexp non-greedy matching bug?

Tim Peters tim.peters at gmail.com
Sun Dec 4 00:42:49 EST 2005


[John Hazen]
> I want to match one or two instances of a pattern in a string.
>
> According to the docs for the 're' module
> ( http://python.org/doc/current/lib/re-syntax.html ) the '?' qualifier
> is greedy by default, and adding a '?' after a qualifier makes it
> non-greedy.

>> The "*", "+", and "?" qualifiers are all greedy...
>> Adding "?" after the qualifier makes it perform the match in
>> non-greedy or minimal fashion...

> In the following example, though my re is intended to allow for 1 or 2
> instinces of 'foo', there are 2 in the string I'm matching.  So, I would
> expect group(1) and group(3) to both be populated.  (When I remove the
> conditional match on the 2nd foo, the grouping is as I expect.)
>
> $ python2.4
> Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
> [GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> foofoo = re.compile(r'^(foo)(.*?)(foo)?(.*?)$')
> >>> foofoo.match(s).group(0)
> 'foobarbazfoobar'
> >>> foofoo.match(s).group(1)
> 'foo'
> >>> foofoo.match(s).group(2)
> ''
> >>> foofoo.match(s).group(3)
> >>> foofoo.match(s).group(4)
> 'barbazfoobar'

Your problem isn't that

    (foo)?

is not greedy (it is greedy), it's that your first

    (.*?)

is not greedy.  Remember that regexps also work left to right.  When
you coded your first

    (.*?)

you're asking (because of the '?') the regexp engine to chew up the
fewest possible number of characters at that point such that the
_rest_ of the regexp _can_ match.  By chewing up no characters at all,
the rest of the regexp can in fact match, so that's what the engine
did -- your second

   (foo)?

is optional, telling the engine you don't require that `foo` to match.
 The engine took you at your word about that ;-)

> >>> foofoo = re.compile(r'^(foo)(.*?)(foo)(.*?)$')

In this case your second `foo` is not optional.  The behavior wrt the first

    (.*?)

really doesn't change:  the regexp engine again chews up the fewest
number of characters at that point such that the rest of the regexp
can match.  But because your second `foo` isn't optional in this case,
the engine can't get away with matching 0 characters in this case.  It
still matches the fewest number it can match there, though (consistent
with the rest of the pattern matching too).

> >>> foofoo.match(s).group(0)
> 'foobarbazfoobar'
> >>> foofoo.match(s).group(1)
> 'foo'
> >>> foofoo.match(s).group(2)
> 'barbaz'
> >>> foofoo.match(s).group(3)
> 'foo'
> >>> foofoo.match(s).group(4)
> 'bar'
> >>>
>
> So, is this a bug, or just a problem with my understanding?

The behavior is what I expected ;-)

> If it's my brain that's broken, what's the proper way to do this with regexps?

Sorry, I'm unclear on (exactly) what it is you're trying to
accomplish.  Maybe what you're looking for is

    ^P(.*P)?.*$

?

> And, if the above is expected behavior, should I submit a doc bug?  It's
> clear that the "?" qualifier (applied to the second foo group) is _not_
> greedy in this situation.

See above:  that's not only not clear, it's not true.  Consider a
related but much simpler example:

>>> m = re.match(r'a(b)?(b)?c', 'abc')
>>> m.groups()
('b', None)

Both instances of

    (b)?

are "greedy" there, and that the second one didn't match "b" does not
mean that the second one is not greedy.  It _couldn't_ match without
violating that the _first_ is greedy, and the first "wins" because
regexps work left to right.  It may be harder to see that the same
principle is at work in your example, but it is:  your second (foo)?
couldn't match without violating that your first (.*?) asks for a
minimal match.  My

    ^P(.*P)?.*$

above asks the engine to match two instances of P if possible, but to
settle for one if that's all it can find.



More information about the Python-list mailing list