regexp non-greedy matching bug?

Sun Dec 4 00:50:19 EST 2005

John Hazen <john at hazen.net> writes:
> I want to match one or two instances of a pattern in a string.

Then you should be using the split() method of the match object on the
pattern in question.

> According to the docs for the 're' module 
> ( http://python.org/doc/current/lib/re-syntax.html ) the '?' qualifier
> is greedy by default, and adding a '?' after a qualifier makes it
> non-greedy.
>
>> The "*", "+", and "?" qualifiers are all greedy...
>> Adding "?" after the qualifier makes it perform the match in
>> non-greedy or minimal fashion...

The thing to understand is that regular expressions are *search*
functions, that return the first parsing that matches. They search a
space of possible matches to each term in the expression. If some term
fails to match, the preceeding term goes on to its next match, and you
try again. The "greedy" vs. "non-greedy" describes the order that the
term in question tries matches. If it's greedy, it will try the
longest possible match first. If it's non-greedy, it'll try the
shortest possible match first.

> $ python2.4
> Python 2.4.1 (#2, Mar 31 2005, 00:05:10) 
> [GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import re
>>>> foofoo = re.compile(r'^(foo)(.*?)(foo)?(.*?)$')
>>>> foofoo.match(s).group(0)
> 'foobarbazfoobar'
>>>> foofoo.match(s).group(1)
> 'foo'
>>>> foofoo.match(s).group(2)
> ''
>>>> foofoo.match(s).group(3)
>>>> foofoo.match(s).group(4)
> 'barbazfoobar'

First, this pattern doesn't look for one or two instances of "foo" in
a string. It looks for a string that starts with "foo" and maybe has a
second "foo" in it as well.

Here's what it does:

^ must match the beginning of the string (BTW, you can get the same
behavior by leaving off the ^ and using search instead of match).

(foo) must match the first foo.

(.*?) is non-greedy, so it tries the shortest thing that matches, the
empty string.

(foo)? matches what's next in the string by matching 0 occurrences of
(foo).

(.*?) tries the empty string as well.

$ fails to match the end of the string, so we backtrack, and the
second (.*?) tries again, adding a character. This keeps on until the
(.*?) matches the rest of the string and the $ succeeds.

Here, the only pattern that gets retried is the last one. It keeps
adding characters until it swallows the remainder of the string. This
is exactly what's expected. You can't change this behavior with
greedy/non-greedy, as that last term will always try searching the
entire string before it fails and the preceeding "(foo)?" gets to try
something else.

>>>> foofoo = re.compile(r'^(foo)(.*?)(foo)(.*?)$')
>>>> foofoo.match(s).group(0)
> 'foobarbazfoobar'
>>>> foofoo.match(s).group(1)
> 'foo'
>>>> foofoo.match(s).group(2)
> 'barbaz'
>>>> foofoo.match(s).group(3)
> 'foo'
>>>> foofoo.match(s).group(4)
> 'bar'
>>>>

This time, the second (foo) keeps failing, forcing the first (.*?) to
add characters until the (foo) succeeds.

> So, is this a bug, or just a problem with my understanding?  If it's my
> brain that's broken, what's the proper way to do this with regexps?

To do what you said you want to do, you want to use the split method:

foo = re.compile('foo')
if 2 <= len(foo.split(s)) <= 3:
   print "We had one or two 'foo's"

As the founder of SPARE, I'd say the proper way to do this is with
string methods:

if 2 <= len(s.split('foo')) <= 3:
   print "We had one or two 'foo's"

Where the two split's will produce exactly the same things.

You might be able to match "starts with 'foo' and has at most one
other 'foo'" with lookahead patternds, but I'm not going to go into
that.  If you really need the string to start with foo, I'd just use
"s.startswith('foo')" to check it.

           <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.