regexp non-greedy matching bug?

Mike Meyer mwm at mired.org
Sun Dec 4 06:46:08 EST 2005


John Hazen <john at hazen.net> writes:
>> To do what you said you want to do, you want to use the split method:
>> 
>> foo = re.compile('foo')
>> if 2 <= len(foo.split(s)) <= 3:
>>    print "We had one or two 'foo's"
>
> Well, this would solve my dumbed down example, but each foo in the
> original expression was a stand-in for a more complex term. 

That actually doesn't matter. Just replace 'foo' with your more
complex term.

>>> foo2 = re.compile(r'foo(\d+)bar')

> I was using
> match groups to extract the parts of the match that I wanted.  Here's an
> example (using Tim's correction) that actually demonstrates what I'm
> doing:
>>>> s = 'zzzfoo123barxxxfoo456baryyy'
>>>> s2 = 'zzzfoo123barxxxfooyyy'
>>>> foobar2 = re.compile(r'^.*?foo(\d+)bar(.*foo(\d+)bar)?.*$')
>>>> print foobar2.match(s).group(1)
> 123
>>>> print foobar2.match(s).group(3)
> 456

>>> foo2.split('zzzfoo123barxxxfoo456baryyy')
['zzz', '123', 'xxx', '456', 'yyy']
>>> 

>>>> print foobar2.match(s2).group(1)
> 123
>>>> print foobar2.match(s2).group(3)
> None
>>>> 

>>> foo2.split('zzzfoo123barxxxfooyyy')
['zzz', '123', 'xxxfooyyy']

> Looking at re.split, it doesn't look like it returns the actual matching
> text, so I don't think that fits my need.

split() returns the text between matches and, when the pattern contains
groups, the text matched by those groups as well; it is documented as
doing so. The solution you gave doesn't return "the actual matching
text" for the instances either, just the text in the groups inside that
text, which is exactly what split() returns.
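
As a minimal sketch of that point, assuming the same pattern as foo2 above:
with a single group in the pattern, the list from split() alternates between
the text around the matches and the text captured by the group, so the
captured pieces sit at the odd indexes. The helper name group_texts is
hypothetical, and the code uses Python 3 print() rather than the Python 2
print statement seen elsewhere in this thread.

```python
import re

# Same pattern as foo2 in the message above.
foo2 = re.compile(r'foo(\d+)bar')

def group_texts(s):
    """Return the text captured by the group for each match in s.

    group_texts is a hypothetical helper, not something from the thread.
    """
    # With one capturing group, split() alternates between the text
    # around the matches (even indexes) and the captured text (odd indexes).
    return foo2.split(s)[1::2]

print(group_texts('zzzfoo123barxxxfoo456baryyy'))
print(group_texts('zzzfoo123barxxxfooyyy'))
```

The first call yields both numbers and the second only the first one, which
lines up with group(1) and group(3) in the match-based version.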

While on that topic, I'll note that the solution Tim gave you doesn't
solve the problem as I originally understood it, either. You said you
wanted to match one or two instances, which I read as only one or two
instances, so that more than two instances would be treated as a
failure. On rereading it, I can see where I was wrong.
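
For the stricter reading (exactly one or two instances, anything else is a
failure), the length of the split() result can serve as the count. This is
only a sketch under that assumption: the helper name one_or_two is
hypothetical, and it relies on the pattern having exactly one group, so that
n matches produce 2*n + 1 list elements.

```python
import re

foo2 = re.compile(r'foo(\d+)bar')

def one_or_two(s):
    """Return the captured digits if s holds one or two matches, else None.

    one_or_two is a hypothetical helper used only for illustration.
    """
    parts = foo2.split(s)
    # One capturing group: n matches give 2*n + 1 list elements.
    count = (len(parts) - 1) // 2
    if 1 <= count <= 2:
        return parts[1::2]
    return None

print(one_or_two('zzzfoo123barxxxfoo456baryyy'))
print(one_or_two('foo1barfoo2barfoo3bar'))
```

A pattern with more groups would need the arithmetic adjusted, since split()
adds one list element per group for every match.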

>> As the founder of SPARE...
> Hmm, not a very effective name.  A google search didn't find any obvious
> hits (even after adding the "python" qualifier, and removing "spare time"
> and "spare parts" hits).  (I couldn't find it off your homepage,
> either.)

That's the Society for the Prevention of Abuse of Regular
Expressions. One of these days, my proof-reader will get back to me
and I'll add a link about it to my home page.

    <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


