How do I get to *all* of the groups of an re search?

Bengt Richter bokr at oz.net
Fri Jan 10 20:20:49 EST 2003


On Fri, 10 Jan 2003 15:33:57 -0700, Andrew Dalke <adalke at mindspring.com> wrote:

>Kyler Laird wrote:
>> As it is, I am resigned to understanding that Python's re
>> module makes an arbitrary and undocumented decision to return
>> the last instance of a match for a group.  I'm embarrassed.
>
>It is documented, and behaves as documented.
>
>http://www.python.org/doc/current/lib/match-objects.html
>] If a group number is negative or larger than the number of
>] groups defined in the pattern, an IndexError exception is
>] raised. If a group is contained in a part of the pattern that did not 
>] match, the corresponding result is None. If a group is contained
>] in a part of the pattern that matched multiple times, the last
>] match is returned.
>
>As far as my research went, no standard regexp library could provide
>that sort of information.  They only give the last group which
>matched a pattern.
>
I was surprised to see the same substring apparently re-used, though.
I had expected that whatever match was decided on "used up" the text
so that it could not be returned by another group, but it looks
like the last-of-multiple-matches logic kicks in even when the whole
repetition has already been captured by an enclosing group. E.g.,

The original:
 >>> import re
 >>> text = 'foo foo1 foo2 bar bar1 bar2 bar3'
 >>> test_re = re.compile(r'([a-z]+)( \1[0-9]+)+')
 >>> print(test_re.findall(text))
 [('foo', ' foo2'), ('bar', ' bar3')]

Adding some parens:
 >>> test_re = re.compile(r'([a-z]+)(( \1[0-9]+)+)')
 >>> print(test_re.findall(text))
 [('foo', ' foo1 foo2', ' foo2'), ('bar', ' bar1 bar2 bar3', ' bar3')]

Why are foo2 and bar3 showing up twice each? E.g., why not '' in the last position?
Is that the way it is supposed to work? Just asking ;-)
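
One way to look at it, as a small sketch rather than a definitive answer:
the added parens make group 3 sit *inside* group 2, so the text of the
last repetition belongs to both groups instead of being consumed by one
of them. Printing the spans for the first match makes the nesting visible
(the variable names m and spans are just for illustration):

```python
import re

text = 'foo foo1 foo2 bar bar1 bar2 bar3'
# Group 2 wraps the whole repetition; group 3 is the repeated inner group.
test_re = re.compile(r'([a-z]+)(( \1[0-9]+)+)')

m = test_re.search(text)
# Each span is a (start, end) pair of offsets into text.
spans = [m.span(i) for i in (1, 2, 3)]
print(spans)       # [(0, 3), (3, 13), (8, 13)]
print(m.group(3))  # ' foo2', the last repetition only
```

The span of group 3 falls entirely within the span of group 2, which is
consistent with the documented "last match is returned" behaviour.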

[... snipping some other boggledystuff ...]
>
>It is working as documented.
>
>You can also solve this without regexps.
>
>> I certainly did not encounter a limitation with REs - I can
>> define the solution perfectly using an RE.  The problem is just
>> getting the Python re module to share its results.  Python's
>> broken re module doesn't make REs any less appropriate.
>
>Show me a module besides Martel which lets you get access to
>the parse tree.  I looked at about a dozen packages, read
>through Friedl's 1st edition book, and posted to various newsgroups
>looking for one.
>
>The problem is that the regexp defines a tree structure, but
>the interface to the parsers is linear, and there was a choice
>made (some time ago) to flatten that tree to only contain the
>last groups which match.
Even if they've already been matched?
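
For what it's worth, here is a hedged sketch of one workaround for getting
at every repetition with the plain re module: capture the whole repeated
run in one group, then scan that substring a second time. The names
outer_re and item_re are made up for this example, and item_re assumes
each item looks like a space followed by letters and digits, as in the
sample text.

```python
import re

text = 'foo foo1 foo2 bar bar1 bar2 bar3'

# Outer pattern: group 1 is the prefix, group 2 is the entire repeated run.
# The inner group is non-capturing, so nothing is flattened to "last match".
outer_re = re.compile(r'([a-z]+)((?: \1[0-9]+)+)')
# Hypothetical helper pattern that picks individual items out of the run.
item_re = re.compile(r' ([a-z]+[0-9]+)')

results = []
for m in outer_re.finditer(text):
    # Second pass over the captured run recovers every repetition.
    results.append((m.group(1), item_re.findall(m.group(2))))

print(results)
```

This does not expose a parse tree; it only rebuilds one level of it by
matching twice.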

Regards,
Bengt Richter



