How do I get to *all* of the groups of an re search?

Andrew Dalke adalke at
Fri Jan 10 17:33:57 EST 2003

Kyler Laird wrote:
> As it is, I am resigned to understanding that Python's re
> module makes an arbitrary and undocumented decision to return
> the last instance of a match for a group.  I'm embarrassed.

It is documented, and behaves as documented.
] If a group number is negative or larger than the number of
] groups defined in the pattern, an IndexError exception is
] raised. If a group is contained in a part of the pattern that did not 
] match, the corresponding result is None. If a group is contained
] in a part of the pattern that matched multiple times, the last
] match is returned.
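
A minimal sketch of the three documented cases, written for a current Python 3 interpreter rather than the Python of this thread. The patterns and sample strings are made up for illustration; only re.match and the match object's group() are used.

```python
import re

# A group inside a repetition: only the final iteration is kept.
m = re.match(r"(?:(\d),?)+", "1,2,3")
last_digit = m.group(1)          # the "3", not "1" or "2"

# A group in a branch that did not take part in the match yields None.
m2 = re.match(r"(a)|(b)", "b")
unmatched = m2.group(1)

# A group number larger than the number of groups raises IndexError.
try:
    m.group(5)
    raised = False
except IndexError:
    raised = True
```

The earlier iterations ("1" and "2") are not kept anywhere on the match object, which is the behaviour being complained about.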

As far as my research went, no standard regexp library could provide
that sort of information.  They only give the last group which
matched a pattern.

I ended up writing my own regexp engine (!), Martel, which is at and based on mxTextTools.
(You may consider that package as well.)  Martel is part of the
Biopython package at .

Here's what you wanted originally

	import re

	text = 'foo foo1 foo2 bar bar1 bar2 bar3'

	test_re = re.compile('([a-z]+)( \\1[0-9]+)+')

	print test_re.findall(text)

     I expected the matches to be something like
	[('foo', [' foo1', ' foo2']), ('bar', [' bar1', ' bar2',
     but it's just this.
	[('foo', ' foo2'), ('bar', ' bar3')]

Here's how to do that in Martel.  Note that I had to tweak
the regexp slightly, since Martel requires the pattern to match
the full string.

Martel converts the match tree into an XML SAX stream, so I
wrote an XML ContentHandler since you want the output to be in a
special form.  Oh, and Martel uses named groups, not the unnamed
ones you have.  (I have patterns with hundreds of parens, and
also I use the regexp group names as XML element names.)

I start off by building the expression and making a Martel
parser out of it.  I then show how it gets converted into
XML.  (The extra '; print' is for clarity.)  I then show how
to handle your specific case.

 >>> import Martel
 >>> pattern = Martel.Re(r"(?P<word>[a-z]+)( (?P<var>(?P=word)[0-9]+))+")
 >>> full_pattern = pattern + Martel.Rep(Martel.Str(" ") + pattern)
 >>> parser = full_pattern.make_parser()
 >>> from xml.sax import saxutils
 >>> parser.setContentHandler(saxutils.XMLGenerator())
 >>> parser.parseString("foo foo1 foo2 bar bar1 bar2 bar3"); print
<?xml version="1.0" encoding="iso-8859-1"?>
<word>foo</word> <var>foo1</var> <var>foo2</var> <word>bar</word> 
<var>bar1</var> <var>bar2</var> <var>bar3</var>
 >>> from xml.sax import handler
 >>> class Capture(handler.ContentHandler):
...     def startDocument(self):
...             self.matches = []
...             self.save_chars = 0
...             self.text = ""
...     def startElement(self, name, attrs):
...             if name in ("var", "word"):
...                     self.text = ""
...                     self.save_chars = 1
...     def characters(self, s):
...             if self.save_chars:
...                     self.text += s
...     def endElement(self, name):
...             if name == "word":
...                     self.matches.append( (self.text, []) )
...                     self.save_chars = 0
...             if name == "var":
...                     self.matches[-1][-1].append(self.text)
...                     self.save_chars = 0
 >>> capture = Capture()
 >>> parser.setContentHandler(capture)
 >>> parser.parseString("foo foo1 foo2 bar bar1 bar2 bar3")
 >>> capture.matches
[('foo', ['foo1', 'foo2']), ('bar', ['bar1', 'bar2', 'bar3'])]

> At the very least, the documentation should be changed to say
> that only the last match of a group will be returned.  Better
> still would be an explanation of why the last one was chosen
> and how that makes Python's behavior more predictable.

It is documented.  And it is consistent with other regexp libs,
eg, I know Perl's works that way.  I have 2nd ed. of Friedl's
regexp book, but I haven't read it yet and I can't find where
he talks about it.  Still, this behaviour is highly consistent
with the other regexp packages.

> No, I *really* want the re module to work like it's documented.
> What will I do when I encounter a need to do something like this
> and it doesn't happen to be related to HTML?

It is working as documented.

You can also solve this without regexps.
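
For instance, here is a rough sketch using plain string methods, written for Python 3. The function name group_words is hypothetical, and the rules are a simplification of the regexp: a token that is all lowercase letters starts a new entry, and a token made of the current word plus digits is appended to it.

```python
def group_words(text):
    """Collect (word, [variants]) pairs from whitespace-separated tokens."""
    matches = []
    for token in text.split():
        if token.isalpha() and token.islower():
            # A bare word such as 'foo' opens a new entry.
            matches.append((token, []))
            continue
        if matches:
            word = matches[-1][0]
            suffix = token[len(word):]
            # A variant is the current word followed only by digits.
            if token.startswith(word) and suffix.isdigit():
                matches[-1][1].append(token)
                continue
        raise ValueError("unexpected token: %r" % (token,))
    return matches


result = group_words("foo foo1 foo2 bar bar1 bar2 bar3")
```

Unlike findall, this sketch raises on input it does not understand instead of skipping it, and it accepts a word with no variants after it.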

> I certainly did not encounter a limitation with REs - I can
> define the solution perfectly using an RE.  The problem is just
> getting the Python re module to share its results.  Python's
> broken re module doesn't make REs any less appropriate.

Show me a module besides Martel which lets you get access to
the parse tree.  I looked at about a dozen packages, read
through Friedl's 1st edition book, and posted to various newsgroups
looking for one.

The problem is that the regexp defines a tree structure, but
the interface to the parsers is linear, and a choice was made
(some time ago) to flatten that tree to contain only the
last groups which match.
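
One way to live with the flattened interface in plain re is to capture the whole repeated run as a single group and then take it apart in a second step. This is only a sketch for Python 3, not something the re module does for you; the name all_groups is hypothetical, and the pattern is a reworking of the one quoted above with the repetition wrapped in an outer group.

```python
import re

# Group 1 is the word; group 2 is the entire run of " word<digits>" items.
# The inner (?: ...) is non-capturing, so nothing is overwritten per iteration.
run_re = re.compile(r"([a-z]+)((?: \1[0-9]+)+)")


def all_groups(text):
    """Return (word, [variants]) for each run found in text."""
    result = []
    for m in run_re.finditer(text):
        word, run = m.group(1), m.group(2)
        # Second pass: split the captured run into its individual items.
        result.append((word, run.split()))
    return result


pairs = all_groups("foo foo1 foo2 bar bar1 bar2 bar3")
```

The split drops the leading spaces, so the items come out as 'foo1' rather than ' foo1', the same shape as the Martel result.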

					Andrew Dalke
					dalke at
