How do I get to *all* of the groups of an re search?

Andrew Dalke adalke at mindspring.com
Fri Jan 10 17:33:57 EST 2003


Kyler Laird wrote:
> As it is, I am resigned to understanding that Python's re
> module makes an arbitrary and undocumented decision to return
> the last instance of a match for a group.  I'm embarrassed.

It is documented, and behaves as documented.

http://www.python.org/doc/current/lib/match-objects.html
] If a group number is negative or larger than the number of
] groups defined in the pattern, an IndexError exception is
] raised. If a group is contained in a part of the pattern that did not 
] match, the corresponding result is None. If a group is contained
] in a part of the pattern that matched multiple times, the last
] match is returned.

As far as my research went, no standard regexp library provides
that sort of information.  They only give the last text which
matched a group.
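
To make the documented behaviour concrete, here is a minimal sketch
using only the standard re module.  It is written in current Python 3
syntax rather than the Python of this post, and the sample string is
made up for illustration.  A group inside a repeated part of the
pattern reports only its final repetition.

```python
import re

# The capturing group (\d) sits inside a repetition and is matched three times.
m = re.match(r"(?:(\d),?)+", "1,2,3")

# The whole match covers all three digits ...
whole = m.group(0)

# ... but the group itself keeps only its last repetition.
last = m.group(1)
print(whole, last)
```

The earlier repetitions ('1' and '2') are not available anywhere on
the match object.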

I ended up writing my own regexp engine (!), Martel, which is at
http://www.dalkescientific.com/Martel/ and based on mxTextTools.
(You may consider that package as well.)  Martel is part of the
Biopython package at http://www.biopython.org/ .

Here's what you wanted originally:

	import re

	text = 'foo foo1 foo2 bar bar1 bar2 bar3'

	test_re = re.compile('([a-z]+)( \\1[0-9]+)+')

	print test_re.findall(text)

     I expected the matches to be something like
	[('foo', [' foo1', ' foo2']), ('bar', [' bar1', ' bar2',
              'bar3'])]
     but it's just this.
	[('foo', ' foo2'), ('bar', ' bar3')]

Here's how to do that in Martel.  Note that I had to tweak
the regexp slightly, since Martel requires the pattern to match
the full string.

Martel converts the match tree into an XML SAX stream, so I
wrote an XML ContentHandler since you want the output to be in a
special form.  Oh, and Martel uses named groups, not the unnamed
ones you have.  (I have patterns with hundreds of parens, and
also I use the regexp group names as XML element names.)

I start off by building the expression and making a Martel
parser out of it.  I then show how it gets converted into
XML.  (The extra '; print' is for clarity.)  I then show how
to handle your specific case.

 >>> import Martel
 >>> pattern = Martel.Re(r"(?P<word>[a-z]+)( (?P<var>(?P=word)[0-9]+))+")
 >>> full_pattern = pattern + Martel.Rep(Martel.Str(" ") + pattern)
 >>>
 >>> parser = full_pattern.make_parser()
 >>> from xml.sax import saxutils
 >>> parser.setContentHandler(saxutils.XMLGenerator())
 >>> parser.parseString("foo foo1 foo2 bar bar1 bar2 bar3"); print
<?xml version="1.0" encoding="iso-8859-1"?>
<word>foo</word> <var>foo1</var> <var>foo2</var> <word>bar</word> 
<var>bar1</var> <var>bar2</var> <var>bar3</var>
 >>>
 >>> from xml.sax import handler
 >>> class Capture(handler.ContentHandler):
...     def startDocument(self):
...             self.matches = []
...             self.save_chars = 0
...             self.text = ""
...     def startElement(self, name, attrs):
...             if name in ("var", "word"):
...                     self.text = ""
...                     self.save_chars = 1
...     def characters(self, s):
...             if self.save_chars:
...                     self.text += s
...     def endElement(self, name):
...             if name == "word":
...                     self.matches.append( (self.text, []) )
...                     self.save_chars = 0
...             if name == "var":
...                     self.matches[-1][-1].append(self.text)
...                     self.save_chars = 0
...
 >>> capture = Capture()
 >>> parser.setContentHandler(capture)
 >>> parser.parseString("foo foo1 foo2 bar bar1 bar2 bar3")
 >>>
 >>> capture.matches
[('foo', ['foo1', 'foo2']), ('bar', ['bar1', 'bar2', 'bar3'])]
 >>>


> At the very least, the documentation should be changed to say
> that only the last match of a group will be returned.  Better
> still would be an explanation of why the last one was chosen
> and how that makes Python's behavior more predictable.

It is documented.  And it is consistent with other regexp libs,
eg, I know Perl's works that way.  I have 2nd ed. of Friedl's
regexp book, but I haven't read it yet and I can't find where
he talks about it.  Still, this behaviour is highly consistent
with the other regexp packages.

> No, I *really* want the re module to work like it's documented.
> What will I do when I encounter a need to do something like this
> and it doesn't happen to be related to HTML?

It is working as documented.

You can also solve this without regexps.
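
As a rough sketch of what that could look like (Python 3 syntax; the
function name group_variants is made up for this example), split on
whitespace and attach each token to the preceding word when it is
that word followed by digits.  Unlike the regexp, this does not check
that the word is all lowercase letters or that it has at least one
variant.

```python
def group_variants(text):
    """Group each bare word with the following tokens of the form word+digits."""
    matches = []
    for token in text.split():
        if matches:
            word, variants = matches[-1]
            suffix = token[len(word):]
            # A variant is the current word followed only by digits.
            if token.startswith(word) and suffix.isdigit():
                variants.append(token)
                continue
        # Anything else starts a new (word, variants) entry.
        matches.append((token, []))
    return matches


result = group_variants("foo foo1 foo2 bar bar1 bar2 bar3")
print(result)
# [('foo', ['foo1', 'foo2']), ('bar', ['bar1', 'bar2', 'bar3'])]
```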

> I certainly did not encounter a limitation with REs - I can
> define the solution perfectly using an RE.  The problem is just
> getting the Python re module to share its results.  Python's
> broken re module doesn't make REs any less appropriate.

Show me a module besides Martel which lets you get access to
the parse tree.  I looked at about a dozen packages, read
through Friedl's 1st edition book, and posted to various newsgroups
looking for one.

The problem is that the regexp defines a tree structure, but
the interface to the parsers is linear, and a choice was
made (some time ago) to flatten that tree to only contain the
last groups which match.
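
Given that flattening, one workaround with the standard re module is
to walk the tree by hand, one level at a time.  The following is only
a sketch in Python 3 syntax, assuming the pattern can be split into an
outer match plus a second pass over the repeated span; it does not
give access to a real parse tree.

```python
import re

text = "foo foo1 foo2 bar bar1 bar2 bar3"

# Outer pattern: a word, then the whole run of " word<digits>" repeats
# captured as a single group instead of one group per repetition.
outer = re.compile(r"([a-z]+)((?: \1[0-9]+)+)")

matches = []
for m in outer.finditer(text):
    word, tail = m.group(1), m.group(2)
    # Second pass over the repeated span to list every repetition.
    variants = re.findall(re.escape(word) + r"[0-9]+", tail)
    matches.append((word, variants))

print(matches)
# [('foo', ['foo1', 'foo2']), ('bar', ['bar1', 'bar2', 'bar3'])]
```

Each extra level of nesting in the pattern needs another pass like
this, which is why it gets unwieldy for deeply nested expressions.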

					Andrew Dalke
					dalke at dalkescientific.com




