How do I get to *all* of the groups of an re search?
Andrew Dalke
adalke at mindspring.com
Fri Jan 10 17:33:57 EST 2003
Kyler Laird wrote:
> As it is, I am resigned to understanding that Python's re
> module makes an arbitrary and undocumented decision to return
> the last instance of a match for a group. I'm embarrassed.
It is documented, and behaves as documented.
http://www.python.org/doc/current/lib/match-objects.html
] If a group number is negative or larger than the number of
] groups defined in the pattern, an IndexError exception is
] raised. If a group is contained in a part of the pattern that did not
] match, the corresponding result is None. If a group is contained
] in a part of the pattern that matched multiple times, the last
] match is returned.
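
The three documented cases can be checked directly. This is a minimal sketch, assuming a current Python 3 interpreter rather than the Python 2 used elsewhere in this post; the patterns and sample strings are made up for illustration.

```python
import re

# The single group is repeated three times by the trailing '+'.
m = re.match(r"([a-z][0-9] ?)+", "a1 b2 c3")
last = m.group(1)  # only the final repetition is kept

# A group in a branch that took no part in the match gives None.
unmatched = re.match(r"(a)|(b)", "b").group(1)

# Asking for a group the pattern does not define raises IndexError.
try:
    m.group(2)
    raised = False
except IndexError:
    raised = True

print(last, unmatched, raised)
```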
As far as my research went, no standard regexp library could provide
that sort of information. They only give the last group which
matched a pattern.
I ended up writing my own regexp engine (!), Martel, which is at
http://www.dalkescientific.com/Martel/ and based on mxTextTools.
(You may consider that package as well.) Martel is part of the
Biopython package at http://www.biopython.org/ .
Here's what you wanted originally
import re
text = 'foo foo1 foo2 bar bar1 bar2 bar3'
test_re = re.compile('([a-z]+)( \\1[0-9]+)+')
print test_re.findall(text)
I expected the matches to be something like
[('foo', [' foo1', ' foo2']), ('bar', [' bar1', ' bar2', ' bar3'])]
but it's just this.
[('foo', ' foo2'), ('bar', ' bar3')]
Here's how to do that in Martel. Note that I had to tweak
the regexp slightly, since Martel requires the pattern to match
the full string.
Martel converts the match tree into an XML SAX stream, so I
wrote an XML ContentHandler since you want the output to be in a
special form. Oh, and Martel uses named groups, not the unnamed
ones you have. (I have patterns with hundreds of parens, and
also I use the regexp group names as XML element names.)
I start off by building the expression and making a Martel
parser out of it. I then show how it gets converted into
XML. (The extra '; print' is for clarity.) I then show how
to handle your specific case.
>>> import Martel
>>> pattern = Martel.Re(r"(?P<word>[a-z]+)( (?P<var>(?P=word)[0-9]+))+")
>>> full_pattern = pattern + Martel.Rep(Martel.Str(" ") + pattern)
>>>
>>> parser = full_pattern.make_parser()
>>> from xml.sax import saxutils
>>> parser.setContentHandler(saxutils.XMLGenerator())
>>> parser.parseString("foo foo1 foo2 bar bar1 bar2 bar3"); print
<?xml version="1.0" encoding="iso-8859-1"?>
<word>foo</word> <var>foo1</var> <var>foo2</var> <word>bar</word>
<var>bar1</var> <var>bar2</var> <var>bar3</var>
>>>
>>>
>>> from xml.sax import handler
>>> class Capture(handler.ContentHandler):
...     def startDocument(self):
...         self.matches = []     # list of (word, [var, ...]) tuples
...         self.save_chars = 0   # true while inside <word> or <var>
...         self.text = ""
...     def startElement(self, name, attrs):
...         if name in ("var", "word"):
...             self.text = ""
...             self.save_chars = 1
...     def characters(self, s):
...         if self.save_chars:
...             self.text += s
...     def endElement(self, name):
...         if name == "word":
...             self.matches.append( (self.text, []) )
...             self.save_chars = 0
...         if name == "var":
...             self.matches[-1][-1].append(self.text)
...             self.save_chars = 0
...
>>> capture = Capture()
>>> parser.setContentHandler(capture)
>>> parser.parseString("foo foo1 foo2 bar bar1 bar2 bar3")
>>>
>>> capture.matches
[('foo', ['foo1', 'foo2']), ('bar', ['bar1', 'bar2', 'bar3'])]
>>>
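
The handler logic can also be tried without Martel installed. This is only a sketch, assuming Python 3 and the standard xml.sax parser: the XML is typed in by hand, and the <record> root element is a hypothetical wrapper added because a standalone XML document needs a single root. It is not what Martel itself emits.

```python
import xml.sax
from xml.sax import handler


class Capture(handler.ContentHandler):
    def startDocument(self):
        self.matches = []        # list of (word, [var, ...]) tuples
        self.save_chars = False  # true while inside <word> or <var>
        self.text = ""

    def startElement(self, name, attrs):
        if name in ("var", "word"):
            self.text = ""
            self.save_chars = True

    def characters(self, s):
        if self.save_chars:
            self.text += s

    def endElement(self, name):
        if name == "word":
            self.matches.append((self.text, []))
        elif name == "var":
            self.matches[-1][1].append(self.text)
        self.save_chars = False


# Hand-written stand-in for the event stream; <record> is a hypothetical root.
document = (
    b"<record><word>foo</word> <var>foo1</var> <var>foo2</var> "
    b"<word>bar</word> <var>bar1</var> <var>bar2</var> <var>bar3</var></record>"
)

capture = Capture()
xml.sax.parseString(document, capture)
print(capture.matches)
```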
> At the very least, the documentation should be changed to say
> that only the last match of a group will be returned. Better
> still would be an explanation of why the last one was chosen
> and how that makes Python's behavior more predictable.
It is documented. And it is consistent with other regexp libs,
eg, I know Perl's works that way. I have 2nd ed. of Friedl's
regexp book, but I haven't read it yet and I can't find where
he talks about it. Still, this behaviour is highly consistent
with the other regexp packages.
> No, I *really* want the re module to work like it's documented.
> What will I do when I encounter a need to do something like this
> and it doesn't happen to be related to HTML?
It is working as documented.
You can also solve this without regexps.
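
For instance, here is a rough sketch in Python 3 that splits on whitespace and groups the tokens by hand. The function name group_variants is made up for illustration, and unlike the regexp it does not check that words are lowercase letters or that each word has at least one variant.

```python
def group_variants(text):
    """Group each word with the following tokens of the form word+digits."""
    matches = []
    for token in text.split():
        if matches:
            word, variants = matches[-1]
            suffix = token[len(word):]
            # A variant is the current word followed only by digits.
            if token.startswith(word) and suffix.isdigit():
                variants.append(token)
                continue
        # Anything else starts a new word with no variants yet.
        matches.append((token, []))
    return matches


result = group_variants("foo foo1 foo2 bar bar1 bar2 bar3")
print(result)
```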
> I certainly did not encounter a limitation with REs - I can
> define the solution perfectly using an RE. The problem is just
> getting the Python re module to share its results. Python's
> broken re module doesn't make REs any less appropriate.
Show me a module besides Martel which lets you get access to
the parse tree. I looked at about a dozen packages, read
through Friedl's 1st edition book, and posted to various newsgroups
looking for one.
The problem is that the regexp defines a tree structure, but
the interface to the parsers is linear, and a choice was
made (some time ago) to flatten that tree to only contain the
last groups which match.
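
One way to work around the flattening with the standard re module is to capture the whole repeated span as a single group and then take it apart in a second pass. This is a sketch assuming Python 3; the names OUTER and all_groups are hypothetical, and the second pass relies on the variants being separated by single spaces as in the example text.

```python
import re

# Group 1 is the word; group 2 wraps the entire repetition, so nothing
# is lost when the inner (non-capturing) group matches several times.
OUTER = re.compile(r"([a-z]+)((?: \1[0-9]+)+)")


def all_groups(text):
    results = []
    for m in OUTER.finditer(text):
        word, repeated = m.group(1), m.group(2)
        # Second pass: split the captured span into its individual items.
        results.append((word, repeated.split()))
    return results


found = all_groups("foo foo1 foo2 bar bar1 bar2 bar3")
print(found)
```

This rebuilds one level of the tree by hand; deeper nesting would need a further pass per level.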
Andrew Dalke
dalke at dalkescientific.com