How do I get to *all* of the groups of an re search?
Andrew Dalke
adalke at mindspring.com
Fri Jan 10 21:17:06 EST 2003
Bengt Richter wrote:
> I was surprised to see the same substring apparently re-used though.
> I had expected that whatever match was decided on "used up" the text
> so that it could not be returned in another match pattern, but it looks
> like the last-of-multiple-matches logic kicks in even when the multiple
> match has already occurred in a single span. E.g.,
...
> Adding some parens:
> >>> test_re = re.compile('([a-z]+)(( \\1[0-9]+)+)')
> >>> print test_re.findall(text)
> [('foo', ' foo1 foo2', ' foo2'), ('bar', ' bar1 bar2 bar3', ' bar3')]
>
> Why are foo2 and bar3 showing up twice each? E.g., why not '' in the last position?
> Is that the way it is supposed to work? Just asking ;-)
The pattern and the text are
  '([a-z]+)(( \\1[0-9]+)+)'
  'foo foo1 foo2 bar bar1 bar2 bar3'
'foo' matches ([a-z]+) which is group 1
' foo1' matches ( \\1[0-9]+) which is group 3
' foo1' matches (( \\1[0-9]+)+) which is group 2
' foo2' matches ( \\1[0-9]+) which is group 3
' foo1 foo2' matches (( \\1[0-9]+)+) which is group 2
'foo foo1 foo2' matches the whole pattern, so we stop
At this point we have the three groups as
('foo', ' foo1 foo2', ' foo2')
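The trace above can be checked directly with the re module. This is only a small sketch: the names `pattern`, `text`, `m` and `spans` are mine, and I use a raw string, so '\1' here is the same backreference that the original writes as '\\1'.

```python
import re

# Same pattern as in the post, written as a raw string.
pattern = re.compile(r'([a-z]+)(( \1[0-9]+)+)')
text = 'foo foo1 foo2 bar bar1 bar2 bar3'

m = pattern.search(text)           # first match only
print(m.group(0))                  # the whole match
print(m.groups())                  # groups 1, 2 and 3

# Where each group ended up in the text. Group 3 reports only
# the span of its last repetition, not the first one.
spans = {n: m.span(n) for n in (1, 2, 3)}
print(spans)
```

The span of group 3 covers only ' foo2'; the earlier ' foo1' was matched by the same group but is no longer recorded.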
Expressed as a tree this is
      ([a-z]+)(( \\1[0-9]+)+)
      \------/\-------------/
        /            |
      /              |
    /                |
   |          (( \\1[0-9]+)+)
   |              Group 2
   |              /     \
   |            /         \
([a-z]+)  (\\1[0-9]+)  (\\1[0-9]+)
Group 1     Group 3      Group 3
   |           |            |
 [a-z]+    \\1[0-9]+    \\1[0-9]+
   |           |            |
 'foo'      ' foo1'      ' foo2'
        'foo foo1 foo2'
Traverse this tree from left to right. When there is a
'Group', use the text underneath it for the value of the
group. Last group wins. In this case we have
Group 1 == 'foo'
Group 2 == ' foo1 foo2'
Group 3 == ' foo1'
Group 3 == ' foo2' which replaces ' foo1'
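Since the flat interface keeps only the last value of group 3, one possible workaround is to rescan the text of group 2 with a second pattern. This is a sketch under my own assumptions, not something the re module does for you: `inner` and `all_group3` are illustrative names, and it relies on group 2 consisting of nothing but repetitions of group 3.

```python
import re

pattern = re.compile(r'([a-z]+)(( \1[0-9]+)+)')
m = pattern.search('foo foo1 foo2 bar bar1 bar2 bar3')

# Rebuild the inner sub-pattern with the backreference replaced by
# the literal text that group 1 actually matched.
inner = re.compile(' ' + re.escape(m.group(1)) + '[0-9]+')

# Every repetition of group 3, in left-to-right order.
all_group3 = inner.findall(m.group(2))
print(all_group3)

# The value that re reports for group 3 is simply the last of these.
print(all_group3[-1] == m.group(3))
```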
(In Martel this would correspond to
startDocument()
startElement("Group 1", {})
characters("foo")
endElement("Group 1")
startElement("Group 2", {})
startElement("Group 3", {})
characters(" foo1")
endElement("Group 3")
startElement("Group 3", {})
characters(" foo2")
endElement("Group 3")
endElement("Group 2")
endDocument()
and is mapped to the XML
<Group1>foo</Group1><Group2><Group3> foo1</Group3><Group3>
foo2</Group3></Group2>
)
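To see how such a stream of events turns into XML, the same calls can be fed by hand to the standard library's XMLGenerator. This is not Martel itself, only a stand-in SAX handler, and the events are typed in rather than produced by a parser. I use "Group1" and so on as element names, as in the XML above, because XML names cannot contain a space.

```python
import io
from xml.sax.saxutils import XMLGenerator

out = io.StringIO()
handler = XMLGenerator(out, encoding="utf-8")

# The same events as listed above, issued by hand.
handler.startDocument()
handler.startElement("Group1", {})
handler.characters("foo")
handler.endElement("Group1")
handler.startElement("Group2", {})
handler.startElement("Group3", {})
handler.characters(" foo1")
handler.endElement("Group3")
handler.startElement("Group3", {})
handler.characters(" foo2")
handler.endElement("Group3")
handler.endElement("Group2")
handler.endDocument()

xml_text = out.getvalue()
print(xml_text)
```

Both Group3 elements survive in the output, which is the point: the event stream keeps the whole tree, while the flat groups tuple keeps only the last Group 3.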
Getting back to the example ...
Since this is a 'findall' we start again, searching for the
next match after the end of 'foo foo1 foo2'. The next character
is a space, which cannot match [a-z]+, so the search moves on
and the next match starts at 'bar':
'foo foo1 foo2 bar bar1 bar2 bar3'
'bar' matches ([a-z]+) which is group 1
' bar1' matches ( \\1[0-9]+) which is group 3
' bar1' matches (( \\1[0-9]+)+) which is group 2
' bar2' matches ( \\1[0-9]+) which is group 3
' bar1 bar2' matches (( \\1[0-9]+)+) which is group 2
' bar3' matches ( \\1[0-9]+) which is group 3
' bar1 bar2 bar3' matches (( \\1[0-9]+)+) which is group 2
At this point we have reached the end of the string and the
regexp is complete, so we stop with the three groups as
('bar', ' bar1 bar2 bar3', ' bar3')
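The restart behaviour of findall can be made visible with finditer, which yields a match object for each non-overlapping match instead of only the groups. A short sketch, with `matches` as my own name for the collected results:

```python
import re

pattern = re.compile(r'([a-z]+)(( \1[0-9]+)+)')
text = 'foo foo1 foo2 bar bar1 bar2 bar3'

# One (span, groups) pair per non-overlapping match, left to right.
matches = [(m.span(), m.groups()) for m in pattern.finditer(text)]
for span, groups in matches:
    print(span, groups)
```

The first match ends at offset 13 and the second starts at offset 14, so the space between 'foo2' and 'bar' belongs to neither.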
The ASCII tree diagram is left as an exercise.
Does that make sense?
>>The problem is that the regexp defines a tree structure, but
>>the interface to the parsers is linear, and there was a choice
>>made (some time ago) to flatten that tree to only contain the
>>last groups which match.
>
> Even if they've already been matched?
I'm sorry, I don't understand your question.
Andrew
dalke at dalkescientific.com