How do I get to *all* of the groups of an re search?

Andrew Dalke adalke at mindspring.com
Fri Jan 10 21:17:06 EST 2003


Bengt Richter wrote:
> I was surprised to see the same substring apparently re-used though.
> I had expected that whatever match was decided on "used up" the text
> so that it could not be returned in another match pattern, but it looks
> like the last-of-multiple-matches logic kicks in even when the multiple
> match has already occurred in a single span. E.g.,
   ...
> Adding some parens:
>  >>> test_re = re.compile('([a-z]+)(( \\1[0-9]+)+)')
>  >>> print test_re.findall(text)
>  [('foo', ' foo1 foo2', ' foo2'), ('bar', ' bar1 bar2 bar3', ' bar3')]
> 
> Why are foo2 and bar3 showing up twice each? E.g., why not '' in the last position?
> Is that the way it is supposed to work? Just asking ;-)

        '([a-z]+)(( \\1[0-9]+)+)'

    'foo foo1 foo2 bar bar1 bar2 bar3'

    'foo' matches ([a-z]+)  which is group 1
       ' foo1' matches ( \\1[0-9]+) which is group 3
       ' foo1' matches (( \\1[0-9]+)+) which is group 2
            ' foo2' matches ( \\1[0-9]+) which is group 3
       ' foo1 foo2' matches (( \\1[0-9]+)+) which is group 2
    'foo foo1 foo2' matches the whole pattern, so we stop

At this point we have the three groups as
    ('foo', ' foo1 foo2', ' foo2')
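
If you want to check the walk-through above yourself, here is a minimal
sketch. It assumes a current Python 3 (print as a function) and writes the
pattern as a raw string so the backreference needs no doubled backslash.
The name 'first' is mine.

```python
import re

# Same pattern as in the thread, as a raw string so \1 is a backreference.
test_re = re.compile(r'([a-z]+)(( \1[0-9]+)+)')
text = 'foo foo1 foo2 bar bar1 bar2 bar3'

# search() returns only the first match, which is the one traced above.
first = test_re.search(text)

print(first.group(0))   # 'foo foo1 foo2'
print(first.groups())   # ('foo', ' foo1 foo2', ' foo2')
```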

Expressed as a tree this is

          ([a-z]+)(( \\1[0-9]+)+)
          \------/\-------------/
            /            |
           /             |
          /              |
         |         (( \\1[0-9]+)+)
         |             Group 2
         |            /       \
         |           /         \
      ([a-z]+) ( \\1[0-9]+) ( \\1[0-9]+)
       Group 1   Group 3      Group 3
         |         |            |
       [a-z]+    \\1[0-9]+   \\1[0-9]+
         |         |            |
       'foo'    ' foo1'      ' foo2'

            'foo foo1 foo2'

Traverse this tree from left to right.  When there is a
'Group', use the text underneath it for the value of the
group.  Last group wins.  In this case we have
    Group 1 == 'foo'
    Group 2 == ' foo1 foo2'
    Group 3 == ' foo1'
    Group 3 == ' foo2' which replaces ' foo1'
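
The "last group wins" rule can be sketched as a plain dictionary update.
This is only an illustration of the flattening, not how the re module is
implemented, and the names (events, flattened, inner, all_group3) are mine.
The second half shows one possible workaround for getting every repetition
of group 3: re-scan the text of group 2 with the inner pattern.

```python
import re

# Groups in left-to-right traversal order, as (group number, text).
events = [(1, 'foo'), (2, ' foo1 foo2'), (3, ' foo1'), (3, ' foo2')]

flattened = {}
for group_number, value in events:
    # A later value for the same group overwrites the earlier one.
    flattened[group_number] = value

print(flattened)    # {1: 'foo', 2: ' foo1 foo2', 3: ' foo2'}

# Workaround: recover every repetition of group 3 from group 2's text.
match = re.search(r'([a-z]+)(( \1[0-9]+)+)', 'foo foo1 foo2 bar bar1 bar2 bar3')
# Rebuild the inner pattern with the text that group 1 actually matched.
inner = re.compile(' ' + re.escape(match.group(1)) + '[0-9]+')
all_group3 = inner.findall(match.group(2))

print(all_group3)   # [' foo1', ' foo2']
```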

(In Martel this would correspond to
   startDocument()
   startElement("Group 1", {})
   characters("foo")
   endElement("Group 1")
   startElement("Group 2", {})
   startElement("Group 3", {})
   characters(" foo1")
   endElement("Group 3")
   startElement("Group 3", {})
   characters(" foo2")
   endElement("Group 3")
   endElement("Group 2")
   endDocument()
and is mapped to the XML
   <Group1>foo</Group1><Group2><Group3> foo1</Group3><Group3> foo2</Group3></Group2>
)
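
To see how such an event stream turns into that XML, here is a rough
sketch that replays the same calls through the standard library's
xml.sax.saxutils.XMLGenerator. It does not use Martel itself; the stand-in
generator and the names (out, gen, xml_text) are my assumptions. The
element names drop the space, as in the XML above, because an XML name
cannot contain one.

```python
import io
from xml.sax.saxutils import XMLGenerator

out = io.StringIO()
gen = XMLGenerator(out, encoding='utf-8')   # stand-in for a Martel handler

gen.startDocument()                         # writes the XML declaration
gen.startElement('Group1', {})
gen.characters('foo')
gen.endElement('Group1')
gen.startElement('Group2', {})
for piece in (' foo1', ' foo2'):            # one Group3 element per repetition
    gen.startElement('Group3', {})
    gen.characters(piece)
    gen.endElement('Group3')
gen.endElement('Group2')
gen.endDocument()

xml_text = out.getvalue()
print(xml_text)
```

Unlike the flattened tuple of groups, both Group3 elements survive here,
because nothing is overwritten.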

Getting back to the example ...

Since this is a 'findall' we start again, searching for the
next match after the end of 'foo foo1 foo2'.  The next character
is a space, which cannot match ([a-z]+), so the search moves on
and the next match starts at the 'bar'.


    'foo foo1 foo2 bar bar1 bar2 bar3'
                  'bar' matches ([a-z]+) which is group 1
                     ' bar1' matches ( \\1[0-9]+) which is group 3
                     ' bar1' matches (( \\1[0-9]+)+) which is group 2
                          ' bar2' matches ( \\1[0-9]+) which is group 3
                     ' bar1 bar2' matches (( \\1[0-9]+)+) which is group 2
                               ' bar3' matches ( \\1[0-9]+) which is group 3
                     ' bar1 bar2 bar3' matches (( \\1[0-9]+)+) which is group 2

At this point we have reached the end of the string and the
regexp is complete, so we stop with the three groups as

    ('bar', ' bar1 bar2 bar3', ' bar3')
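
A short sketch of the restart behaviour, again assuming Python 3:
finditer() walks the same non-overlapping matches that findall() reports,
and also gives the span of each one, so you can see that the second match
begins just past the space that follows the first. The name 'matches' is
mine.

```python
import re

test_re = re.compile(r'([a-z]+)(( \1[0-9]+)+)')
text = 'foo foo1 foo2 bar bar1 bar2 bar3'

# Collect (span, groups) for each non-overlapping match, left to right.
matches = [(m.span(), m.groups()) for m in test_re.finditer(text)]

for span, groups in matches:
    print(span, groups)
# (0, 13) ('foo', ' foo1 foo2', ' foo2')
# (14, 32) ('bar', ' bar1 bar2 bar3', ' bar3')
```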


The ASCII tree diagram is left as an exercise.

Does that make sense?

>>The problem is that the regexp defines a tree structure, but
>>the interface to the parsers is linear, and there was a choice
>>made (some time ago) to flatten that tree to only contain the
>>last groups which match.
> 
> Even if they've already been matched?

I'm sorry, I don't understand your question.

					Andrew
					dalke at dalkescientific.com




