how to get all repeated group with regular expression

scsoce scsoce at gmail.com
Fri Nov 21 21:12:31 EST 2008


MRAB wrote:
> <div class="moz-text-flowed" style="font-family: -moz-fixed">Steve 
> Holden wrote:
>> Please keep this on the list.
>>
>> scsoce wrote:
>>> Steve Holden wrote:
>>>> scsoce wrote:
>>>>  
>>>>> say, when I try to search and match every char  from variable length
>>>>> string, such as string '123456',  i tried re.findall( r'(\d)*, 
>>>>> '12346' )
>>>>>     
>>>> I think you will find you missed a quote out there. Always better to
>>>> copy and paste ...
>>>>
>>>>  
>>>>> , but only get '6' and Python doc indeed say: "If a group is 
>>>>> contained
>>>>> in a part of the pattern that matched multiple times, the last 
>>>>> match is
>>>>> returned."
>>>>>     
>>>> So use
>>>>
>>>>     r'(\d*)'
>>>>
>>>> instead and then the group includes all the digits you match.
>>>>
>>>>  
>>>>> cause the regx engine cannot remember all the past history then ?  
>>>>> is it
>>>>> nature to all regx engine or only to Python ?
>>>>>     
>>>> Different regex engines have different capabilities, so I can't 
>>>> speak to
>>>> them all. If you wanted *all* the matches of *all* groups, how 
>>>> would you
>>>> have them returned? As a list? That would make the case where there 
>>>> was
>>>> only one match  much tricker to handle. And what would you do with
>>>>
>>>>   r'((\w)*\d)*)'
>>>>
>>>> Also, what about named groups? I can see enough potential 
>>>> implementation
>>>> issues that I can perfectly understand why Python works the way it 
>>>> does,
>>>> so I'd be interested to know why it doesn't makes sense to you, and 
>>>> what
>>>> you would prefer it to do.
>>>>
>>>> regards
>>>>  Steve
>>>>   
>>> maybe my expression was not clear. I  want to capture every matched 
>>> part
>>> in a repeated pattern, not only the last,  say, for string '123456',  I
>>> want to back reference any one char, not only the '6'. and i know the
>>> example is very simple, so we can got the whole string using regx and
>>> get every char using other python statements, but if the pattern in
>>> group is complex?
>>> and I test in VIM, it can do the 'back reference':
>>> ==you text in vim:
>>> 123456
>>> == pattern:
>>> :%s/\(\d\)*/$2
>>> text will turn to be:
>>> 2
>>>
>> 'Fraid the Python re implementers just decided not to do it that way.
>>
> Nor Perl.
>
> Probably what you want is re.findall(r"(\d)", "123456"), which returns 
> a list of what it captured.
>
>
> </div>
Yes, you are right, but this way findall() captures only the 'top' group.
What I really need is to capture nested and repeated patterns. For
example, a <table> tag in HTML contains many <tr> tags, each <tr>
contains many <td> tags, and the data in the <td> tags is what I need,
so I wrote the regex like this:
    # raw string, so that \n and \s reach the re module; compile with re.VERBOSE
    regx = r'''
              <table.*\n
               (
               (\s*<tr.*\n
                    (\s*<td.*</td>\n|\n)*
                \s*</tr>\n
               |\n)*
               )
               \s*</table>
                '''
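
A small check of what a pattern like this returns, as a sketch only. The
sample HTML and the names sample_html, pattern and m are made up for
illustration, and the pattern is assumed to be a raw string compiled with
re.VERBOSE. The repeated groups keep only their last repetition, so only
the last <tr> and the last <td> are available after the match:

```python
import re

# Hypothetical sample input, one tag per line as the pattern expects.
sample_html = """<table>
  <tr>
    <td>a1</td>
    <td>a2</td>
  </tr>
  <tr>
    <td>b1</td>
  </tr>
</table>
"""

# Same shape as the pattern above; raw string + re.VERBOSE are assumptions.
pattern = re.compile(r'''
    <table.*\n
    (
    (\s*<tr.*\n
        (\s*<td.*</td>\n|\n)*
     \s*</tr>\n
    |\n)*
    )
    \s*</table>
    ''', re.VERBOSE)

m = pattern.search(sample_html)

# group(1): everything between <table> and </table>
# group(2): only the LAST <tr> block that was matched
# group(3): only the LAST <td> line that was matched
last_row = m.group(2)
last_cell = m.group(3).strip()
print(last_cell)
```

The earlier cells (a1, a2) are matched but are not kept in any group.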
Steve Holden wrote:
> I can see enough potential implementation
> issues that I can perfectly understand why Python works the way it does,
> so I'd be interested to know why it doesn't makes sense to you, and what
> you would prefer it to do.
>   

As Steve said, if re really cannot do this kind of work, I will have to
split the one-line regex up: capture <table> first, then loop to capture
each <tr>, then each <td>, and so on. I do not like this approach
compared with the single clean regex above.
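
The split-up approach might look roughly like the sketch below. The
names TABLE_RE, TR_RE, TD_RE and extract_cells are hypothetical, the
sample HTML is made up, and the patterns assume simple, non-nested
tables; this is not meant as a general HTML parser.

```python
import re

# Hypothetical sample input.
html = """<table border="1">
  <tr>
    <td>a1</td>
    <td>a2</td>
  </tr>
  <tr>
    <td>b1</td>
  </tr>
</table>
"""

# One simple pattern per nesting level. re.DOTALL lets "." cross newlines,
# and the lazy ".*?" stops at the first closing tag (no nested tables).
TABLE_RE = re.compile(r'<table.*?>(.*?)</table>', re.DOTALL)
TR_RE = re.compile(r'<tr.*?>(.*?)</tr>', re.DOTALL)
TD_RE = re.compile(r'<td.*?>(.*?)</td>', re.DOTALL)

def extract_cells(text):
    """Return a list of tables, each a list of rows, each a list of cell strings."""
    tables = []
    for table_body in TABLE_RE.findall(text):
        rows = []
        for row_body in TR_RE.findall(table_body):
            # findall returns every <td> body in this row, not just the last
            rows.append(TD_RE.findall(row_body))
        tables.append(rows)
    return tables

cells = extract_cells(html)
print(cells)
```

Each findall() call returns all matches at its own level, so nothing is
lost to the "last match only" rule for repeated groups.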



