how to get all repeated group with regular expression

Sat Nov 22 10:22:28 EST 2008

scsoce wrote:
> MRAB wrote:
>> <div class="moz-text-flowed" style="font-family: -moz-fixed">Steve 
>> Holden wrote:
>>> Please keep this on the list.
>>>
>>> scsoce wrote:
>>>> Steve Holden wrote:
>>>>> scsoce wrote:
>>>>>  
>>>>>> say, when I try to search and match every char  from variable length
>>>>>> string, such as string '123456',  i tried re.findall( r'(\d)*, 
>>>>>> '12346' )
>>>>>>     
>>>>> I think you will find you missed a quote out there. Always better to
>>>>> copy and paste ...
>>>>>
>>>>>  
>>>>>> , but only get '6' and Python doc indeed say: "If a group is 
>>>>>> contained
>>>>>> in a part of the pattern that matched multiple times, the last 
>>>>>> match is
>>>>>> returned."
>>>>>>     
>>>>> So use
>>>>>
>>>>>     r'(\d*)'
>>>>>
>>>>> instead and then the group includes all the digits you match.
>>>>>
>>>>>  
>>>>>> Is that because the regex engine cannot remember all the past matches?
>>>>>> Is this inherent to all regex engines, or only to Python's?
>>>>>>     
>>>>> Different regex engines have different capabilities, so I can't speak
>>>>> to them all. If you wanted *all* the matches of *all* groups, how would
>>>>> you have them returned? As a list? That would make the case where there
>>>>> was only one match much trickier to handle. And what would you do with
>>>>>
>>>>>   r'((\w)*\d)*'
>>>>>
>>>>> Also, what about named groups? I can see enough potential implementation
>>>>> issues that I can perfectly understand why Python works the way it does,
>>>>> so I'd be interested to know why it doesn't make sense to you, and what
>>>>> you would prefer it to do.
>>>>>
>>>>> regards
>>>>>  Steve
>>>>>   
>>>> Maybe my explanation was not clear. I want to capture every matched
>>>> part of a repeated pattern, not only the last. For the string '123456',
>>>> I want to be able to back-reference any one char, not only the '6'. I
>>>> know the example is very simple, so we could get the whole string with
>>>> the regex and then get each char with other Python statements, but what
>>>> if the pattern in the group is complex?
>>>> I tested in Vim, and it can do the 'back reference':
>>>> == your text in vim:
>>>> 123456
>>>> == pattern:
>>>> :%s/\(\d\)*/$2
>>>> the text will become:
>>>> 2
>>>>
>>> 'Fraid the Python re implementers just decided not to do it that way.
>>>
>> Nor Perl.
>>
>> Probably what you want is re.findall(r"(\d)", "123456"), which returns 
>> a list of what it captured.
>>
>>
>> </div>
> Yes, you are right, but this way findall() captures only the 'top' group.
> What I really need is to capture nested and repeated patterns. Say, a
> <table> tag in HTML contains many <tr>, each <tr> contains many <td>, and
> the data in <td> is what I need, so I wrote the regex like this:
>    regx ='''
>              <table.*\n
>               (
>               (\s*<tr.*\n
>                    (\s*<td.*</td>\n|\n)*
>                \s*</tr>\n
>               |\n)*
>               )
>               \s*</table>
>                '''
> Steve Holden wrote:
>> I can see enough potential implementation
>> issues that I can perfectly understand why Python works the way it does,
>> so I'd be interested to know why it doesn't make sense to you, and what
>> you would prefer it to do.
>>   
> 
> As Steve said, if re really cannot do this kind of work, then I have to
> split the one-line regex up: capture <table> first, then loop to capture
> <tr>, then <td>, and so on. I do not like this way compared with the
> single clean regex above.
> 
Why not capture just the "<td>" entries?

If you want to know when it's starting a new table or row then how about:

     re.compile(r'(<table\b|<tr\b|<td[^<]*)')

and re.findall() or re.finditer()?

If what was captured starts with "<table" then it's the start of a new 
table; if what was captured starts with "<tr" then it's the start of a 
new row; if what was captured starts with "<td" then it's an entry.
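
To make that concrete, here is a rough sketch of how the tokens could be regrouped into tables, rows and cells with re.finditer(). The function name parse_tables and the sample HTML are made up for illustration, and it assumes simple markup with no "<" inside a cell and no ">" inside an attribute value. For real-world HTML a proper parser is probably the safer choice.

```python
import re

# Same pattern as suggested above: the start of a table, the start of a row,
# or a cell together with its text up to the next "<".
TOKEN = re.compile(r'(<table\b|<tr\b|<td[^<]*)')

def parse_tables(html):
    """Return a list of tables; each table is a list of rows of cell text.

    Hypothetical helper, not part of the re module.
    """
    tables = []
    for match in TOKEN.finditer(html):
        token = match.group(1)
        if token.startswith('<table'):
            tables.append([])            # start of a new table
        elif token.startswith('<tr'):
            if tables:
                tables[-1].append([])    # start of a new row
        elif tables and tables[-1]:
            # token looks like '<td ...>text'; keep only the text
            cell = token.partition('>')[2].strip()
            tables[-1][-1].append(cell)
    return tables

sample = """<table border="1">
 <tr>
  <td>a</td>
  <td>b</td>
 </tr>
 <tr>
  <td class="x">c</td>
 </tr>
</table>"""

print(parse_tables(sample))
```

The nesting is rebuilt by the loop rather than by the regex itself, so it does not matter that a repeated group only remembers its last match.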