how to get all repeated group with regular expression

Sat Nov 22 16:46:36 EST 2008

On Fri, Nov 21, 2008 at 9:12 PM, scsoce <scsoce at gmail.com> wrote:

> MRAB wrote:
>
>> <div class="moz-text-flowed" style="font-family: -moz-fixed">Steve Holden
>> wrote:
>>
>>> Please keep this on the list.
>>>
>>> scsoce wrote:
>>>
>>>> Steve Holden wrote:
>>>>
>>>>> scsoce wrote:
>>>>>
>>>>>
>>>>>> Say I try to search and match every char from a variable-length
>>>>>> string, such as the string '123456'. I tried re.findall( r'(\d)*, '12346'
>>>>>> )
>>>>>>
>>>>>>
>>>>> I think you will find you missed a quote out there. Always better to
>>>>> copy and paste ...
>>>>>
>>>>>
>>>>>
>>>>>> , but only got '6', and the Python docs indeed say: "If a group is
>>>>>> contained in a part of the pattern that matched multiple times, the
>>>>>> last match is returned."
>>>>>>
>>>>>>
>>>>> So use
>>>>>
>>>>>    r'(\d*)'
>>>>>
>>>>> instead and then the group includes all the digits you match.
>>>>>
>>>>>
>>>>>
>>>>>> Is that because the regex engine cannot remember all the past history?
>>>>>> Is that the nature of all regex engines, or only Python's?
>>>>>>
>>>>>>
>>>>> Different regex engines have different capabilities, so I can't speak
>>>>> to them all. If you wanted *all* the matches of *all* groups, how would
>>>>> you have them returned? As a list? That would make the case where there
>>>>> was only one match much trickier to handle. And what would you do with
>>>>>
>>>>>  r'((\w)*\d)*'
>>>>>
>>>>> Also, what about named groups? I can see enough potential
>>>>> implementation issues that I can perfectly understand why Python works
>>>>> the way it does, so I'd be interested to know why it doesn't make sense
>>>>> to you, and what you would prefer it to do.
>>>>>
>>>>> regards
>>>>>  Steve
>>>>>
>>>>>
>>>> Maybe my explanation was not clear. I want to capture every matched part
>>>> of a repeated pattern, not only the last. Say, for the string '123456', I
>>>> want to be able to back-reference any one char, not only the '6'. I know
>>>> the example is very simple, so we can get the whole string using the regex
>>>> and get every char using other Python statements, but what if the pattern
>>>> in the group is complex?
>>>> And I tested in Vim, where it can do the 'back reference':
>>>> == your text in Vim:
>>>> 123456
>>>> == pattern:
>>>> :%s/\(\d\)*/$2
>>>> the text becomes:
>>>> 2
>>>>
>>> 'Fraid the Python re implementers just decided not to do it that way.
>>>
>>>  Nor Perl.
>>
>> Probably what you want is re.findall(r"(\d)", "123456"), which returns a
>> list of what it captured.
>>
>>
>> </div>
>>
> Yes, you are right, but this way findall() captures only the 'top' group.
> What I really need to do is capture nested and repeated patterns. Say a
> <table> tag in HTML contains many <tr>, each <tr> contains many <td>, and the
> data in <td> is what I need, so I wrote the regex like this:
>   regx ='''
>             <table.*\n
>              (
>              (\s*<tr.*\n
>                   (\s*<td.*</td>\n|\n)*
>               \s*</tr>\n
>              |\n)*
>              )
>              \s*</table>
>               '''
> Steve Holden wrote:
>
>> I can see enough potential implementation
>> issues that I can perfectly understand why Python works the way it does,
>> so I'd be interested to know why it doesn't make sense to you, and what
>> you would prefer it to do.
>>
>>
>
> As Steve said, if re really cannot do this kind of work, I have to
> split the one-line regex up: capture <table> first, then loop to
> capture <tr>, and then <td>, and so on. I do not like this way compared
> with the one clean regex above.
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
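
For reference, here is a minimal sketch of the behaviour discussed above, and of the "match the outer part, then loop over the inner part" approach. The sample strings and variable names are made up for illustration; only re.match and re.findall from the standard library are used.

```python
import re

text = "123456"

# A repeated capturing group keeps only its final repetition.
m = re.match(r"(\d)*", text)
last_only = m.group(1)

# Matching the inner pattern on its own yields every repetition.
digits = re.findall(r"\d", text)

# Nested repetition: match each outer unit first, then apply the
# inner pattern to each outer match (sample data is hypothetical).
rows_text = "a1b2;c3;d4e5f6"
nested = [re.findall(r"[a-z]\d", chunk)
          for chunk in re.findall(r"(?:[a-z]\d)+", rows_text)]

print(last_only)
print(digits)
print(nested)
```

The two-stage version is more verbose than a single pattern, but each stage returns a plain list, so nothing is lost to the "last match" rule.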

If you're parsing structured markup like HTML, why not use something meant
for that? I personally find BeautifulSoup (
http://www.crummy.com/software/BeautifulSoup/) to be very good at this. For
instance, here's a code snippet I recently used to pull out specific data
from a table in a site:

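# BeautifulSoup 3 import; with BeautifulSoup 4 use "from bs4 import BeautifulSoup"
from BeautifulSoup import BeautifulSoup

# some_page holds the HTML source of the page as a string
soup = BeautifulSoup(some_page)
# Collect the stripped text of every <font> tag in the cells of the target rows
opts = [fonttag.string.strip()
           for row in soup('tr', attrs={'class':'targetClass'})
           for cell in row('td')
           for fonttag in cell('font')
           if fonttag.string]  # skip <font> tags without a single text child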
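
If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module (Python 3). This is only an illustration under my own assumptions: the CellCollector class and the sample markup are hypothetical, and it handles only simple, non-nested tables.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Accumulate text only while inside a cell.
        if self._in_cell:
            self.rows[-1][-1] += data


# Made-up sample markup for the sketch.
sample = """
<table>
  <tr><td>a</td><td>b</td></tr>
  <tr><td> c </td></tr>
</table>
"""

parser = CellCollector()
parser.feed(sample)
rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)
```

The parser keeps track of the nesting for you, which is exactly the part that a single regex with repeated groups cannot report back.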