how to get all repeated group with regular expression
MRAB
google at mrabarnett.plus.com
Sat Nov 22 10:22:28 EST 2008
scsoce wrote:
> MRAB wrote:
>> <div class="moz-text-flowed" style="font-family: -moz-fixed">Steve
>> Holden wrote:
>>> Please keep this on the list.
>>>
>>> scsoce wrote:
>>>> Steve Holden wrote:
>>>>> scsoce wrote:
>>>>>
>>>>>> say, when I try to search and match every char from variable length
>>>>>> string, such as string '123456', i tried re.findall( r'(\d)*,
>>>>>> '12346' )
>>>>>>
>>>>> I think you will find you missed a quote out there. Always better to
>>>>> copy and paste ...
>>>>>
>>>>>
>>>>>> , but only get '6', and the Python docs indeed say: "If a group is
>>>>>> contained in a part of the pattern that matched multiple times, the
>>>>>> last match is returned."
>>>>>>
>>>>> So use
>>>>>
>>>>> r'(\d*)'
>>>>>
>>>>> instead and then the group includes all the digits you match.
>>>>>
>>>>>
>>>>>> Is that because the regex engine cannot remember all the past
>>>>>> history? Is that in the nature of all regex engines, or only
>>>>>> Python's?
>>>>>>
>>>>> Different regex engines have different capabilities, so I can't
>>>>> speak to them all. If you wanted *all* the matches of *all* groups,
>>>>> how would you have them returned? As a list? That would make the
>>>>> case where there was only one match much trickier to handle. And
>>>>> what would you do with
>>>>>
>>>>> r'((\w)*\d)*'
>>>>>
>>>>> Also, what about named groups? I can see enough potential
>>>>> implementation issues that I can perfectly understand why Python
>>>>> works the way it does, so I'd be interested to know why it doesn't
>>>>> make sense to you, and what you would prefer it to do.
>>>>>
>>>>> regards
>>>>> Steve
>>>>>
>>>> Maybe my question was not clear. I want to capture every matched part
>>>> of a repeated pattern, not only the last one. Say, for the string
>>>> '123456', I want to be able to back-reference any one char, not only
>>>> the '6'. I know the example is very simple, so we could get the whole
>>>> string using a regex and then get every char using other Python
>>>> statements, but what if the pattern in the group is complex?
>>>> I also tested in Vim, and it can do the 'back reference':
>>>> == your text in vim:
>>>> 123456
>>>> == pattern:
>>>> :%s/\(\d\)*/$2
>>>> text will turn into:
>>>> 2
>>>>
>>> 'Fraid the Python re implementers just decided not to do it that way.
>>>
>> Nor Perl.
>>
>> Probably what you want is re.findall(r"(\d)", "123456"), which returns
>> a list of what it captured.
>>
>>
>> </div>
> Yes, you are right, but this way findall() captures only the 'top' group.
> What I really need to do is capture nested and repeated patterns. Say a
> <table> tag in HTML contains many <tr>, and each <tr> contains many <td>;
> the data in the <td> is what I need, so I wrote the regex like this:
> regx ='''
> <table.*\n
> (
> (\s*<tr.*\n
> (\s*<td.*</td>\n|\n)*
> \s*</tr>\n
> |\n)*
> )
> \s*</table>
> '''
> Steve Holden wrote:
>> I can see enough potential implementation
>> issues that I can perfectly understand why Python works the way it does,
>> so I'd be interested to know why it doesn't make sense to you, and what
>> you would prefer it to do.
>>
>
> As Steve said, if re really cannot do this kind of work, then I have to
> split the one-line regex up: capture <table> first, then loop to
> capture <tr>, then <td>, and so on. I do not like this way
> compared with the single clean regex above.
>
Why not capture just the "<td>" entries?
If you want to know when it's starting a new table or row then how about:
re.compile(r'(<table\b|<tr\b|<td[^<]*)')
and re.findall() or re.finditer()?
If what was captured starts with "<table" then it's the start of a new
table; if what was captured starts with "<tr" then it's the start of a
new row; if what was captured starts with "<td" then it's an entry.
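
A minimal sketch of that idea, assuming simple markup where each cell
holds plain text with no nested tags. The pattern is the one given above;
the names TOKEN, parse_tables and sample are made up for illustration.

```python
import re

# Pattern from the message above: the start of a table, the start of a
# row, or a "<td ...>" tag together with the text up to the next "<".
TOKEN = re.compile(r'(<table\b|<tr\b|<td[^<]*)')

def parse_tables(html):
    """Return a list of tables; each table is a list of rows of cell texts."""
    tables = []
    for match in TOKEN.finditer(html):
        token = match.group(1)
        if token.startswith('<table'):
            # Start of a new table.
            tables.append([])
        elif token.startswith('<tr'):
            # Start of a new row in the current table.
            if tables:
                tables[-1].append([])
        elif token.startswith('<td'):
            # An entry: drop the "<td ...>" part and keep the cell text.
            text = token.partition('>')[2].strip()
            if tables and tables[-1]:
                tables[-1][-1].append(text)
    return tables

# Hypothetical sample input, not taken from the thread.
sample = """
<table border="1">
  <tr>
    <td>a</td><td>b</td>
  </tr>
  <tr>
    <td class="x">c</td>
  </tr>
</table>
"""

print(parse_tables(sample))  # [[['a', 'b'], ['c']]]
```

The nesting is rebuilt from the order of the matches rather than from
nested groups, so one flat pattern is enough. Cells that contain other
tags would be cut off at the first "<" with this pattern.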