how to get all repeated group with regular expression
scsoce
scsoce at gmail.com
Fri Nov 21 21:12:31 EST 2008
MRAB wrote:
> Steve Holden wrote:
>> Please keep this on the list.
>>
>> scsoce wrote:
>>> Steve Holden wrote:
>>>> scsoce wrote:
>>>>
>>>>> say, when I try to search and match every char from a variable-length
>>>>> string, such as the string '123456', I tried re.findall( r'(\d)*,
>>>>> '12346' )
>>>>>
>>>> I think you will find you missed a quote out there. Always better to
>>>> copy and paste ...
>>>>
>>>>
>>>>> , but only got '6', and the Python docs indeed say: "If a group is
>>>>> contained in a part of the pattern that matched multiple times, the
>>>>> last match is returned."
>>>>>
>>>> So use
>>>>
>>>> r'(\d*)'
>>>>
>>>> instead and then the group includes all the digits you match.
>>>>
>>>>
>>>>> Is that because the regex engine cannot remember all the past history?
>>>>> Is it in the nature of all regex engines, or only Python's?
>>>>>
>>>> Different regex engines have different capabilities, so I can't speak to
>>>> them all. If you wanted *all* the matches of *all* groups, how would you
>>>> have them returned? As a list? That would make the case where there was
>>>> only one match much trickier to handle. And what would you do with
>>>>
>>>> r'((\w)*\d)*'
>>>>
>>>> Also, what about named groups? I can see enough potential implementation
>>>> issues that I can perfectly understand why Python works the way it does,
>>>> so I'd be interested to know why it doesn't make sense to you, and what
>>>> you would prefer it to do.
>>>>
>>>> regards
>>>> Steve
>>>>
>>> Maybe my explanation was not clear. I want to capture every matched part
>>> of a repeated pattern, not only the last. Say, for the string '123456', I
>>> want to be able to back-reference any one char, not only the '6'. I know
>>> the example is very simple, so we can get the whole string using a regex
>>> and get every char using other Python statements, but what if the pattern
>>> in the group is complex?
>>> I tested in Vim, and it can do the 'back reference':
>>> == your text in Vim:
>>> 123456
>>> == pattern:
>>> :%s/\(\d\)*/$2
>>> the text will turn into:
>>> 2
>>>
>> 'Fraid the Python re implementers just decided not to do it that way.
>>
> Nor Perl.
>
> Probably what you want is re.findall(r"(\d)", "123456"), which returns
> a list of what it captured.
>
>
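For reference, a minimal sketch of the three patterns discussed in the quoted text, run against the same string. The variable names are only illustrative:

```python
import re

text = "123456"

# A repeated group keeps only its last repetition; the trailing ''
# comes from the empty match at the end of the string.
last_only = re.findall(r"(\d)*", text)

# With the repetition removed, findall() scans the string itself and
# reports every single-digit match.
each_digit = re.findall(r"(\d)", text)

# With the repetition moved inside the group, the group captures the
# whole run of digits at once.
whole_run = re.findall(r"(\d*)", text)

print(last_only)   # ['6', '']
print(each_digit)  # ['1', '2', '3', '4', '5', '6']
print(whole_run)   # ['123456', '']
```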
Yes, you are right, but this way findall() captures only the 'top' group.
What I really need to do is capture nested and repeated patterns. Say a
<table> tag in HTML contains many <tr>, and each <tr> contains many <td>;
the data in the <td> is what I need, so I wrote the regex like this:
regx = r'''
<table.*\n
(
(\s*<tr.*\n
(\s*<td.*</td>\n|\n)*
\s*</tr>\n
|\n)*
)
\s*</table>
'''
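A minimal sketch of what Python does with a pattern like this, assuming it is written as a raw string and used with re.VERBOSE (the message does not say which flags were used). The sample HTML is made up for illustration:

```python
import re

# The pattern from the message, as a raw string for use with re.VERBOSE
# (both are assumptions; whitespace in the pattern is then ignored).
regx = r'''
<table.*\n
(
(\s*<tr.*\n
(\s*<td.*</td>\n|\n)*
\s*</tr>\n
|\n)*
)
\s*</table>
'''

# Made-up sample input: two rows, three cells in total.
html = (
    "<table>\n"
    "  <tr>\n"
    "    <td>a</td>\n"
    "    <td>b</td>\n"
    "  </tr>\n"
    "  <tr>\n"
    "    <td>c</td>\n"
    "  </tr>\n"
    "</table>\n"
)

m = re.search(regx, html, re.VERBOSE)

# Group 3 is the repeated <td> group: only its last repetition is kept.
print(m.group(3).strip())  # <td>c</td>
```

The whole table matches, and group(1) contains all the rows, but the inner repeated groups report only their final repetition, so the 'a' and 'b' cells cannot be read back from the match object.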
Steve Holden wrote:
> I can see enough potential implementation
> issues that I can perfectly understand why Python works the way it does,
> so I'd be interested to know why it doesn't make sense to you, and what
> you would prefer it to do.
>
As Steve said, if re really cannot do this kind of work, then I have to
split the one-line regex up: capture <table> first, then loop to capture
each <tr>, then each <td>, and so on. I do not like this approach
compared with the single clean regex above.
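
A rough sketch of that step-by-step approach, with one findall() call per nesting level. The sample HTML and the simplified tag patterns are assumptions for illustration; real HTML (nested tables, attributes containing '>') would need more care:

```python
import re

# Made-up sample input: two rows, three cells in total.
html = """<table>
  <tr>
    <td>a</td>
    <td>b</td>
  </tr>
  <tr>
    <td>c</td>
  </tr>
</table>
"""

rows = []
# Capture each <table> body first, then each <tr> inside it, then the <td>s.
for table in re.findall(r"<table[^>]*>(.*?)</table>", html, re.DOTALL):
    for tr in re.findall(r"<tr[^>]*>(.*?)</tr>", table, re.DOTALL):
        rows.append(re.findall(r"<td[^>]*>(.*?)</td>", tr, re.DOTALL))

print(rows)  # [['a', 'b'], ['c']]
```

Each level's pattern stays short, and the nesting of the loops mirrors the nesting of the tags.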