Why this result with the re module

John Bond lists at asd-group.com
Tue Nov 2 06:23:10 EDT 2010


On 2/11/2010 8:53 AM, Yingjie Lan wrote:
>
> BUT, but.
>
> 1. I expected findall to find matches of the whole
> regex '(.a.)+', not just the subgroup (.a.) from
> >>> re.findall('(.a.)+', 'Mary has a lamb')
> Thus it is probably a misunderstanding/bug??

Again, as soon as you put a capturing group in your expression, you 
change the nature of what findall returns as described in the 
documentation. It then returns what gets assigned to each capturing 
group, not what chunk of text was matched by the whole expression at 
each matching point in the string.
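To make that concrete, here is a small sketch of the three return shapes the documentation describes for findall. The patterns '.a.' and '(.)a(.)' are my own additions for illustration; only '(.a.)+' comes from your message.

```python
import re

text = 'Mary has a lamb'

# No capturing group: each item is the text matched by the whole pattern.
no_group = re.findall(r'.a.', text)

# One capturing group: each item is what that group captured, not the
# whole match.
one_group = re.findall(r'(.a.)+', text)

# Several capturing groups: each item is a tuple, one entry per group.
two_groups = re.findall(r'(.)a(.)', text)

print(no_group)    # ['Mar', 'has', ' a ', 'lam']
print(one_group)   # ['Mar', 'lam']
print(two_groups)  # [('M', 'r'), ('h', 's'), (' ', ' '), ('l', 'm')]
```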

A capturing group returns what was matched by the regex fragment *inside 
it*. If you put repetition *outside it* (as you have - "(.a.)+") that 
doesn't change but, if the repetition clause results in it being matched 
multiple times, only the last match is returned as the capturing group's 
single allowed return value.

I find that strange, and limiting (why not return a list of all matches 
caused by the repetition?) but that's the way it is.
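The "last repetition wins" rule is easy to see on a single match object. The sketch below also shows one possible workaround for getting every repetition: find each whole match, then re-scan it with the inner fragment. The helper name all_repetitions is hypothetical, and it assumes the inner fragment contains no capturing groups of its own and that re-scanning splits the text the same way the repetition did (true for a fixed-width fragment like '.a.', not guaranteed in general).

```python
import re

text = 'Mary has a lamb'

# A repeated capturing group keeps only its final repetition.
m = re.search(r'(.a.)+', 'has a lamb')
whole_match = m.group(0)   # text matched by the entire pattern
last_capture = m.group(1)  # only the last '.a.' repetition survives


def all_repetitions(inner, s):
    """Hypothetical helper: list every repetition of `inner` per match.

    Wraps `inner` in a non-capturing repeated group, then re-scans each
    whole match with `inner` alone.
    """
    outer = '(?:%s)+' % inner
    return [re.findall(inner, found.group(0))
            for found in re.finditer(outer, s)]


repetitions = all_repetitions(r'.a.', text)

print(whole_match)   # has a lam
print(last_capture)  # lam
print(repetitions)   # [['Mar'], ['has', ' a ', 'lam']]
```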

Have you read the "Regular Expression HOWTO" in the docs? It explains 
all this stuff.

> 2. Here is a statement from the documentation on
>     non-capturing groups:
>     see http://docs.python.org/dev/howto/regex.html
>
> "Except for the fact that you can’t retrieve the
> contents of what the group matched, a non-capturing
> group behaves exactly the same as a capturing group; "

In terms of how the regular expression works when matching text, which 
is what the above is addressing, that's true.  In terms of how the 
results are returned to API callers, it isn't true.
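A quick way to check both halves of that statement is to compare where the two forms match (using finditer, which yields match objects) with what findall hands back. This is only a sketch using your own pattern and its non-capturing twin:

```python
import re

text = 'Mary has a lamb'

# Where each form matches: the spans are identical.
capturing_spans = [m.span() for m in re.finditer(r'(.a.)+', text)]
non_capturing_spans = [m.span() for m in re.finditer(r'(?:.a.)+', text)]

# What findall returns: this is where the two forms differ.
capturing_result = re.findall(r'(.a.)+', text)
non_capturing_result = re.findall(r'(?:.a.)+', text)

print(capturing_spans)       # [(0, 3), (5, 14)]
print(non_capturing_spans)   # [(0, 3), (5, 14)]
print(capturing_result)      # ['Mar', 'lam']
print(non_capturing_result)  # ['Mar', 'has a lam']
```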

>     Thus, I'm again confused, despite your
>     previous explanation. This might be a better
>     explanation: when a subgroup is repeated, it
>     only captures the last repetition.

That's true, but it's not related to the above.

> 3. It would be convenient to have '(*...)' for
>     non-capturing groups -- but of course, that's
>     only a remote suggestion.

Fair enough - each to their own preferences.

> 4. By reason of greediness of '*', and the concept
> of non-overlapping, it should go like this for
>     re.findall('((.a.)*)', 'Mary has a lamb')
>
> step 1: Match 'Mar' + '' (greedy!)
> step 2: skip 'y'
> step 3: Match ''
> step 4: skip ' '
> step 5: Match ''+'has'+' a '+'lam'+'' (greedy!)
> step 7: skip 'b'
> step 8: Match ''
>
> So there should be 4 matches in total:
>
> 'Mar', '', 'has a lam', ''
>
> Also, if a repeated subgroup only captures
> the last repetition, the repeated
> subgroup (.a.)* should always be ''.
>
> Yet the execution in Python results in 6 matches.
>
> .....

All you have done is wrap one of your earlier regexes, '(.a.)*', 
in another, outer capturing group, to make '((.a.)*)'. This doesn't 
change what is actually matched, so there are still the same six matches 
found. However it does change what is *returned* - you now have two 
capturing groups that findall has to return information about (at each 
match), so you will see that it returns six tuples (each with two items - 
one for each capturing group) instead of six strings, i.e.:

re.findall('(.a.)*', 'Mary has a lamb')

['Mar', '', '', 'lam', '', '']

becomes:

re.findall('((.a.)*)', 'Mary has a lamb')

[('Mar', 'Mar'), ('', ''), ('', ''), ('has a lam', 'lam'), ('', ''), 
('', '')]

As you can see, the top set of results appears in the bottom set (as the 
second item in each tuple, because the original capturing group is the 
second one now - the new, outer one is the first).

If you look at the fourth tuple, ('has a lam', 'lam'), you can see the 
"capturing group with repetition only returns the last match" rule in 
action. The inner capturing group (which has repetition) returns 'lam' 
because that was the last occurrence of ".a." in the three ("has", 
" a ", "lam") that it matched that time. However the outer capturing 
group, which doesn't have repetition, returns the whole thing 
('has a lam').
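If you want to see where the six matches fall, finditer reports the span of each one. The sketch below reruns both patterns and checks that the inner group's results are just the second item of each tuple. The variable names are mine, and the spans in the comments are what I would expect from the matching rules described above.

```python
import re

text = 'Mary has a lamb'

inner_only = re.findall(r'(.a.)*', text)
nested = re.findall(r'((.a.)*)', text)

# Start and end offsets of each of the six matches; a zero-width match
# has equal start and end.
spans = [m.span() for m in re.finditer(r'((.a.)*)', text)]

# The original (now inner) group is the second item of every tuple.
second_items = [pair[1] for pair in nested]

print(spans)
# [(0, 3), (3, 3), (4, 4), (5, 14), (14, 14), (15, 15)]
print(second_items == inner_only)  # True
```

One detail worth knowing: on a match object, a group that took no part in the match gives None, whereas findall reports it as ''.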

> Finally, The name findall implies all matches
> should be returned, whether there are subgroups in
> the pattern or not. It might be best to return all
> the match objects (like a re.match call) instead
> of the matched strings. Then there is no need
> to return tuples of subgroups. Even if tuples
> of subgroups were to be returned, group(0) must
> also be included in the returned tuple.
>
> Regards,
>
> Yingjie
>

All matches are returned by findall, so I don't understand that.
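If what you want is a match object for every match, with group(0) always available alongside the subgroups, re.finditer already provides that. A minimal sketch with your pattern:

```python
import re

text = 'Mary has a lamb'

# finditer yields one match object per match, so the whole match
# (group 0) and the captured group (group 1) are both available.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'(.a.)+', text)]

print(pairs)  # [('Mar', 'Mar'), ('has a lam', 'lam')]
```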

I really do suggest that you read the above-mentioned HOWTO, or any of 
the numerous tutorials on the net. Regexes are hard to get your head 
around at first, not helped by a few puzzling API design choices, but 
it's worth the effort, and those will be far more useful than lots of 
typed explanations here.

