Why this result with the re module

Yingjie Lan lanyjie at yahoo.com
Tue Nov 2 04:53:17 EDT 2010


> From: John Bond <lists at asd-group.com>

> You might wonder why something that can match no input
> text, doesn't return an infinite number of those matches at
> every possible position, but they would be overlapping, and
> findall explicitly says matches have to be non-overlapping.

That scratched my itch, though the notion of overlapping
empty strings is quite interesting in itself. Obviously
we have to assume there is one and only one empty string
between two consecutive characters.

Now I slightly modified my regex, and it suddenly looks
self-explanatory:
 
>>> re.findall('((.a.)+)', 'Mary has a lamb')
[('Mar', 'Mar'), ('has a lam', 'lam')]
>>> re.findall('((.a.)*)', 'Mary has a lamb')
[('Mar', 'Mar'), ('', ''), ('', ''), ('has a lam', 'lam'), ('', ''), ('', '')]

BUT, but.

1. I expected findall to return matches of the whole
regex '(.a.)+', not just the subgroup (.a.), from
>>> re.findall('(.a.)+', 'Mary has a lamb')
So this is probably either a misunderstanding on my part or a bug?

2. Here is a statement from the documentation on
   non-capturing groups:
   see http://docs.python.org/dev/howto/regex.html

"Except for the fact that you can’t retrieve the 
contents of what the group matched, a non-capturing 
group behaves exactly the same as a capturing group; "

   Thus, I'm again confused, despite your
   previous explanation. This might be a better
   explanation: when a subgroup is repeated, it
   only captures the last repetition.

3. It would be convenient to have '(*...)' for 
   non-capturing groups -- but of course, that's
   only a remote suggestion.

4. Given the greediness of '*' and the concept
   of non-overlapping, it should go like this for
   re.findall('((.a.)*)', 'Mary has a lamb')

step 1: Match 'Mar' + '' (greedy!)
step 2: skip 'y'
step 3: Match ''
step 4: skip ' '
step 5: Match ''+'has'+' a '+'lam'+'' (greedy!)
step 6: skip 'b'
step 7: Match ''

So there should be 4 matches in total:

'Mar', '', 'has a lam', ''

Also, if a repeated subgroup only captures 
the last repetition, the repeated 
subgroup (.a.)* should always be ''.

Yet the execution in Python results in 6 matches.
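
One way to check the step-by-step trace above is to ask for the
match positions rather than the matched strings. The sketch below
uses re.finditer from the standard re module; the variable names
(text, matches, spans) are only illustrative. It prints the span of
each match together with both groups, so the location of every
empty match becomes visible.

```python
import re

text = 'Mary has a lamb'

# finditer yields match objects, so each match's position is available.
matches = list(re.finditer('((.a.)*)', text))
spans = [m.span() for m in matches]

for m in matches:
    # group(1) is the whole repeated run; group(2) holds only the last
    # repetition of (.a.), or None when the group did not take part.
    print(m.span(), repr(m.group(1)), repr(m.group(2)))
```

On a current CPython this should report empty matches at offsets 3
and 4 (before 'y' and before the first space) and at 14 and 15
(before 'b' and at the end of the string), which would account for
six matches rather than four. Group 2 is 'Mar' and 'lam' for the two
non-empty matches and None for the empty ones; findall reports that
None as ''.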

Here is the documentation of re.findall:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.
    
    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.
    
    Empty matches are included in the result.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Thus from
>>> re.findall('(.a.)*', 'Mary has a lamb')
I should get this result 
[('',), ('',), ('',), ('',)]
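
As a quick check of that expectation, and of point 1, the following
sketch compares what findall returns for one capturing group and for
the non-capturing form (?:...). The variable names are only
illustrative; the patterns and the input string are the ones used in
this message.

```python
import re

text = 'Mary has a lamb'

# One capturing group: findall returns a list of strings (the text of
# that group), not 1-tuples. A group that took no part comes back as ''.
star_groups = re.findall('(.a.)*', text)
plus_groups = re.findall('(.a.)+', text)

# A non-capturing group leaves the pattern without groups, so findall
# returns the whole match instead.
plus_whole = re.findall('(?:.a.)+', text)

print(star_groups)
print(plus_groups)
print(plus_whole)
```

With a single group the result is a flat list of strings, and each
entry is the last repetition of (.a.) for that match. Switching to
(?:.a.)+ is one way to get the whole-regex matches that point 1
was asking for.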


Finally, the name findall implies all matches
should be returned, whether there are subgroups in 
the pattern or not. It might be best to return all
the match objects (like a re.match call) instead 
of the matched strings. Then there is no need
to return tuples of subgroups. Even if tuples 
of subgroups were to be returned, group(0) must
also be included in the returned tuple.
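
For what it is worth, something close to this can already be put
together with re.finditer, which yields match objects. The sketch
below builds tuples that start with group(0) followed by the
subgroups; the name rows is only illustrative and this is not how
findall itself is implemented.

```python
import re

text = 'Mary has a lamb'

# Each row is (whole match, subgroup 1, ...), built from match objects.
rows = [(m.group(0),) + m.groups() for m in re.finditer('(.a.)+', text)]

print(rows)
```

Here every row carries the whole match first, so nothing is lost
when the pattern contains capturing groups.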

Regards,

Yingjie 


      


