Must be a bug in the re module [was: Why this result with the re module]

Tue Nov 2 21:41:55 EDT 2010

> From: John Bond <lists at asd-group.com>
> Subject: Re: Why this result with the re module
> To: "Yingjie Lan" <lanyjie at yahoo.com>
> Cc: python-list at python.org
> Date: Tuesday, November 2, 2010, 8:09 PM
> On 2/11/2010 12:19 PM, Yingjie Lan
> wrote:
> >> From: John Bond<lists at asd-group.com>
> >> Subject: Re: Why this result with the re module
> > Firstly, thanks a lot for your patient explanation.
> > this time I have understood all your points
> perfectly.
> >
> > Secondly, I'd like to clarify some of my points,
> which
> > did not get through because of my poor presentation.
> >
> > I suggested findall return a tuple of
> re.MatchObject(s),
> > with each MatchObject instance representing a match.
> > This is consistent with the re.match() function
> anyway.
> > And it will eliminate the need of returning tuples,
> > and it is much more precise and information rich.
> >
> > If that's not possible, and a tuple must be returned,
> > then the whole match (not just subgroups) should
> > always be included as the first element in the tuple,
> > as that's group(0) or '\0'. Less surprise would
> arise.
> >
> > Finally, it seems to me the algo for findall is
> WRONG.
> >
> > To re.findall('(.a.)*', 'Mary has a lamb'),
> > by reason of greediness of '*', and the requirement
> > of non-overlapping, it should go like this
> > (suppose an '' is at the beginning and at the end,
> > and between two consecutive characters there is
> > one and only one empty string ''. To show the
> > match of empty strings clearly,
> > I am concatenating each repeated match below):
> >
> > Steps for re.findall('(.a.)*', 'Mary has a lamb'):
> >
> > step 1: Match '' + 'Mar' + '' (gready!)
> > step 2: skip 'y'
> > step 3: Match ''
> > step 4: skip ' '
> > step 5: Match ''+'has'+' a '+'lam'+'' (greedy!)
> > step 6: skip 'b'
> > step 7: Match ''
> >
> > So there should be exactly 4 matches in total:
> >
> > 'Mar', '', 'has a lam', ''
> >
> > Also, the matches above shows
> > that if a repeated subgroup only captures
> > the last match, the subgroup (.a.)*
> > should always capture '' here (see steps
> > 1, 3, 5, 7) above.
> >
> > Yet the execution in Python results in 6 matches!
> > And, the capturing subgroup with repetition
> > sometimes got the wrong guy.
> >
> > So I believe the algorithm for findall must be WRONG.
> >
> > Regards,
> >
> > Yingjie
> At a guess, I'd say what is happening is something like
> this:
> 
> Steps for re.findall('(.a.)*', 'Mary has a lamb'):
> 
> step 1: Match 'Mar' at string index 0
> step 2: Match '' at string index 3 (before 'y')
> step 3: skip 'y'
> step 4: Match '' at string index 4 (before ' ')
> step 5: skip ' '
> step 6: Match 'has a lam' at string index 5
> step 7: Match '' at string index 14 (before 'b')
> step 8: skip 'b'
> step 9: Match '' at string index 15 (before EOS)
> <EOS reached>
> 
> 
> matches: ('Mar', '', '', 'has a lam', '', '')
> returns: ['Mar', '', '', 'lam', '', ''] (*)
> 
> (*) "has a " lost due to not being last repetition at that
> match point
> 
> Which seems about right to me! Greediness has nothing to do
> with it, except that it causes 'has a lam' to be matched in
> one match, instead of as three separate matches (of 'has', '
> a ' and 'lam').

Disagree in this case, where the whole regex 
matches an empty string. Greadiness will match
as much as possible. So it will also match
the empty strings between consecutive 
characters as much as possible, once
we have properly defined all the unique
empty strings. Because of greadiness, 
fewer matches should be found. In this
case, it should find only 4 matches 
(shown in my steps) instead of 6 matches
(shown in your steps).

Yingjie