Understanding '?' in regular expressions

Fri Nov 16 03:15:24 EST 2012

On Fri, Nov 16, 2012 at 12:28 AM,  <krishna.k.kishor3 at gmail.com> wrote:
> Can someone explain the below behavior please?
>
> >>> re1 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
> >>> re.findall(re1,'1000,1020,1000')
> ['1000']
> >>> re.findall(re1,'1000,1020, 1000')
> ['1020', '1000']

Try removing the grouping parentheses to see the full strings being matched:

>>> re1 = re.compile(r'(?:(?:1000|1010|1020)[ ]*?[\,]?[ ]*?){1,3}')
>>> re.findall(re1,'1000,1020,1000')
['1000,1020,1000']
>>> re.findall(re1,'1000,1020, 1000')
['1000,1020,', '1000']

In the first case, the regular expression matches the full string.
It could also match shorter substrings, but since only the space
quantifiers are non-greedy and there are no spaces to match anyway, it
does not.  With the grouping parentheses in place, only the *last*
value of the group is returned, which is why you only see the last
'1000' instead of all three strings in the group, even though the
group is actually matching three different substrings.
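
To see the whole match and the group's final value side by side, one
option is to iterate with re.finditer instead of calling re.findall.
This is only a sketch using the pattern from the question; the names
"pattern", "text" and "results" are mine, chosen for illustration.

```python
import re

# Pattern from the question, with the capturing group around the number.
pattern = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
text = '1000,1020,1000'

# finditer yields match objects, so both the full match (group 0) and the
# last value captured by the repeated group (group 1) are visible.
results = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(results)  # [('1000,1020,1000', '1000')]
```

There is a single match covering the whole string, and group 1 holds
only the value from the final repetition.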

In the second case, the regular expression first finds '1000,1020,'
and then '1000' as two separate matches.  The reason for this is
the space.  Since the quantifier on the space is non-greedy, it first
tries *not* matching the space.  The third repetition then fails on
the space, but two repetitions already satisfy {1,3}, so it has a
valid match and does not backtrack.  The '1000' is then identified as
a separate match.  As before, with the grouping parentheses in place
you see only the '1020' and the last '1000' because the group only
reports the last substring it matched for that particular match.
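
The spans of the two matches make the role of the space easier to see.
The sketch below uses the non-capturing pattern from above; the
"greedy_spaces" variant, with [ ]* in place of [ ]*?, is my own
addition for comparison and is not something from the original post.

```python
import re

# Non-capturing version of the pattern, with lazy space quantifiers.
lazy_spaces = re.compile(r'(?:(?:1000|1010|1020)[ ]*?[\,]?[ ]*?){1,3}')
# Hypothetical variant for comparison: the same pattern with greedy spaces.
greedy_spaces = re.compile(r'(?:(?:1000|1010|1020)[ ]*[\,]?[ ]*){1,3}')
text = '1000,1020, 1000'

# The first match stops just before the space at index 10.
lazy_result = [(m.span(), m.group(0)) for m in lazy_spaces.finditer(text)]
print(lazy_result)    # [((0, 10), '1000,1020,'), ((11, 15), '1000')]

# With greedy spaces the space is consumed and all three values are one match.
greedy_result = [m.group(0) for m in greedy_spaces.finditer(text)]
print(greedy_result)  # ['1000,1020, 1000']
```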

> However when I use "[\,]??" instead of "[\,]?" as below, I see a different result
> >>> re2 = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
> >>> re.findall(re2,'1000,1020,1000')
> ['1000', '1020', '1000']
>
> I am not able to understand what's causing the difference of behavior here, I am assuming it's not 'greediness' of "?"

The difference is the non-greediness of the comma quantifier.  When it
comes time for it to match the comma, because the quantifier is
non-greedy, it first tries *not* matching the comma, whereas before it
first tried to match it.  As with the space above, when the comma is
not matched, it finds that it has a valid match anyway if it just
stops matching immediately.  So it does not need to backtrack, and in
this case it ends up terminating each match early upon the comma and
returning all three numbers as separate matches.
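
Comparing the full matches for the two patterns side by side shows the
effect.  Again this is just a sketch; the names "greedy_comma" and
"lazy_comma" are mine.

```python
import re

# The two patterns from the question; they differ only in [\,]? versus [\,]??
greedy_comma = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]?[ ]*?){1,3}')
lazy_comma = re.compile(r'(?:((?:1000|1010|1020))[ ]*?[\,]??[ ]*?){1,3}')
text = '1000,1020,1000'

# group(0) is the whole match, independent of the capturing group.
greedy_matches = [m.group(0) for m in greedy_comma.finditer(text)]
lazy_matches = [m.group(0) for m in lazy_comma.finditer(text)]
print(greedy_matches)  # ['1000,1020,1000']
print(lazy_matches)    # ['1000', '1020', '1000']
```

With the lazy comma, each match ends right after a number, and the
commas are skipped over between matches rather than consumed.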

What exactly is it that you're trying to do with this regular
expression?  I suspect that the solution is actually a lot simpler
than you're making it.