[Python-Dev] Behavior of matching backreferences

Tim Peters tim.one@comcast.net
Sun, 23 Jun 2002 14:28:53 -0400


[Gustavo Niemeyer, on the behavior of
    re.compile("^(?P<a>a)?(?P=a)$").match("ebc").groups()
]

Python and Perl work exactly the same way for the equivalent (but spellable
in Perl) regexp

    ^(a)?\1$

matching the two strings

    a
and
    aa

and nothing else.  That's what I expected.  You didn't give a concrete
example of what you think it should do instead.  It may have been your
intent to say that you believe the regexp *should* match the string

    ebc

but you didn't really say so one way or the other.  Regardless, neither
Python nor Perl matches ebc in this case, and that's intended.
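
For anyone who wants to check this against the interpreter at hand, here is a minimal sketch using the named-group spelling from the original question. The helper name try_match is made up for illustration; it is not part of the re module.

```python
import re

# The named-group spelling of the pattern from the original question.
PAT = re.compile(r"^(?P<a>a)?(?P=a)$")

def try_match(text):
    """Return the groups() tuple if PAT matches text, else None.

    Hypothetical helper, for illustration only.
    """
    m = PAT.match(text)
    return m.groups() if m is not None else None

print(try_match("ebc"))  # None: the pattern does not match
print(try_match("aa"))   # ('a',): the group captures 'a', the backreference repeats it
```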

The Rule, in vague English, is that a backreference matches the same text as
was matched by the referenced group; if the referenced group didn't match
any text, then the backreference can't match either.  Note that whether the
referenced group matched any text is a different question than whether the
referenced group is *used* in the match.  This is a subtle point I suspect
you're missing.
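
The first half of The Rule can be sketched in a few lines. The pattern ^(a|b)\1$ and the helper name matches are my own choices for illustration, not something from the original question; the last check uses the regexp under discussion on the empty string, where there is no text for the group to match.

```python
import re

def matches(pattern, text):
    """True if the compiled pattern matches text from its start (hypothetical helper)."""
    return pattern.match(text) is not None

# \1 must repeat exactly the text that group 1 matched,
# not merely something the group *could* have matched.
same_twice = re.compile(r"^(a|b)\1$")
print(matches(same_twice, "aa"))  # True
print(matches(same_twice, "bb"))  # True
print(matches(same_twice, "ab"))  # False: group 1 matched 'a', so \1 needs 'a'

# On the empty string the group has no text to match,
# so the backreference can't match either.
group_then_ref = re.compile(r"^(a)?\1$")
print(matches(group_then_ref, ""))  # False
```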

> Otherwise the regular expression above will always fail if the first
> group fails,

Yes.

> even being optional

There's no such beast as "an optional group".  The

    ^(a)

part *must* match or the entire regexp fails, period, regardless of whether
or not backreferences appear later.  The question mark following doesn't
change this requirement.

    (a)?

says

    'a' must match
    but the overall pattern can choose to use this match or not

That's why the regexp as a whole matches the string

    a

The

    (a)

part does match 'a', the ? chooses not to use this match, and then the
backreference matches the 'a' that the first group matched.  Study the
output of this and it may be clearer:

import re
pat = re.compile(r"^((a)?)(\2)$")
for s in ('a', 'aa'):
    m = pat.match(s)
    # match() returns None on failure, so guard before calling groups()
    print(m and m.groups())


> ...
> while the regular expression above would match "aa" or "", but not "a".

As above, Python and Perl disagree with you:  they match "aa" and "a" but
not "".
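
Rather than argue about it, the candidate strings can be run through the regexp and the results read off. This is only a sketch: tabulate is a hypothetical helper name, and no expected output is listed because the point is to see what your own interpreter says.

```python
import re

def tabulate(pattern, candidates):
    """Map each candidate string to its groups() tuple, or None if no match.

    Hypothetical helper, for illustration only.
    """
    compiled = re.compile(pattern)
    results = {}
    for text in candidates:
        m = compiled.match(text)
        results[text] = m.groups() if m is not None else None
    return results

# The strings mentioned in this thread.
for text, groups in tabulate(r"^(a)?\1$", ["", "a", "aa", "ebc"]).items():
    print(repr(text), "->", groups)
```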

> ...
> My intentions and the issue are clear enough.

Sorry, your intentions weren't clear to me.  The issue is, though <wink>.