Regular expression (re) anomaly with group names and alternatives

Tim Peters tim_one at email.msn.com
Thu Jul 8 23:24:06 EDT 1999


[Bob Alexander]
> It would be nice (IMO) if group names could be duplicated in different
> alternatives (sections separated by the | operator) of a regular
> expression, with the name taking on the value in the alternative that
> matches (if any).  However, that doesn't work -- is that a bug or by
> design?

I believe it's a bug, but for a different reason <wink>:  the re compiler
should raise an exception when a group name is repeated.  Group names are
simply symbolic names for numbered groups:

>>> import re
>>> p = re.compile("(?P<a>a1)(?P<b>b)(?P<a>a3)")
>>> p.groupindex
{'b': 2, 'a': 3}
>>>

"a" started life as a symbolic name for group 1, but the re compiler
silently changed it to a symbolic name for group 3 (forgetting about group
1) when it saw the second occurrence of the name.  Since this is surprising
(and, I believe, also an accident), it should complain.
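
To make the name-to-number mapping concrete, here is a minimal sketch.  The
third group name "c" is a stand-in chosen for illustration and is not in the
original pattern.  The second half assumes a CPython release newer than the
one discussed in this message; those that I am aware of reject a repeated
name with re.error, which is the behavior argued for above.

```python
import re

# Distinct names: each name is only an alias for a group number.
# ("c" is a hypothetical replacement for the repeated "a".)
p = re.compile(r"(?P<a>a1)(?P<b>b)(?P<c>a3)")
index = dict(p.groupindex)          # name -> group number

m = p.match("a1ba3")
same = m.group("a") == m.group(1)   # lookup by name or by number agrees

# Repeated name: assumed to be rejected at compile time on later
# CPython releases rather than silently rebinding "a" to group 3.
try:
    re.compile(r"(?P<a>a1)(?P<b>b)(?P<a>a3)")
    duplicate_rejected = False
except re.error:
    duplicate_rejected = True
```

If the compiler accepts the repeated name instead, duplicate_rejected stays
False and p.groupindex would show the silent rebinding described above.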

Note that, as the specific pattern above illustrates, there's no general
sense to be made of a repeated name.  If that pattern
matched, would you want group "a" to refer to string "a1" because that
matched "first"?  To "a3" because that matched "last"?  To their catenation?
To a list of matching substrings?

When there's not a *clear* meaning for something, Python generally declines
to make one up.
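
For the narrower case in the original question (one name per alternative), a
possible workaround is to give each alternative its own name and then ask the
match object which one fired.  This is only a sketch: the names "num" and
"word", the pattern, and the helper matched_value are all hypothetical and
do not come from this thread.

```python
import re

# Each alternative gets a distinct (hypothetical) group name.
p = re.compile(r"(?P<num>\d+)|(?P<word>[a-z]+)")

def matched_value(m):
    """Return (name, text) for the named alternative that matched."""
    name = m.lastgroup          # name of the last matched capturing group
    return name, m.group(name)

r1 = matched_value(p.match("123"))
r2 = matched_value(p.match("abc"))
```

Because exactly one alternative can match here, there is no ambiguity about
which substring the caller gets, which is the property a repeated name lacks
in the general case.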

not-that-there's-a-clear-meaning-for-any-regexp<wink>-ly y'rs  - tim





