[New-bugs-announce] [issue7132] Regexp: capturing groups in repetitions

Wed Oct 14 22:08:16 CEST 2009

New submission from Philippe Verdy <verdy_p at wanadoo.fr>:

Currently, when capturing groups are used within repetitions, it is impossible to capture what they match 
individually within the list of matched repetitions.

E.g. the following regular expression:

(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)(?:\.(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)){3}

is a regexp that contains two capturing groups (\1 and \2), the second of which is repeated (3 times) to 
match an IPv4 address in dotted decimal format. We'd like to be able to get the individual matches 
for the second group.

For now, capturing groups don't record the full list of matches, but just keep the last occurrence of the 
capturing group (or just the first if the repetition is not greedy, which is not the case here because the 
repetition "{3}" is not followed by a "?"). So \1 will effectively return the first decimal component of the 
IPv4 address, but \2 will just return the last (fourth) decimal component.
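
The behavior described above can be reproduced with the standard re module. This is only a small 
illustration: the pattern is the one from this report, while the sample address "192.168.0.1" and the 
variable names are assumptions chosen for the example.

```python
import re

# The pattern from the report, split over two lines for readability only.
IPV4 = re.compile(
    r"(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)"
    r"(?:\.(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)){3}"
)

# Sample input chosen for illustration.
m = IPV4.match("192.168.0.1")

first = m.group(1)      # first decimal component
last = m.group(2)       # only the final repetition of group 2 is kept
span_last = m.span(2)   # positions of that final repetition only

print(first, last, span_last)
```

The second and third components ("168" and "0") take part in the match but cannot be read back from the 
match object.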

I'd like to have a compilation flag "R" that would indicate that capturing groups 
will not just return a single occurrence, but all occurrences of the same group. If this "R" flag is enabled, 
then:

- the Match.group(index) will not just return a single string but a list of strings, with as many occurrences 
as the number of effective repetitions of the same capturing group. The last element in that list will be the 
one returned by the current behavior

- the Match.start(index) and Match.end(index) will also both return a list of positions, those lists having 
the same length as the list returned by Match.group(index).

- for consistency, the returned values as lists of strings (instead of just single strings) will apply to all 
capturing groups, even if they are not repeated.
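
As a rough sketch of the result the three points above describe, the lists can be emulated today by 
applying the repeated part of the pattern by hand. Everything here is an assumption made for illustration: 
the helper name match_with_all_captures, the dict-based return value and the sample address are 
hypothetical, the "R" flag itself does not exist, and the loop does not backtrack between repetitions as a 
real regexp engine would. Note also that Python lists are indexed from 0.

```python
import re

# One decimal component, taken from the pattern in this report (non-capturing here).
OCTET = r"(?:0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)"
FIRST = re.compile(r"(%s)" % OCTET)        # plays the role of group 1
REPEATED = re.compile(r"\.(%s)" % OCTET)   # plays the role of the repeated group 2


def match_with_all_captures(text, repeats=3):
    """Hypothetical helper: return (groups, spans) dicts of lists, or None.

    groups[n] lists every string captured by group n, and spans[n] lists the
    matching (start, end) positions, in order of repetition.
    """
    m = FIRST.match(text)
    if m is None:
        return None
    groups = {1: [m.group(1)], 2: []}
    spans = {1: [m.span(1)], 2: []}
    pos = m.end()
    for _ in range(repeats):
        # Anchor each repetition where the previous one ended.
        m = REPEATED.match(text, pos)
        if m is None:
            return None
        groups[2].append(m.group(1))
        spans[2].append(m.span(1))
        pos = m.end()
    return groups, spans


groups, spans = match_with_all_captures("192.168.0.1")
print(groups[2])   # every occurrence of the repeated group
print(spans[2])    # one (start, end) pair per occurrence
```

The non-repeated group 1 is also returned as a one-element list, which mirrors the consistency rule in the 
last point.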

Effectively, with the same regexp above, we will be able to retrieve (and possibly substitute):

- the first decimal component of the IPv4 address with "{\1:1}" (or "{\1:}" or "{\1}" or "\1" as before), 
i.e. the 1st (and last) occurrence of capturing group 1, or in Match.group(1)[1], or between string positions 
Match.start(1)[1] and Match.end(1)[1] ;

- the second decimal component of the IPv4 address with "{\2:1}", i.e. the 1st occurrence of capturing group 
2, or in Match.group(2)[1], or between string positions Match.start(2)[1] and Match.end(2)[1] ;

- the third decimal component of the IPv4 address with "{\2:2}", i.e. the 2nd occurrence of capturing group 2, 
or in Match.group(2)[2], or between string positions Match.start(2)[2] and Match.end(2)[2] ;

- the fourth decimal component of the IPv4 address with "{\2:3}" (or "{\2:}" or "{\2}" or "\2"), i.e. the 3rd 
(and last) occurrence of capturing group 2, or in Match.group(2)[3], or between string positions 
Match.start(2)[3] and Match.end(2)[3] ;
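
The substitution references listed above could be resolved roughly as follows. This is only a sketch of 
the proposed notation, not an existing feature: the function name expand, the captures dict and its sample 
contents are assumptions, and the plain "\1" form is left out because the re module already handles it.

```python
import re

# Recognizes the proposed references {\G:N}, {\G:} and {\G}.
_REF = re.compile(r"\{\\(\d+)(?::(\d*))?\}")


def expand(template, captures):
    """Hypothetical helper: replace each reference with one captured occurrence.

    captures maps a group number to the list of its occurrences. N is 1-based
    as in the proposal; a missing or empty N selects the last occurrence.
    """
    def repl(m):
        occurrences = captures[int(m.group(1))]
        index = m.group(2)
        if not index:                       # {\G} or {\G:}
            return occurrences[-1]
        return occurrences[int(index) - 1]  # {\G:N}, converted to a 0-based index
    return _REF.sub(repl, template)


# Sample captures for the address 192.168.0.1 (assumed input).
captures = {1: ["192"], 2: ["168", "0", "1"]}
reversed_address = expand(r"{\2:3}.{\2:2}.{\2:1}.{\1}", captures)
print(reversed_address)
```

With the sample captures, the template writes the four components in reverse order.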

This should work with all repetition patterns (both greedy and not greedy, atomic or not, or possessive), in 
which the repeated pattern contains any capturing group.

This idea should also be submitted to the developers of the PCRE library (and of Perl, from which it originates, 
and PHP, where PCRE is also used), so that they adopt a similar behavior in their regular expressions.

If there's already a candidate syntax or compilation flag in those libraries, this syntax should be used for 
repeated capturing groups.

----------
components: Library (Lib)
messages: 94022
nosy: verdy_p
severity: normal
status: open
title: Regexp: capturing groups in repetitions
type: feature request
versions: Python 3.2

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7132>
_______________________________________