[Python-ideas] Re module repeat

Mon Aug 1 05:57:05 CEST 2011

On Mon, Aug 1, 2011 at 10:56 AM, Christopher King <g.nius.ck at gmail.com> wrote:
>
>
> On Sun, Jul 31, 2011 at 8:41 PM, Devin Jeanpierre <jeanpierreda at gmail.com>
> wrote:
>>
>> Could you elaborate on the change? I don't understand your
>> modification. The regex is a different one than the original, as well.
>
> What do you mean by elaborate on the change. You mean explain. I guess I
> could do it in more detail.

By "elaborate on the change", I expect Devin meant: give a more precise
description of the problem you're trying to solve, without the
confusing and irrelevant noise about named groups. Specifically:

>>> import re
>>> match=re.search('^([a-z])*$', 'abcz')
>>> match.groups()
('z',)

A repeated group retains only its last capture, hence the lone 'z'.
You're asking for '*' and '+' to change the group numbers based on the
number of repetitions that actually occur. This is untenable, which
should become clear as soon as another group is placed after the
looping construct:

>>> match=re.search('^([a-y])*(.*)$', 'abcz')
>>> match.groups()
('c', 'z')

Group names/numbers are assigned when the regex is compiled. They
cannot be affected by runtime information based on the string being
processed.
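
To make that concrete, here is a minimal sketch (the sample strings and
the `results` name are just for illustration) showing that the group
count is a property of the compiled pattern, available before anything
has been matched, and that the tuple returned by groups() has the same
length no matter how many times the loop runs:

```python
import re

# Same pattern as above: one looping group followed by a second group.
pattern = re.compile('^([a-y])*(.*)$')

# The number of capturing groups is known as soon as the pattern is
# compiled, before any string has been processed.
print(pattern.groups)

# However many times the first group repeats, groups() always has
# exactly pattern.groups entries; a group that never matched is None.
results = {}
for text in ('z', 'abcz', 'abcdefz'):
    results[text] = pattern.search(text).groups()
    print(text, results[text])
```

The second group stays group 2 in every case, which is what makes
match.group(2) meaningful in code written before the input is known.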

The way to handle this (while still using the re module to do the
parsing) is multi-level parsing:

>>> match=re.search('^([a-z]*)$', 'abcz')
>>> relevant = match.group(0)
>>> pattern = re.compile('([a-z])')
>>> for match in pattern.finditer(relevant):
...   print(match.groups())
...
('a',)
('b',)
('c',)
('z',)

There's no reason to try to embed the functionality of finditer() into
the regex itself (and it's utterly impractical to do so anyway).
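
If the two-step dance comes up often, it could be wrapped in a small
helper. The following is only a sketch: repeated_captures() and its
default patterns are hypothetical names chosen for this example, not
anything provided by the re module.

```python
import re

def repeated_captures(text, outer='^([a-z]*)$', inner='([a-z])'):
    """Return each repetition of `inner` within the part of `text`
    captured by group 1 of `outer` (hypothetical helper)."""
    outer_match = re.search(outer, text)
    if outer_match is None:
        # The overall structure did not match, so there is nothing
        # to iterate over.
        return []
    relevant = outer_match.group(1)
    # Second level: let finditer() do the looping.
    return [m.group(1) for m in re.finditer(inner, relevant)]

print(repeated_captures('abcz'))
print(repeated_captures('abc1'))
```

The outer pattern validates the overall structure and the inner one
extracts the repetitions, so neither has to do both jobs at once.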

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia