[Python-Dev] Re: pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching

Tue Aug 10 03:38:08 CEST 2004

"Stephen J. Turnbull" <stephen at xemacs.org> writes:
> >>>>> "Mike" == Mike Coleman <mkc at mathdogs.com> writes:
>     Mike>     m0 = re.match(r'([A-Z]+|[a-z]+)*', 'XxxxYzz')

> Sure, but regexp syntax is a horrible way to express that.

Do you mean horrible compared to spelling it out using a Python loop that
walks through the string, or horrible compared to some more serious parsing
package?

For the former, I would disagree.  I see code like this a lot and it drives me
crazy.  Reminds me of the bad old days of building 'while' loops out of 'if's
and 'goto's.
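
For concreteness, here is a rough sketch of the two options for the example
string above. The variable names (s, last_only, pieces, run) are made up for
illustration; only the pattern and the input come from the thread. It shows
that a repeated group currently reports just its final repetition, and what
the hand-written loop looks like.

```python
import re

s = 'XxxxYzz'

# With the current re module, a repeated group keeps only its last repetition.
m0 = re.match(r'([A-Z]+|[a-z]+)*', s)
last_only = m0.group(1)

# The spelled-out alternative: walk the string one run of letters at a time.
run = re.compile(r'[A-Z]+|[a-z]+')
pieces = []
pos = 0
while pos < len(s):
    m = run.match(s, pos)
    if m is None:
        break  # hit something that is neither an upper- nor lower-case run
    pieces.append(m.group(0))
    pos = m.end()

print(last_only)  # 'zz'
print(pieces)     # ['X', 'xxx', 'Y', 'zz']
```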

For the latter, I think it depends on the complexity of the matching, and the
level of effort required to learn and distribute the "not-included" parsing
package.  I certainly wouldn't want to see someone try to write a language
front-end with this, but for a lot of text-scraping activities, I think it
would be very useful.

> This feature would be an attractive nuisance, IMHO.

I agree that, like list comprehensions (for example), it needs to be applied
with good judgement.

Turning it around, though, why *shouldn't* there be a good mechanism for
returning the multiple matches for multiply matching groups?  Why should this
be an exception?  If you agree that there should be a mechanism, it certainly
doesn't have to be the one in the PEP, but what would be better?  I'd welcome
alternative ideas here.

>     Mike>     p = r'((?:(?:^|:)([^:\n]*))*\n)*\Z'
> 
> This is an _easy_ one, but even it absolutely requires being written
> with (?xm) and lots of comments, don't you think?

I think it's preferable--that's why I did it.  :-)
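
As a sketch of what such a commented form might look like (not necessarily
the layout used in the PEP), the same pattern can be written with re.VERBOSE
and re.MULTILINE. The names P_COMPACT, P_VERBOSE and the sample strings are
assumptions made up for this illustration.

```python
import re

# The one-line form from the thread, compiled with MULTILINE so that '^'
# also matches just after each newline.
P_COMPACT = re.compile(r'((?:(?:^|:)([^:\n]*))*\n)*\Z', re.MULTILINE)

# The same pattern spread out and commented.
P_VERBOSE = re.compile(r"""
    (                    # group 1: one complete line, newline included
      (?:
        (?:^|:)          # a field begins at line start or after a colon
        ([^:\n]*)        # group 2: the field text, no colons or newlines
      )*
      \n                 # every line must be newline-terminated
    )*
    \Z                   # and the lines must cover the whole input
""", re.VERBOSE | re.MULTILINE)

good = 'root:x:0\nmike:y:1\n'   # hypothetical colon-delimited input
bad = 'root:x:0'                # no trailing newline, so no match

for sample in (good, bad, ''):
    # Both spellings should accept and reject the same inputs.
    assert bool(P_COMPACT.match(sample)) == bool(P_VERBOSE.match(sample))
```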

> If you're going to be writing a multiline, verbose regular expression, why
> not write a grammar instead, which (assuming a modicum of library support)
> will be shorter and self-documenting?

If there were a suitable parsing package in the standard library, I agree that
this would probably be a lot less useful.

As things stand right now, though, it's a serious irritation that we have a
standard mechanism that *almost* does this, but quits at the last moment.  If
I may wax anthropomorphic, the 're.match' function says to me as a programmer:

    *You* know what structure this RE represents, and *I* know what
    structure it represents, too, because I had to figure it out to 
    do the match.  But too bad, sucker, I'm not going to tell you what
    I found!

Irritating as hell.
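
To make the complaint concrete, here is a small sketch using the pattern
quoted earlier. The sample text and the names (p, text, flat, rows) are
hypothetical. The match succeeds, so the engine has walked every line and
every field, yet the match object exposes only one slot per group, and the
nested structure has to be rebuilt by hand afterwards.

```python
import re

p = re.compile(r'((?:(?:^|:)([^:\n]*))*\n)*\Z', re.MULTILINE)
text = 'root:x:0\nmike:y:1\n'   # hypothetical sample input

m = p.match(text)
assert m is not None            # the whole input fits the pattern

# One slot per group, no matter how many lines or fields were matched.
flat = m.groups()
print(len(flat))                # 2

# Recovering lines and fields means parsing the text a second time.
rows = [line.split(':') for line in text.splitlines()]
print(rows)                     # [['root', 'x', '0'], ['mike', 'y', '1']]
```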

Mike