[Python-Dev] Re: pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching

Wed Aug 11 05:27:21 CEST 2004

"Fredrik Lundh" <fredrik at pythonware.com> writes:
> Mike Coleman wrote:
> > If I may wax anthropomorphic, the 're.match' function says to me as a programmer
> >
> >    *You* know what structure this RE represents, and *I* know what
> >    structure it represents, too, because I had to figure it out to
> >    do the match.
> 
> that only shows that you dont understand how regular expressions work.

Ouch.

> a regular expression defines a set of strings, and the RE engine is designed to
> tell you if a string you have is a member of this set.

Yes, I agree.  In particular, the re.match function "proves" that a particular
regular expression pattern matches a string or that it does not.  In the
process of doing that, each subexpression in the pattern must be matched
against some (possibly empty) substring of the string.  The re.match function
has all of this knowledge implicitly and ephemerally, and for parenthesized
groups, it even remembers this information and allows the caller access to it.

Now given all that, doesn't it seem odd that I can get complete information
about concatenated subexpressions

    >>> re.match(r'(...)(...)', 'abcdef').groups()
    ('abc', 'def')

and alternating subexpressions

    >>> re.match(r'(a..)|(z..)', 'abcdef').groups()
    ('abc', None)

but not repeating subexpressions

    >>> re.match(r'(..)*', 'abcdef').groups()
    ('ef',)

?  What's the logic of that?  That's the limitation I'd like to remove.

Now, I understand that doing this will be less efficient, which is why this is
a separate method and might not be able to share as much implementation with
re.match as would otherwise be desirable.  But I still think this is a gap
that really needs to be filled.

So, is my proposal the best way to do it?  Not necessarily.  I'd love to see a
better alternative.

> the engine's not a parser, and it has a very vague notion of "structure"
> (groups are implemented by "marks", which are registers that keeps track of
> where the engine last passed a given position; changing that to "lists of
> all possible matches" would require a major rewrite).

I was thinking that the modifications necessary wouldn't be that bad, since
the search algorithm itself is not being disturbed.  I could be wrong of
course.

(Note that we're not saving "all possible matches", just the multiple matches
involved in the successful match of a repeat group.)

> you're probably better off using the scanner mechanism:
> 
>     http://effbot.org/zone/xml-scanner.htm
> 
> or the sre.Scanner class (see the source code).  the scanner concept could
> need some documentation, and it would be nice to be able to switch patterns
> during parsing.  (time for a scanner PEP, anyone?)

I saw this but thought the code was dead.  An example or two of usage in a
docstring would be worth its weight in gold.  It looks like a useful
feature--I'd be +1 for it.

I'm not sure straightaway how to use the scanner mechanism to implement
re.structmatch, though.  I'll look at it.

Mike