[Python-Dev] Re: pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching

Thu Aug 12 06:25:38 CEST 2004

"Stephen J. Turnbull" <stephen at xemacs.org> writes:
> I didn't have that in mind, but I can live with it.  Regexp doesn't
> compare well even with that when you consider error-checking and
> program maintenance.

It's true that regexp matching is pretty much a binary thing--either it does
match or it doesn't.  For human generated files, this isn't so good, but my
domain of interest is machine-generated files, where I "know" that the pattern
will match.  If the match fails in this case, I'm happy enough to just detect
it and bail, possibly with an error message that says "looks like there's a
problem somewhere around char x in line y".

Re maintenance, yeah regexp is pretty terse and ugly.  Generally, though, I'd
rather deal with a reasonably well-considered 80 char regexp than 100 lines of
code that does the same thing.

> It's not obvious to me how to make grammar rules pretty in Python, but
> implementing an SLR parser-generator should be at most a couple of
> days' work to get something you could live with.

There are several of these packages available and I'm all in favor of their
use, particularly for problems of sufficient complexity.  These are currently
hindered, though, by the fact that none have been elected for inclusion in the
standard library.

Furthermore, since we have 're', and it's not going away, I'd really like to
fix this repeated match deficiency, one way or another.

> I agree it would be useful.  I think that a parser generator would be
> equally useful for this, as easy to learn as regexps are (assuming
> somebody comes up with a tasteful way to express grammar rules), and
> more powerful with even a basic implementation.

I will be +1 in favor if someone wants to pursue it.

> BTW, do you have a sample implementation for re.structmatch?  Fredrik
> seems to have some doubt about ease of implementation.

Erik Heneryd posted an intriguing Python prototype, which I'm still looking
at.

>     Mike> I agree that, like list comprehensions (for example), it
>     Mike> needs to be applied with good judgement.
> 
> I don't see the analogy to list comprehensions.

Just that although list comprehensions are cool, you can still make an
unreadable mess with them, as I've found myself doing occasionally.  (nesting
complex comprehensions four levels deep, etc.)

> My objection is that it throws away a lot of structure, and therefore
> is liable to return the _wrong_ parse, or simply an error with no hint
> as to where the data is malformed.

Hmm.  Regarding the lack of error diagnosis, I'm not too concerned about this,
for the reason I mention above.  When 're.structmatch' does fail, though, it
returns a "farthest match position", which will usually be of some aid, I
would think.

Regarding getting the parse wrong, sure, this could happen.  Users will have
to be judicious.  The answer here is to write RE's with some care, realize
that some matches may require further checking after being returned by
re.structmatch, and further realize that some parsing problems are too complex
for this method and require a grammar-based approach instead.  This doesn't
really seem worse than the situation for re.match, though, to me.

> Because they're all too often not "multiple matches".  They're matches
> for different objects that happen to match the same regular
> expression, and the notation for the parser should express that
> without requiring comments.  re.structmatch _encourages_ a style where
> users deliberately throw away structure; your /etc/group parser is an
> example.  (See below.)

Hmm.  You're quibbling with my example.  Yes, I could have written the
expression to "know" that there were only four fields and that the third was a
number, etc.  I chose not to for pedagogical purposes, but it could easily
have been done the other way.  And in practice the programmer can choose the
specificity of the match depending on exactly what they're trying to
accomplish.  Is this really different than what would happen with grammars?

> Furthermore, since this is going to be recursive, users are going to
> end up with "list of list of ..." nested to arbitrary depth, depending
> on how complex the regexp is.  Isn't that what we have Lisp for? <wink>

The shape of the result follows directly from the shape of the grouping in the
RE, yes.  This is construed as a benefit.  And yes, Lisp's backtick (? it's
been a while) expressions were part of what inspired me on this.

> I suppose you would be satisfied if you could represent
> 
> file := file line
> line := group ':' passwd ':' number ':' any '\n'
> group := '[A-Za-z0-9]+'
> passwd := '[A-Za-z0-9]+'
> number := '[0-9]+'
> any := '.*'
> 
> by
> 
> '(([A-Za-z0-9]+:)*.*\n)*'

This isn't really fair.  If you know that that's what you want to parse, then
you could use a more appropriate re.structmatch equivalent like

    p = r'''(?xm)           # VERBOSE, MULTILINE
            (
              (?P<group> [A-Za-z0-9]+)
              (?P<passwd> [A-Za-z0-9]+)
              (?P<number> [0-9]+)
              (?P<any> .*)
              \n
            )*
            \Z              # assert that the entire
                            #   input was matched
         '''

which will return the submatches tagged by those group names.

> improving 'passwd' to check for the characteristic signature of a
> crypted password or a "no password authentication" flag would be
> annoying with re.structmatch, while it's trivial with a parser.

If this signature can be expressed as an RE, then I suspect it would be easy
enough to add.  If not, then probably you'd have to check with code after the
fact, even if you're using a grammar parser.  Or am I missing something?

> I grant that a full-blown parser generator is overkill for your target
> set of applications.  But I stick with my judgement that regexps as
> they are are an attractive nuisance, and re.structmatch would make the
> situation worse in proportion to its usage.  I hope your proposal will
> be superseded either by a scanner package, or even by a parser package.

I think you're saying that you just don't care for regexps, period.  This
reminds me of jwz's quip ("now you have two problems"), and I have some
sympathy for this point of view, but practically speaking 're' is here and
it's not going away any decade soon.  Given that, I'd like to fix this
multiple match deficiency.

I'm in favor of the scanner package, but it doesn't cover the same problem
space as re.structmatch, so I don't see it as a replacement.  A parser package
*might* be, but I don't see any sign that that's coming any time soon.  (I'd
be happy enough to be wrong about this.)

Cheers,
Mike