When Good Regular Expressions Go Bad

Tim Peters tim_one at email.msn.com
Sun Oct 3 20:04:51 EDT 1999


[Skip Montanaro continues to defend the indefensible <wink>]
> Yes, I should have given example re's (parental guidance suggested):
>
> First, the real thing:
>
>     r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
>      "(?P<venue>[^,]+),"
>      "(?P<city>[^,]+),"
>      "(?P<state>[A-Za-z\s]+),"
>      "(?P<time>[^,]+),"
>      "(?P<info>.*))(?P<u>.*)"
>
> Then, prefix 1:
>
>     r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
>      "(?P<venue>[^,]+),"
>      "(?P<city>[^,]+),"
>      "(?P<state>[A-Za-z\s]+))"
>      "(?P<u>.*)"
> [and so on]

My "slot" advice still applies:  you don't want the longest partial match if
the whole thing fails to click, you want to apply these subexpressions one
at a time!  Then you can tell the user exactly which part failed to meet the
requirements, in language they understand.  Like so:

    m = re.match(r"\s*(\d+/\d[-,\d]*/\d+),?", input)
    if not m:
        raise LameoUser("Enter a reasonable date, moron", input)
    date = m.group(1)
    input = input[m.end():]

    m = re.match(r"\s*([^,]+),", input)
    if not m:
        raise LameoUser("Enter a reasonable venue, moron", input)
    venue = m.group(1)
    input = input[m.end():]

    etc

This is still mechanical enough to push into a table-driven approach.
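That table-driven version might look something like this (the field list, the `ParseError` exception, and the message wording here are illustrative stand-ins, not Skip's actual code):

```python
import re

# One (name, pattern) pair per field, applied in order.  The patterns
# mirror the small regexps above; ParseError is a made-up stand-in for
# whatever exception the application raises at the user.
FIELDS = [
    ("date",  r"\s*(\d+/\d[-,\d]*/\d+),?"),
    ("venue", r"\s*([^,]+),"),
    ("city",  r"\s*([^,]+),"),
    ("state", r"\s*([A-Za-z\s]+),"),
    ("time",  r"\s*([^,]+),"),
    ("info",  r"\s*(.*)"),
]

class ParseError(Exception):
    pass

def parse(line):
    fields = {}
    for name, pattern in FIELDS:
        m = re.match(pattern, line)
        if not m:
            # The loop knows exactly which field it was chewing on,
            # so the complaint can name it.
            raise ParseError("Enter a reasonable %s, moron" % name, line)
        fields[name] = m.group(1)
        line = line[m.end():]   # consume the matched prefix, move on
    return fields
```

Feed it a line with a garbled state field and the exception names "state" -- no staring at column numbers required.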

Get the longest partial match instead, and you have no idea which part of
the *regexp* it finally failed in, so can't do more than point at the
character following and say "umm, my software got confused right here, for
some reason -- maybe you can guess why if you stare at it long enough".

> ...
> I'd like to get rid of regular expressions altogether, but I don't have
> a better general pattern matcher at present and fewer resources to
> develop anything.  Consequently, I adapt what I have at hand.

I wouldn't get rid of regexps -- they're powerful lexical classifiers.  The
sin with regexps is hoping against both reason and evidence that "one big
one" can do the whole job, when the job is anything more demanding than
breaking out fields of input known a priori to match.  If the input may have
errors, or the regexp may be buggy (try to write one that isn't <wink>), use
them as building blocks in a more rational approach:  get as far as you can
with little ones, and gripe when that fails.

> ...
> That's because your pattern didn't match even the shortest pattern.  I had
> to stop somewhere.  I suppose I could have added a few more re's.

Chain N small ones, one per "field", and reasonable error reporting comes
for free.

> With Douglas's "return the matching prefix" suggestion I could have done
> it all with one regular expression.

I don't think you'd be happy with the quality of error msg you can produce
from that, because you can't relate the matching prefix in any way to "the
fields" the user knows about -- at least not without rematching the failing
input to shorter and shorter regexps again, to find out which part of the
*regexp* got confused.  The next suggestion will be to return not only the
matching prefix, but as many groups as "would have been" filled in had that
been the whole match.  Then it gets more complicated.

Douglas may have a real use for this (I didn't understand his problem
description), but I think this looks like "a solution" to you only because
you haven't tried it.

"syntax-error-in-column-63"-avoiding-ly y'rs  - tim

More information about the Python-list mailing list