When Good Regular Expressions Go Bad

Fri Oct 1 11:59:07 EDT 1999

>>>>> "Tim" == Tim Peters <tim_one at email.msn.com> writes:

    Tim> I don't see how this relates to Douglas's idea, unless perhaps the
    Tim> two "shorter ones" can match only proper prefixes of strings
    Tim> matched by "the real regexp", and also one of the other.

Yes, I should have given example re's (parental guidance suggested):

First, the real thing:

    r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
    r"(?P<venue>[^,]+),"
    r"(?P<city>[^,]+),"
    r"(?P<state>[A-Za-z\s]+),"
    r"(?P<time>[^,]+),"
    r"(?P<info>.*))(?P<u>.*)"

Then, prefix 1:

    r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
    r"(?P<venue>[^,]+),"
    r"(?P<city>[^,]+),"
    r"(?P<state>[A-Za-z\s]+))"
    r"(?P<u>.*)"

Then, prefix 2:

    r'(?P<m>\s*\d+/\d[-,\d]*/\d+,\s+[^,]+)(?P<u>.*)'

    Tim> Of course people will enter highly structured data incorrectly, and
    Tim> of course you want to help them get it straight.  Those are two
    Tim> reasons not to use regexps at all <0.1 wink>.  0.9 seriously!  I
    Tim> entered:

    Tim>     3/40 @ 7:30pm, Knitting Factory, New York, NY (with Doctor
    Tim>     Nerve/Meridian Arts Ensemble

Known problem.  I'd like to get rid of regular expressions altogether, but I
don't have a better general pattern matcher at present, and I have even
fewer resources to develop one.  Consequently, I adapt what I have at hand.

    Tim> and it got rejected, without a clue as to why.  Error detection and
    Tim> recovery is a Real Pain even with a Real Parser; a mechanical trick
    Tim> like "report rightmost progress" isn't going to turn the infinitely
    Tim> feebler regexp gimmick into a solution.

That's because your input didn't match even the shortest pattern (all three
require a date with two slashes, and "3/40" has only one).  I had to stop
somewhere.  I suppose I could have added a few more re's.  With Douglas's
"return the matching prefix" suggestion I could have done it all with one
regular expression.
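
As a minimal sketch of the fallback scheme described above (try the real
regexp, then each shorter prefix, and report the longest one that matched),
something like the following would do.  The function name
rightmost_progress and the labels are hypothetical; the patterns are the
three from this message, with every fragment written as a raw string.

```python
import re

# The three patterns from this message, longest first.
FULL = re.compile(
    r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
    r"(?P<venue>[^,]+),"
    r"(?P<city>[^,]+),"
    r"(?P<state>[A-Za-z\s]+),"
    r"(?P<time>[^,]+),"
    r"(?P<info>.*))(?P<u>.*)"
)
PREFIX1 = re.compile(
    r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
    r"(?P<venue>[^,]+),"
    r"(?P<city>[^,]+),"
    r"(?P<state>[A-Za-z\s]+))"
    r"(?P<u>.*)"
)
PREFIX2 = re.compile(r"(?P<m>\s*\d+/\d[-,\d]*/\d+,\s+[^,]+)(?P<u>.*)")

# Hypothetical labels, used only for reporting.
CASCADE = (("full", FULL), ("prefix 1", PREFIX1), ("prefix 2", PREFIX2))


def rightmost_progress(line):
    """Return (label, matched, unmatched) for the longest pattern that
    matches at the start of line, or (None, "", line) if none does."""
    for label, pattern in CASCADE:
        mo = pattern.match(line)
        if mo:
            # Group "m" is the part understood, "u" is what was left over.
            return label, mo.group("m"), mo.group("u")
    return None, "", line
```

For a line such as "10/1/1999, Knitting Factory, New York, NY (with Doctor
Nerve" this returns the "prefix 1" label with "(with Doctor Nerve" as the
unmatched remainder, which is roughly the hint one would want to show the
submitter.  An input that starts with "3/40 @ 7:30pm" matches none of the
three, so it comes back entirely unmatched.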

    Tim> What you have here is a classical "frame" problem: a number of
    Tim> information "slots" that need to be filled in.  You picked a rigid
    Tim> format to ease your own implementation, perhaps because you've been
    Tim> hoodwinked into believing that "a regexp" *should* be "the
    Tim> solution".  Well, it isn't.

Perhaps I asked for this by not presenting more background in my previous
message.  I adapted an existing solution that currently handles schedules
from a couple thousand different web pages and email submissions in a very
wide range of formats (I adapt to the format I find instead of asking them
to adapt to me).  In almost all cases, I am able to parse the input with a
slightly higher-level notation:

    %{smonth}/%{days}/%{syear}\* %{st} %{venue}/%{city}%{?\(w/%{performer+}\)}%{?=%{time}=}%{?/%{info}}

The above "compiles" into a very messy regular expression that begins

    [0-9]+/[0-9]+([-,][0-9]+)*/[0-9]+\*\s+[A-Za-z.]+\s+[^/]+/[-'A-Za-z \t.]+...
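
To illustrate the "compile" step, here is a toy sketch, not the actual
translator.  It handles only plain %{name} fields and runs of spaces; the
optional %{?...} groups and the %{performer+} form are left out.  The FIELDS
table is inferred from the expansion shown above, and the names FIELDS,
TOKEN and compile_template are hypothetical.

```python
import re

# Field patterns inferred from the expansion quoted above; the real table
# is not shown in this message, so treat these as assumptions.
FIELDS = {
    "smonth": r"[0-9]+",
    "days": r"[0-9]+([-,][0-9]+)*",
    "syear": r"[0-9]+",
    "st": r"[A-Za-z.]+",
    "venue": r"[^/]+",
    "city": r"[-'A-Za-z \t.]+",
}

# Either a simple %{name} field or a run of literal spaces.
TOKEN = re.compile(r"%\{(\w+)\}| +")


def compile_template(template):
    """Expand %{name} fields and spaces; copy everything else unchanged."""
    def expand(mo):
        name = mo.group(1)
        if name is None:
            # A run of spaces in the template matches any whitespace.
            return r"\s+"
        return FIELDS[name]
    return TOKEN.sub(expand, template)


# The leading part of the template above, without the optional groups.
PREFIX_TEMPLATE = r"%{smonth}/%{days}/%{syear}\* %{st} %{venue}/%{city}"
PREFIX_REGEX = compile_template(PREFIX_TEMPLATE)
```

Under these assumptions PREFIX_REGEX comes out as the fragment shown above,
up to the trailing "...", and re.match(PREFIX_REGEX, line) accepts a made-up
line such as "10/1-3/1999* Fri Knitting Factory/New York".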

Are regular expressions the best solution in the restricted environment I
first described?  Not likely.  In the broader environment I didn't
originally describe, they work fairly well (and I'm the only one who has to
generate them).

Skip