When Good Regular Expressions Go Bad

Tim Peters tim_one at email.msn.com
Thu Sep 30 23:05:51 EDT 1999


[Douglas Alan]
> It seems to me that even when a regular expression fails to match a
> string, you might want to know just how far it was able to
> get before getting stuck.

[Tim sez probably easy to add, but seems of dubious utility]

[Skip Montanaro]
> Take a look at
>
>     http://www.musi-cal.com/fast-itineraries.shtml

OK, I have.

> I want to let people set up schedule submissions and give them some help
> when they format their schedules incorrectly.  Currently I use three
> regular expressions per selected format, the one that will match a
> correctly formatted line and two shorter ones.  When the real thing fails
> I use the shorter ones to try and give the user some idea of where they
> might have gone astray.  Of course, users aren't told that regular
> expressions underlie the pattern matcher.

I don't see how this relates to Douglas's idea, unless perhaps the two
"shorter ones" can match only proper prefixes of strings matched by "the
real regexp", and one of them only proper prefixes of the other.
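
To make the "proper prefixes" reading concrete, here is a rough sketch in
which each shorter pattern is literally a prefix of the full one.  The piece
patterns and the name first_failing_piece are invented for illustration;
they are not the patterns the site actually uses.

```python
import re

# Hypothetical pieces of one line format: date, "@", time, then a venue.
# Concatenating the first k pieces gives a pattern that matches only
# prefixes of what the full pattern matches.
PIECES = [
    ("date", r"\d{1,2}/\d{1,2}"),
    ("separator '@'", r"\s*@\s*"),
    ("time", r"\d{1,2}:\d{2}\s*[ap]m"),
    ("venue", r",\s*\S.*"),
]

def first_failing_piece(line):
    """Name the first piece whose cumulative prefix pattern fails, else None."""
    prefix = ""
    for name, piece in PIECES:
        prefix += piece
        if re.match(prefix, line) is None:
            return name
    return None
```

Under those assumptions, a line with "7.30pm" in place of "7:30pm" is
reported as failing at "time", which is about all that "how far did it get"
can say.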

Of course people will enter highly structured data incorrectly, and of
course you want to help them get it straight.  Those are two reasons not to
use regexps at all <0.1 wink>.  0.9 seriously!  I entered:

    3/40 @ 7:30pm, Knitting Factory, New York, NY (with Doctor
    Nerve/Meridian Arts Ensemble

and it got rejected, without a clue as to why.  Error detection and recovery
is a Real Pain even with a Real Parser; a mechanical trick like "report
rightmost progress" isn't going to turn the infinitely feebler regexp
gimmick into a solution.

What you have here is a classical "frame" problem:  a number of information
"slots" that need to be filled in.  You picked a rigid format to ease your
own implementation, perhaps because you've been hoodwinked into believing
that "a regexp" *should* be "the solution".  Well, it isn't.  It may be
*part* of a solution, though:  for each slot that needs to be filled in, use
a tiny regexp to search over the input, removing the data that matches, then
moving on to the next tiny regexp.  In this way, e.g., you would first
search for "a date", and in my example above would properly complain back to
me that you couldn't find a properly formed date.  It's a little more code
for you to write, but much better for your users, and much more flexible
over the long term.
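
A minimal sketch of the slot-at-a-time idea, assuming just two slots.  The
slot names, the patterns, and fill_slots are all made up for illustration; a
real schedule format would need more slots and more forgiving patterns.

```python
import re

# Hypothetical slot patterns: one tiny regexp per piece of information.
SLOTS = [
    # month/day, with the day limited to 1..31
    ("date", re.compile(r"\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])\b")),
    # 12-hour clock time followed by am or pm
    ("time", re.compile(r"\b(0?[1-9]|1[0-2]):[0-5]\d\s*[ap]m\b", re.IGNORECASE)),
]

def fill_slots(line):
    """Return (found, missing, leftover) for one schedule line."""
    found, missing = {}, []
    rest = line
    for name, pattern in SLOTS:
        m = pattern.search(rest)
        if m is None:
            # This slot could not be filled; remember it for the error message.
            missing.append(name)
            continue
        found[name] = m.group(0)
        # Remove the matched text so later patterns cannot reuse it.
        rest = rest[:m.start()] + " " + rest[m.end():]
    return found, missing, rest

found, missing, rest = fill_slots("3/40 @ 7:30pm, Knitting Factory, New York, NY")
for name in missing:
    print("Couldn't find a properly formed %s." % name)
```

With these patterns the "3/40" line fills the time slot but not the date
slot, so the complaint names the date instead of rejecting the whole line
without comment.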

> compellingness-is-in-the-eyes-of-the-programmer-ly y'rs,

usability-is-in-the-frustration-of-the-user-ly y'rs  - tim





