When Good Regular Expressions Go Bad
Skip Montanaro
skip at mojam.com
Fri Oct 1 11:59:07 EDT 1999
>>>>> "Tim" == Tim Peters <tim_one at email.msn.com> writes:
Tim> I don't see how this relates to Douglas's idea, unless perhaps the
Tim> two "shorter ones" can match only proper prefixes of strings
Tim> matched by "the real regexp", and also one of the other.
Yes, I should have given example re's (parental guidance suggested):
First, the real thing:
r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
"(?P<venue>[^,]+),"
"(?P<city>[^,]+),"
"(?P<state>[A-Za-z\s]+),"
"(?P<time>[^,]+),"
"(?P<info>.*))(?P<u>.*)"
Then, prefix 1:
r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
"(?P<venue>[^,]+),"
"(?P<city>[^,]+),"
"(?P<state>[A-Za-z\s]+))"
"(?P<u>.*)"
Then, prefix 2:
r'(?P<m>\s*\d+/\d[-,\d]*/\d+,\s+[^,]+)(?P<u>.*)'
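A minimal sketch of how the three patterns above could be tried in order, longest first, so that a rejected line at least reports which prefix it got through. The name `diagnose` and the labels are hypothetical, and every fragment is marked raw here so that `\s` is not read as a string escape by current Python versions.

```python
import re

# The three patterns quoted above, unchanged apart from the raw-string prefixes.
FULL = re.compile(
    r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
    r"(?P<venue>[^,]+),"
    r"(?P<city>[^,]+),"
    r"(?P<state>[A-Za-z\s]+),"
    r"(?P<time>[^,]+),"
    r"(?P<info>.*))(?P<u>.*)"
)
PREFIX1 = re.compile(
    r"(?P<m>\s*(?P<date>\d+/\d[-,\d]*/\d+),?"
    r"(?P<venue>[^,]+),"
    r"(?P<city>[^,]+),"
    r"(?P<state>[A-Za-z\s]+))"
    r"(?P<u>.*)"
)
PREFIX2 = re.compile(r"(?P<m>\s*\d+/\d[-,\d]*/\d+,\s+[^,]+)(?P<u>.*)")

# Longest pattern first, so the most complete match wins.
PATTERNS = [("full", FULL), ("prefix 1", PREFIX1), ("prefix 2", PREFIX2)]


def diagnose(line):
    """Return (label, matched_text, unmatched_text) for the first pattern that matches."""
    for label, pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            return label, match.group("m"), match.group("u")
    # Nothing matched, not even the shortest prefix.
    return None, "", line
```

A line that stops after the state field would come back labelled "prefix 1", which is enough to tell the submitter roughly where the entry went wrong.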
Tim> Of course people will enter highly structured data incorrectly, and
Tim> of course you want to help them get it straight. Those are two
Tim> reasons not to use regexps at all <0.1 wink>. 0.9 seriously! I
Tim> entered:
Tim> 3/40 @ 7:30pm, Knitting Factory, New York, NY (with Doctor
Tim> Nerve/Meridian Arts Ensemble
Known problem. I'd like to get rid of regular expressions altogether, but I
don't have a better general pattern matcher at present, and I have even fewer
resources to develop one. Consequently, I adapt what I have at hand.
Tim> and it got rejected, without a clue as to why. Error detection and
Tim> recovery is a Real Pain even with a Real Parser; a mechanical trick
Tim> like "report rightmost progress" isn't going to turn the infinitely
Tim> feebler regexp gimmick into a solution.
That's because your input didn't match even the shortest pattern. I had
to stop somewhere. I suppose I could have added a few more re's. With
Douglas's "return the matching prefix" suggestion I could have done it all
with one regular expression.
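Python's standard `re` module has no "return the matching prefix" operation, and the suggestion itself is not spelled out in this message, so the following is only an approximation of the idea: a single expression in which each later field sits inside a nested optional group. The field patterns are copied from the full expression above; the names `ONE`, `FIELDS`, and `progress` are hypothetical.

```python
import re

# One expression: only the date is mandatory, and each later field is nested
# inside an optional group, so the match stops cleanly wherever the input does.
ONE = re.compile(
    r"\s*(?P<date>\d+/\d[-,\d]*/\d+)"
    r"(?:,?(?P<venue>[^,]+)"
    r"(?:,(?P<city>[^,]+)"
    r"(?:,(?P<state>[A-Za-z\s]+)"
    r"(?:,(?P<time>[^,]+)"
    r"(?:,(?P<info>.*))?)?)?)?)?"
    r"(?P<u>.*)"
)

FIELDS = ("date", "venue", "city", "state", "time", "info")


def progress(line):
    """Return (fields_found, first_missing_field, unmatched_text)."""
    match = ONE.match(line)
    if match is None:
        # Not even a date could be recognized.
        return [], FIELDS[0], line
    # Because the groups are nested, the fields found always form a prefix of FIELDS.
    found = [name for name in FIELDS if match.group(name) is not None]
    missing = FIELDS[len(found)] if len(found) < len(FIELDS) else None
    return found, missing, match.group("u")
```

The first missing field is the natural thing to report back to whoever entered the line. Note that this inherits every quirk of the original field patterns, for example a state field that is satisfied by whitespace alone.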
Tim> What you have here is a classical "frame" problem: a number of
Tim> information "slots" that need to be filled in. You picked a rigid
Tim> format to ease your own implementation, perhaps because you've been
Tim> hoodwinked into believing that "a regexp" *should* be "the
Tim> solution". Well, it isn't.
Perhaps I asked for this by not presenting more background in my previous
message. I adapted an existing solution that currently handles schedules
from a couple thousand different web pages and email submissions in a very
wide range of formats (I adapt to the format I find instead of asking them
to adapt to me). In almost all cases, I am able to parse the input with a
slightly higher-level notation:
%{smonth}/%{days}/%{syear}\* %{st} %{venue}/%{city}%{?\(w/%{performer+}\)}%{?=%{time}=}%{?/%{info}}
The above "compiles" into a very messy regular expression that begins
[0-9]+/[0-9]+([-,][0-9]+)*/[0-9]+\*\s+[A-Za-z.]+\s+[^/]+/[-'A-Za-z \t.]+...
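The compiler for this notation is not shown in the message, so the following is only a rough sketch of the general idea: each `%{name}` slot expands to a fragment, a backslash escapes a literal character, and a run of spaces becomes `\s+`. The slot table is an assumption inferred from the visible start of the compiled expression; `SLOTS`, `TOKEN`, and `compile_template` are hypothetical names, and the optional `%{?...}` and `%{performer+}` forms are not handled.

```python
import re

# Hypothetical slot table, inferred from the compiled prefix shown above.
SLOTS = {
    "smonth": r"[0-9]+",
    "days": r"[0-9]+(?:[-,][0-9]+)*",
    "syear": r"[0-9]+",
    "st": r"[A-Za-z.]+",
    "venue": r"[^/]+",
    "city": r"[-'A-Za-z \t.]+",
}

# A token is a %{slot}, a backslash-escaped character, a run of whitespace,
# or any other single character.
TOKEN = re.compile(r"%\{(\w+)\}|\\(.)|(\s+)|(.)", re.DOTALL)


def compile_template(template):
    """Expand a simple template into a compiled regex with one named group per slot."""
    parts = []
    for slot, escaped, space, literal in TOKEN.findall(template):
        if slot:
            # Unknown slot names raise KeyError rather than being guessed at.
            parts.append("(?P<%s>%s)" % (slot, SLOTS[slot]))
        elif escaped:
            parts.append(re.escape(escaped))
        elif space:
            parts.append(r"\s+")
        else:
            parts.append(re.escape(literal))
    return re.compile("".join(parts))
```

With the non-optional part of the template above, a made-up line such as `10/1-3/1999* NY Knitting Factory/New York` yields the named groups `days`, `venue`, and `city` directly, which is the main convenience of generating the expression instead of writing it by hand.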
Are regular expressions the best solution in the restricted environment I
first described? Not likely. In the broader environment I didn't
originally describe they work fairly well (and I'm the only one who has to
generate them).
Skip