When Good Regular Expressions Go Bad

Sun Oct 3 03:06:11 EDT 1999

[Douglas Alan]
> ...
> It seems to me that even when a regular expression fails to match a
> string, you might want to know just how far it was able to get before
> getting stuck.
> ...

[Malcolm Tredinnick]
> Isn't there a problem here with which "failing match" you are going
> to report on?

Likely no more or less than there's a problem with which "succeeding match"
to pick given an ambiguous match.

> Do you take the longest failing match

I would.

> (which will probably be horrendously expensive in a complexity
> sense, since you will have to scan the entire candidate string),

Whether the engine is a DFA or NFA, it marches over the string one character
at a time, until the overall match succeeds or fails.  So long as there's a
chance of succeeding, it "moves right a character".  Whenever it moves
right, it leaves behind a partial match.  Keep track of the longest of
those.

> ...
> While I agree it's sort of a useful thing to have sometimes (error
> recovery during parsing - something like detecting possible spelling
> errors),

I don't know that it's useful.  Parsing usually uses regexps to tokenize,
and another scheme entirely to infer higher-level structure.  The latter
scheme points out the offending token if the higher-level structure isn't
matched, and that's clue enough.  If the regexps failed to match a token, I
don't expect it would do any good to say "well, the first M characters of
this stretch of input looked like a number, while the first N looked like an
identifier, and the first O like a string, and the first P like a keyword,
and ..." -- even less good to say "the first N characters looked like one of
a number, identifier, string, keyword, ... but I don't know which".

> it needs to be defined a little better before being implemented.

Ha -- these are regexps, so this is backwards:  the implementation defines
the semantics <0.5 wink>.  Let's look at .match():  A regexp R defines a
regular language L.  Given a regular language L and an integer N >= 0, the
set of N-character prefixes of the strings in L is also a regular language.
Denote that by L(N).  Then if R.match(S) fails, it can return the largest
integer i such that S[:i] is in L(i).

Cute one:  Given regexp

   .*x.*y.*z$

and string

   abcdefg

it follows that the entire string is the "longest failing match", since e.g.
abcdefgxyz would have matched and abcdefg is a prefix of that.

looking-more-useful-by-the-minute<wink>-ly y'rs  - tim