split hangs

Tim Peters tim.one at home.com
Mon Jan 1 18:42:27 EST 2001


[davide at mercurio.localdomain]
> I tried the equivalent regexp in emacs
>   <\([^<>"]+\("[^"]*"\)?\)+>
> with bigger and bigger files and it does not seem to give an
> exponential behaviour.

Haven't tried it myself, but I can believe it.

> In everything I read regexp matching is given as a linear time
> problem in n + m (where n is the size of the testing string
> and m the size of the regular expression).

Then you had better start reading better stuff <wink>.  I'll repeat the
original recommendation:  Jeffrey [note: I erroneously said Jonathan before]
Friedl's excellent book "Mastering Regular Expressions" (O'Reilly).

There are two "obvious" ways to implement a regexp engine.  One extreme
Friedl calls "the NFA" way (although he's using "NFA" in a sense that
differs from its technical meaning in the literature).  It compiles the
regexp into a data structure in O(m) time, then uses a backtracking search
engine to do the actual match.  As Friedl explains in excruciating detail,
the matching process *can* take time exponential in n, depending on exactly
what's in the regexp, what's in the string, and how the backtracking engine
is written.  Different implementations use different tricks that save them
from exponential behavior in different cases.  Here's a simple Python
example:

    import re
    find = re.compile("(x+)+y$").match

That takes time exponential in n when applied to strings of the form

    "x" * n

I can't make time for it now, but perhaps someone else would like to explain
how that happens, step by step.
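
If you just want to watch it blow up, here's a minimal timing sketch (my
own throwaway, with made-up sizes; exact numbers will vary wildly by machine
and Python version).  The heart of it is that "(x+)+" can carve n x's into
nonempty runs in 2**(n-1) distinct ways, and the engine tries every carving
before conceding that the "y" can never match:

    import re
    import time

    find = re.compile("(x+)+y$").match

    for n in range(16, 25, 2):
        start = time.time()
        find("x" * n)          # always fails -- there's no "y" to find
        print(n, round(time.time() - start, 3))

Each bump of 2 in n should roughly quadruple the time, which is the
signature of exponential behavior.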

The other extreme Friedl calls "the DFA" way (and here his meaning does
match the technical literature).  The matching process there takes O(n) time
worst case.  The rubs are that the initial processing of the regexp may take
time exponential in m (converting an NFA to a DFA can cause exponential
explosion in the number of states), and that pure DFAs can't handle
backreferences.
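
To make the state explosion concrete, here's a toy construction of my own
(not anything from Friedl's book, and nothing like a real engine):  the
language "the k-th symbol from the end is an a" has an NFA with k+1 states,
but the standard powerset construction yields 2**k reachable DFA states:

    # NFA for "(a|b)*a(a|b){k-1}":  state 0 loops forever and "guesses"
    # when it has seen the 'a' that is k-from-the-end; state k accepts.
    def nfa_for(k):
        delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
        for i in range(1, k):
            delta[(i, "a")] = {i + 1}
            delta[(i, "b")] = {i + 1}
        return delta

    # Standard powerset (subset) construction; returns the number of
    # reachable DFA states (it would count a dead state too, if one
    # ever turned up).
    def dfa_state_count(delta):
        start = frozenset([0])
        seen = {start}
        stack = [start]
        while stack:
            state = stack.pop()
            for sym in "ab":
                nxt = frozenset().union(
                    *[delta.get((q, sym), set()) for q in state])
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return len(seen)

    for k in range(1, 11):
        print(k, dfa_state_count(nfa_for(k)))   # prints k, 2**k

The DFA genuinely needs all those states:  it has to remember, for each of
the last k symbols, whether it was an 'a'.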

No free lunch.  More sophisticated schemes try to get the best of both
worlds via a hybrid.  They can get very complex.  For example, study the
source for Henry Spencer's hybrid in recent Tcl:  if you can make sense out
of two lines of it, you're way ahead of the game <wink>.

Note that the literature talks about "real" regular expressions.
Backreferences change the game profoundly.  This page links to a proof that
the matching problem for Perl regexps (and so also Python's) is NP-complete:

    http://perl.plover.com/NPC/

That means it's almost certainly not possible to do better than exponential
time in all cases (in the presence of backreferences), no matter what
implementation is used.
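
For a taste of why backreferences are so different, note that a single one
already matches a language no DFA (and no "real" regular expression) can
handle:  the "squares", strings of the form s+s.  A tiny illustration
(mine, not from that page's proof):

    import re

    # ^(.+)\1$ matches exactly the strings that are some chunk repeated
    # twice in a row.
    square = re.compile(r"^(.+)\1$").match

    print(square("abcabc") is not None)   # True:  "abc" + "abc"
    print(square("abcab") is not None)    # False: no way to halve it

The construction on that page pushes the same idea much further.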

Most languages implement NFA-flavor engines because backreferences are hard
to handle at all in a DFA framework; "compiling" to an NFA goes very fast;
NFA engines are generally pretty easy to extend and to optimize for common
special cases; for whatever reasons, in my experience most programmers find
NFA semantics (in Friedl's sense, not the technical one) easier to
understand; and there are only so many tens of thousands of lines of code
implementers want to devote to stinking regexps <0.9 wink -- but Python
already has more code in its distribution devoted to regexps than to any
other identifiable subsystem>.

stick-to-simple-regexps-for-simple-purposes-or-live-with-the-
    consequences-ly y'rs  - tim