[Python-ideas] hybrid regex engine: backtracking + Thompson NFA

Sun Nov 25 11:52:37 EST 2018

Hello,

I found this topic discussed on the python-dev ML back in 2010 [1].
I'm bringing it up 8 years later with a variation.

In short: The article [2] highlight that backtracking-based regex
engine (like SRE in python) have pathological cases that run in an
exponential time of the input, while they could run in a linear time.
Not mentioned by the article is that even quadratic time can be a
problem with large inputs. Which happen pretty often when you're
looking for delimited stuff like this :
re.match(r'.*?,(.*),,', "a,"*10000)

Of course, there's a catch. Backreferences most-likely cannot be
implemented in a guaranteed linear time, and some cases of look-behind
assertions might prove difficult as well. But other features like
alternatives (foo|bar) and repetitions (.*) are no problem.

The general idea of the Thompson's algorithm is that the simulation of
the NFA is basically in all the reachable states at the same time
while parsing the input string character by character. Of course, for
regex engines that use a VM (like Python's SRE) you can also make the
execution of the equivalent byte-code be in several states at once.

The 2010 discussion seems to be about having two separate engines and
selecting the best one for a given regex. What I'm proposing here is
to resort to backtracking only for the regex features that need it.
Meaning that within a regex like r'(.*),.* .*,\1' the evaluation of
the middle part would use the Thompson's algorithm, while the \1 could
trigger the backtracking mechanism if the string doesn't match.

What do you think about it?
Has this already been discussed and rejected?
Is it just a matter of showing the code? (_sre.c seems... non-trivial)
Has this already been juged not worth the maintainance effort?

AFAIK, there's not many hybrid regex engines out there. But one
notable implementation is the third version of the Henry Spencer's
regex engine [3]. Which he doesn't seem to have documented publicly,
but postgres has done a pretty good job at reverse-engineering a high
level view of it [4]. I'm still unsure how the backtracking of some
parts of the regex interact with the multi-state evaluation of the
other parts. But at least it exists and works, so it is feasible.

Best regards,
Celelibi

[1] https://mail.python.org/pipermail/python-dev/2010-March/098354.html
[2] https://swtch.com/~rsc/regexp/regexp1.html
[3] https://github.com/garyhouston/hsrex
[4] https://github.com/postgres/postgres/tree/master/src/backend/regex