regular expressions ... slow

Kay Schluehr kay.schluehr at gmx.net
Tue Nov 18 02:54:46 EST 2008


On 17 Nov., 22:37, Uwe Schmitt <rocksportroc... at googlemail.com> wrote:
> Hi,
>
> Is anybody aware of this post:  http://swtch.com/~rsc/regexp/regexp1.html
> ?
>
> Are there any plans to speed up Python's regular expression module?
> Or is the example in this article too far from reality?
>
> Greetings, Uwe

Some remarks.

The basic idea outlined in the article is something that gets re-
discovered from time to time. Last year I started working on a parse
tree validator, a tool that checks whether a parse tree P on which
transformative actions were performed still belongs to the language
described by a grammar G. For this check, G is initially transformed
into a set of NFAs representing the grammar's rules. The checker is a
tracing tool that selects nodes in the parse tree in order to step
through the transitions described by the NFA. If it can't select a
follow state, the validation fails.

In general there are rules which are not left-factored, such as

R: a b c | a b d

The corresponding NFA transition table looks roughly like this:

<R>:
  (R,0) -> (a,1), (a,2)
  (a,1) -> (b,1)
  (b,1) -> (c,1)
  (c,1) -> (None,'-')
  (a,2) -> (b,2)
  (b,2) -> (d,2)
  (d,2) -> (None,'-')

A valid parse tree P which corresponds to rule <R> has one of two
forms:

P1 = [R, a, b, c] or P2 = [R, a, b, d]

Suppose we want to validate P1 with our NFA. First we select the NFA
<R> using the root node-id R; R corresponds to the start state (R,0).
Next we select the follow states, which are [(a,1), (a,2)]. Projecting
onto the first component yields a.

Obviously we can shift to a in P1 and match the follow states (a,1)
and (a,2) in <R>:

P1 = [R, a, b, c]
         ^

Next we shift to b

P1 = [R, a, b, c]
            ^

Now there are two traces in <R> we can follow: one through (a,1) and
one through (a,2). Since we can't decide which one is right, we follow
*both at the same time*. This yields the follow states (b,1) and
(b,2), both of which project onto b.

Next we shift to c

P1 = [R, a, b, c]
               ^
Again we follow both traces and get (c,1) and (d,2). (c,1) matches.
Now we have to check that terminating P in this state is at least an
option. The follow state of (c,1) in <R> is (None,'-'), which is the
EXIT symbol of the NFA. The validation was successful.
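
To make this concrete, here is a minimal Python sketch of the walk
described above. The dict encoding of <R> and all the names are my own
invention for illustration, not the actual code of the validator:

NFA_R = {
    ('R', 0): [('a', 1), ('a', 2)],
    ('a', 1): [('b', 1)],
    ('b', 1): [('c', 1)],
    ('c', 1): [(None, '-')],
    ('a', 2): [('b', 2)],
    ('b', 2): [('d', 2)],
    ('d', 2): [(None, '-')],
}

def validate(tree, nfa):
    # `states` holds every NFA state reachable after consuming the
    # nodes seen so far, i.e. all traces are followed simultaneously.
    states = [(tree[0], 0)]                      # start state (R,0)
    for node in tree[1:]:
        states = [s for st in states
                    for s in nfa.get(st, ())
                    if s[0] == node]             # shift to `node`
        if not states:                           # no trace survives
            return False
    # termination must at least be an option: some trace must be able
    # to reach the EXIT symbol (None,'-')
    return any((None, '-') in nfa.get(st, ()) for st in states)

print(validate(['R', 'a', 'b', 'c'], NFA_R))     # True  (P1)
print(validate(['R', 'a', 'b', 'd'], NFA_R))     # True  (P2)
print(validate(['R', 'a', 'b'], NFA_R))          # False: EXIT unreachable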


The same NFAs used to validate parse trees can be used within a top-
down parser generator as "input tables". Such a parser generator
handles left-factoring automatically and is LL(*). It is also O(n),
where n is the length of the input token stream. Unlike parser
generators such as ANTLR it achieves LL(*) without any advanced
lookahead scheme. Instead it produces traces of NFA states and cancels
traces that cannot be continued. In our example there are two initial
traces, ((R,0)(a,1)(b,1)) and ((R,0)(a,2)(b,2)); the first one is
cancelled once d is selected, the second one once c is selected.
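
As a toy illustration of this trace bookkeeping (again my own sketch,
not the generator's code, using the NFA_R table from the sketch
above), the traces can be kept explicitly and pruned as the token
stream is consumed:

def run_traces(tokens, nfa, start):
    traces = [[start]]                     # one trace per live alternative
    for tok in tokens:
        survivors = []
        for trace in traces:
            for state in nfa.get(trace[-1], ()):
                if state[0] == tok:        # this trace can shift to tok
                    survivors.append(trace + [state])
        traces = survivors                 # all other traces are cancelled
    return traces

print(run_traces(['a', 'b', 'd'], NFA_R, ('R', 0)))
# [[('R', 0), ('a', 2), ('b', 2), ('d', 2)]]
# the ((R,0)(a,1)(b,1)) trace was cancelled once d was selected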

A full parser generator for CFGs is more complex in many respects than
a regular-expression matcher, and AFAIK the one I've built over the
last year is unique in its category. In a few respects it is also
simpler, because full regexp matchers must additionally deal with lots
of context-sensitive features that cannot be expressed in a CFG.

I used the same parsing scheme to build a lexer. There is considerable
overhead, though, because it produces one parse tree per character; I
intend to avoid this in the future with a hybrid lexer that combines
some very fast components with some slower ones. The reason I used the
parser generator in the lexer is that it has one significant advantage
over classic regexp parsing schemes: it avoids dependence on order.
Whether you write R = ab|abc or R = abc|ab is insignificant; the
longest match is always preferred unless otherwise stated. This makes
extending the grammar of the lexer very simple.
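
For contrast, Python's current re module is order dependent here,
because alternation commits to the leftmost alternative that matches
rather than to the longest one:

import re

print(re.match('ab|abc', 'abc').group())   # 'ab'  -- first alternative wins
print(re.match('abc|ab', 'abc').group())   # 'abc'

In the trace-based scheme described above both spellings would yield
the longest match 'abc'.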

All of this is prototyped in Python and is still work in progress. As
long as development has not reached a stable state I refrain from
rebuilding the system as an optimized C version. However, if someone,
e.g. a student, intends to write a master's thesis about a next-
generation regexp engine built on this stuff, or even intends to
contribute to the mentioned lexer, I would of course cooperate.

Kay

www.fiber-space.de
kay at fiber-space.de








