lexing nested parenthesis

Andrae Muys amuys at shortech.com.au
Mon Jul 29 22:45:45 EDT 2002


Dave Cinege <dcinege at psychosis.com> wrote in message news:<mailman.1027915180.16513.python-list at python.org>...
> It's nice when you open up your reference book, and immediately
> find a section on what you want to do. 
> 
> It sucks large donkey rocks when that section is titled:
> "Difficulties and Impossiblies" (Mastering Regular Expressions)
> 
> I've been playing with shlex, and think in my application I'm
> better off minimizing or avoiding it. 
> 
> Primarily I want to do search, evaluate, modify, and replace
> on a line by line basis. The diffculty comes with matching
> arbitrary nested constructs, IE nested parenthesis.
> 
> IE
> 	if 1 and (var1 or ?(-d /etc/)):

Congratulations, you've just discovered the classic example of a pattern
that regexes can't match no matter how hard you try[0] :).

Regular Expressions are traditionally compiled into either a DFA or an
NFA[1], which are just two different types of state machine.  The input
then gets fed into the state machine, and you either reach an "accept
state" (ie. the regex matched), or you reach an "error state"/run out
of input (ie. it didn't ;).

Looking closer at a DFA: as it receives input, it just uses the input
symbol and its current state to look up the next state in a table, and
repeats.  In pseudo-code:

def executeDFA(dfaTable, acceptStates, text):
    state = dfaTable[(None, None)]    # start state
    for c in text:
        try:
            state = dfaTable[(state, c)]
        except KeyError:
            return "No Match"   # no table entry: the "error state"
        if state in acceptStates:
            return "Match"      # an accept state signals the pattern matched

    return "No Match"           # if we run out of input, no pattern matched

(ok, so it's 'executable pseudo-code' ;)
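To make that concrete, here's a self-contained, runnable version with a
hypothetical table for the regex "ab+" (the state names and the
dict-of-tuples layout are my own invention for illustration):

```python
def run_dfa(table, accept, text):
    state = table[(None, None)]       # look up the start state
    for c in text:
        try:
            state = table[(state, c)]
        except KeyError:
            return "No Match"         # missing entry == error state
        if state in accept:
            return "Match"            # reached an accept state
    return "No Match"                 # ran out of input first

# DFA for the regex "ab+": S0 --a--> S1 --b--> S2, S2 --b--> S2
table = {(None, None): "S0",
         ("S0", "a"): "S1",
         ("S1", "b"): "S2",
         ("S2", "b"): "S2"}

print(run_dfa(table, {"S2"}, "abb"))  # Match
print(run_dfa(table, {"S2"}, "ax"))   # No Match
```

Note the table is finite and fixed before the input arrives, which is
exactly the limitation discussed next.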

The key issue that concerns you is that there is only so much
information a single state and one character of input can provide.  An
arbitrarily nested structure (such as your nested parentheses) requires
an equivalent amount (ie. arbitrary) of information so it can track the
current nesting level.  The fixed information capacity of a DFA/NFA
simply isn't enough :(.

So when our fixed data-structure runs out of space, it's time to move
to a dynamic data-structure.  In this case we go looking for a stack,
and we find two readily available options.

We can use the call stack.  Each time we find something 'interesting'
(in this case an opening parenthesis), we call a method that will
continue running the state-machine until it finds a matching
'interesting' feature (ie. a closing parenthesis) and then return.  So
the call stack will track the number of ('s we have seen, and as we
find )'s we'll unwind the stack, and so we end up with enough
information to track the nesting level... and incidentally have
written a recursive descent parser.
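A minimal sketch of that idea (my own illustration, not anything from
the original post): each call to `match_group` consumes characters
until it sees the `)` that balances the `(` that triggered the call, so
the nesting depth lives entirely in the call stack.

```python
def match_group(chars):
    """Consume characters up to the ')' matching the '(' just read.

    Returns the text of the group.  The call stack itself tracks the
    nesting depth: one stack frame per unclosed '('.
    """
    out = []
    for c in chars:
        if c == "(":
            out.append("(" + match_group(chars) + ")")  # recurse: one level deeper
        elif c == ")":
            return "".join(out)                         # unwind: one level back up
        else:
            out.append(c)
    raise ValueError("unbalanced parentheses")

# `iter` gives a shared cursor, so recursive calls resume where we left off
text = iter("?(-d (/etc/ or /tmp/))x")
assert next(text) == "?" and next(text) == "("
print(match_group(text))   # -d (/etc/ or /tmp/)
```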

Or we can use an external stack: as we find interesting features we
push them onto the stack, and when we see that the top of the stack
looks like a complete interesting feature, we pop it off (and possibly
push onto the stack the fact that we've seen a complete interesting
feature here).  Actually recognising the 'complete interesting feature'
on the top of the stack is a little involved, but effectively that's
what LR parsers do (ie. yacc and friends).
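The same parenthesis example done with an explicit stack (again my own
sketch; real LR parsers drive this shift/reduce loop from generated
tables rather than hard-coded tests):

```python
def parse_groups(text):
    """Shift each character onto a stack; on ')', reduce everything back
    to the matching '(' into one nested-list 'group' item."""
    stack = []
    for c in text:
        if c == ")":
            group = []
            while stack and stack[-1] != "(":
                group.append(stack.pop())   # reduce: pop the group body
            if not stack:
                raise ValueError("unbalanced ')'")
            stack.pop()                     # discard the matching '('
            group.reverse()
            stack.append(group)             # push the completed group
        else:
            stack.append(c)                 # shift
    if "(" in stack:
        raise ValueError("unclosed '('")
    return stack

print(parse_groups("a(b(c)d)e"))   # ['a', ['b', ['c'], 'd'], 'e']
```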

> I want to find ?(.*), but not runneth under or over.
> Evaluating a token at a time will become disgusting. It also
> seems redundant as you can see it's the same parsing rules as
> Python has. (And in fact that is what the line will be
> converted to.)

...and you are right.  Surprise surprise!  You need a parser to parse
parsing rules ;).  What you have at the moment (regexes and shlex) is a
lexer.  Lexers just aren't smart enough to track the context as you
require[2].

> Can anyone recommend a 'Better Way'? I'm thinking maybe subclassing
> generate_tokens.

Write a simple recursive descent parser that sits on top of shlex and
handles the parentheses.
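Something along these lines (a hypothetical sketch: in shlex's default
mode, punctuation like parens comes back as single-character tokens,
though characters such as '?' and '/' in your example would need the
lexer's wordchars tuned for a real grammar):

```python
import shlex

def parse_expr(lexer, depth=0):
    """Recursive-descent layer over shlex: collect tokens into a list,
    recursing at '(' and returning at the matching ')'."""
    items = []
    while True:
        tok = lexer.get_token()
        if tok == lexer.eof:                       # '' in non-posix mode
            if depth:
                raise SyntaxError("unclosed '('")
            return items
        if tok == "(":
            items.append(parse_expr(lexer, depth + 1))  # one level deeper
        elif tok == ")":
            if not depth:
                raise SyntaxError("unbalanced ')'")
            return items                                # matching close
        else:
            items.append(tok)

lexer = shlex.shlex("if 1 and (var1 or (a b))")
print(parse_expr(lexer))   # ['if', '1', 'and', ['var1', 'or', ['a', 'b']]]
```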

Andrae Muys

[0] My understanding of Regular Expressions is restricted to
undergraduate compilers augmented with a little of my own reading.  I
recognise that modern regex engines most likely go beyond the basic
DFA/NFA approach I'm personally aware of.  I'd be interested in
learning exactly how, but time of course is always an issue :).

[1] Deterministic Finite Automata or Non-deterministic Finite
Automata.  ie. a state machine that always knows where to go next, vs.
one that doesn't.  Either way, they're both just finite state machines.

[2] "Yep that's handwriting.  Recognise it anywhere.  All those
squiggles and dots and stuff".


