Pretty Scheme, ??? Python

Tue Jul 3 09:01:09 EDT 2007

On 2007-07-02, Paul McGuire <ptmcg at austin.rr.com> wrote:
> On Jul 2, 3:56 pm, Neil Cerutti <horp... at yahoo.com> wrote:
> from pyparsing import *

It's always good when your messages start like that. ;)

> """
> Ok, here is the step-by-step, beginning with your posted BNF.
> (Based on your test cases, I think the '{}'s are really
> supposed to be '()'s.)

Yeah, the Scheme version of WAE uses curlies in examples just so
it looks slightly different from Scheme, although since the
Scheme version is built on the Scheme read function, it actually
accepts several different kinds of delimiters.

Python didn't need to do that, I thought.

My first impulse when programming this exercise was to ape the
Scheme strategy, going with a syntax analogous to Python's, using
Python's code or AST modules. But it turns out I'm not a clever
enough language designer.

Moreover, the fun of the book I mentioned is in designing the
semantics of the programs. The book basically punts parsing,
leaving it up to the 'read' function. So basically, Python gets
up to speed (except for the define-type and type-case macros)
simply by implementing a read with enough functionality for each
mini-langauge.

> ; <WAE> ::=
> ;   <num>
> ;   | { + <WAE> <WAE> }
> ;   | { - <WAE> <WAE> }
> ;   | {with {<id> <WAE>} <WAE>}
> ;   | <id>
>
> The most basic building blocks in pyparsing are Literal and
> Word. With these, you compose "compound" elements using And and
> MatchFirst, which are bound to the operators '+' and '|' (on
> occasion, Or is required, bound to operator '^', but not for
> this simple parser). Since you have a recursive grammar, you
> will also need Forward. Whitespace is skipped implicitly.
>
> Only slightly more advanced is the Group class, which will
> impart hierarchy and structure to the results - otherwise,
> everything just comes out as one flat list of tokens.  You may
> be able to remove these in the final parser, depending on your
> results after steps 1 and 2 in the "left for the student" part
> below, but they are here to help show structure of the parsed
> tokens.
>
> As convenience functions go, I think the most common are oneOf
> and delimitedList.  oneOf might be useful here if you want to
> express id as a single-char variable; otherwise, just use
> Word(alphas).
>
> At this point you should be able to write a parser for this WAE
> grammar.  Like the following 9-liner:
> """
>
> LPAR = Literal("(").suppress()
> RPAR = Literal(")").suppress()
>
> wae = Forward()
> num = Word(nums)
> id = oneOf( list(alphas) )

The above shadows 'id'; I suppose 'ident' would be better.

> addwae = Group( LPAR + "+" + wae + wae + RPAR )
> subwae = Group( LPAR + "-" + wae + wae + RPAR )
> withwae = Group( LPAR + "with" + LPAR + id + wae + RPAR + wae + RPAR )
>
> wae << (addwae | subwae | withwae | num | id)
>
> tests = """\
>  3
>  (+ 3 4)
>  (with (x (+ 5 5)) (+ x x))""".splitlines()
>
> for t in tests:
>     print t
>     waeTree = wae.parseString(t)
>     print waeTree.asList()
>     print
>
> """
> If you extract and run this script, here are the results:
>  3
> ['3']
>
>  (+ 3 4)
> [['+', '3', '4']]
>
>  (with (x (+ 5 5)) (+ x x))
> [['with', 'x', ['+', '5', '5'], ['+', 'x', 'x']]]

How can I make it barf for testcases like '(+ 2 3))'? It doesn't
seem to expect an Eof.

> Left as an exercise for the student:
> 1. Define classes NumWAE, IdWAE, AddWAE, SubWAE, and WithWAE whose
> __init__ methods take a ParseResults object named tokens (which you
> can treat as a list of tokens), and each with a calc() method to
> evaluate them accordingly.
> 2. Hook each class to the appropriate WAE class using setParseAction.
> Hint: here is one done for you:  num.setParseAction(NumWAE)
> 3. Modify the test loop to insert an evaluation of the parsed tree.

I use doctest, so it looks quite different. On the other hand, it
actually checks that the results are correct. ;)

> Extra credit: why is id last in the set of alternatives defined
> for the wae expression?

I'm not sure. When I moved it to first all my valid tests still
passed, but one of the deliberately erroneous ones caused
a different exception, e.g., '(+ 2)'. Writing my own parser does
make error handling more predictable, but probably PyParsing can
be configured to do what I want.

My AST's from the first version of the program translated easily
into your program, with almost no muss or fuss. The muss is that,
since all the __init__ functions now expect a token list instead
of named arguments, they are now cryptic, and creating AST's
manually became annoying. The fuss is that I do have to create
one in With's calc function. It should be unnecessary for the AST
objects to be so dependent upon the grammar to work correctly.
I suppose a solution would be to add another layer of abstraction
in between.

Read it and weep. The program hasn't changed much. It's still
uglier than the Scheme. Attached at the end is my original sexp
and WAE parser, for comparison with the PyParsing portion of the
program. This time I've included all the tests, but no other
useful docstrings.

""" A PyParsing solution for WAE, composed by Paul MaGuire, with
    semantics by Neil.

    >>> parse('3')
    <Num 3>

    >>> parse('(+ 3 45)')
    <Add <Num 3> <Num 45>>

    >>> parse('(with (x (+ 5 5)) (+ x x))')
    <With <Id x> <Add <Num 5> <Num 5>> <Add <Id x> <Id x>>>

    >>> parse('3').calc()
    3

    >>> parse('(+ 2)')
    Traceback (most recent call last):
        ...
    ParseException: Expected "(" (at char 4), (line:1, col:5)

    >>> parse('(+ 3 45)').calc()
    48

    >>> parse('(- 3 2)').calc()
    1

    >>> parse('(- (+ 8 4) 4)').calc()
    8

    >>> parse('(with (x (+ 5 5)) (+ x x))').calc()
    20

    >>> parse('(with (x 5) (+ 2 x))').calc()
    7

    >>> parse('(with (x x) x)').calc()
    Traceback (most recent call last):
        ...
    SyntaxError: Free identifier

"""
import operator
from pyparsing import *

class Num(object):
    def __init__(self, tokens):
        self.number = int(tokens[0])
    def __repr__(self):
        return '<Num %d>' % self.number
    def subst(self, sub_id, value):
        return self
    def calc(self):
        return self.number

class Id(object):
    def __init__(self, tokens):
        self.name = tokens[0]
    def __repr__(self):
        return '<Id %s>' % self.name
    def __eq__(self, an_id):
        return an_id.name == self.name
    def subst(self, sub_id, value):
        if sub_id == self:
            return value
        else:
            return self
    def calc(self):
        raise SyntaxError('Free identifier')

class BinOp(object):
    def __init__(self, tokens):
        self.op, self.name = {'+': (operator.add, 'Add'), 
                '-': (operator.sub, 'Sub')}[tokens[0][0]]
        self.lhs = tokens[0][1]
        self.rhs = tokens[0][2]
    def __repr__(self):
        return '<%s %r %r>' % (self.name, self.lhs, self.rhs)
    def subst(self, sub_id, value):
        self.lhs = self.lhs.subst(sub_id, value)
        self.rhs = self.rhs.subst(sub_id, value)
        return self
    def calc(self):
        return self.op(self.lhs.calc(), self.rhs.calc())

class With(object):
    def __init__(self, tokens):
        self.bound_id = tokens[0][1]
        self.named_expr = tokens[0][2]
        self.bound_body = tokens[0][3]
    def __repr__(self):
        return '<With %s %r %r>' % (self.bound_id, 
                self.named_expr, self.bound_body)
    def subst(self, sub_id, value):
        self.named_expr = self.named_expr.subst(sub_id, value)
        if sub_id != self.bound_id:
            self.bound_body = self.bound_body.subst(sub_id, value)
        return self
    def calc(self):
        tokens = [self.named_expr.calc()]
        return self.bound_body.subst(self.bound_id, Num(tokens)).calc()

LPAR = Literal("(").suppress()
RPAR = Literal(")").suppress()

wae = Forward()
num = Word(nums).setParseAction(Num)
ident = Word(alphas).setParseAction(Id)

addwae = Group( LPAR + "+" + wae + wae + RPAR ).setParseAction(BinOp)
subwae = Group( LPAR + "-" + wae + wae + RPAR ).setParseAction(BinOp)
withwae = Group( LPAR + "with" + LPAR + ident + 
        wae + RPAR + wae + RPAR ).setParseAction(With)

wae << (addwae | subwae | withwae | num | ident)

def parse(s):
    return wae.parseString(s).asList()[0]

if __name__ == '__main__':
    import doctest
    doctest.testmod()

Below is my simple sexp parser and wae adapter (most tests and
some docs removed for brevity). The hand-written scanner is a
result of the horrible performance and readability of the
regex-based scanners I experimented with.

""" sexp.py

A compiler of s-expressions into tuples.

Functions provided:

    read(sexp) Convert a Python str that represents one s-expression into a
               tuple.

    For example,

    >>> read('(+ 4 5)')
    ('+', 4, 5)

    >>> read('(+ 4 5 (- 8))')
    ('+', 4, 5, ('-', 8))

    Only s-expressions, symbols, and integers are recognized.

    Only one expression may appear in the string. Trailing
    characters result in an exception.

    >>> read('(first experssion) (second expression)')
    Traceback (most recent call last):
        ...
    ParseError: Expected EOF, got (

    EBNF:
    goal -> [ <expr> ] EOF
    sexp -> '(' { <expr> } ')'
    expr -> <sexp> | INTEGER |SYMBOL

"""

def read(sexp):
    """ Read an s-expression (in the form of a string), and return a list
    representing its parse tree.

    >>> read('')

    >>> read('400')
    400

    This parser readly only one expression at a time, unlike a
    scheme reader.

    >>> read('23 skidoo')
    Traceback (most recent call last):
        ...
    ParseError: Expected EOF, got SYMBOL

    >>> read('()')
    ()

    >>> read('(()')
    Traceback (most recent call last):
        ...
    ParseError: Expected ), got EOF

    >>> read('(+ 4 5)')
    ('+', 4, 5)

    """
    return Parser(scan(sexp)).parse()

class ParseError(SyntaxError):
    pass

class Parser(object):
    """ Parser for simple s-expressions.

    BNF:
    goal           ::= <opt_expr> EOF
    opt_expr       ::= <expr>
    opt_expr       ::= ~
    expr           ::= <sexp>
    expr           ::= INTEGER
    expr           ::= SYMBOL
    sexp           ::= '(' <opt_expr_list> ')'
    opt_expr_list  ::= <expr_list>
    opt_expr_list  ::= ~
    expr_list      ::= <expr> <expr_tail>
    expr_tail      ::= <expr> <expr_tail>
    expr_tail      ::= ~

    """
    def __init__(self, iterable):
        iterator = iter(iterable)
        self.token, self.value = iterator.next()
        self.next = iterator.next

    def match(self, want):
        if want != self.token:
            raise ParseError("Expected %s, got %s"
                    % (want, self.token))
        value = self.value
        if self.token != 'EOF':
            self.token, self.value = self.next()
        return value

    def parse(self):
        return self.goal()

    def goal(self):
        ast = self.opt_expr()
        self.match('EOF')
        return ast

    def opt_expr(self):
        if self.token == 'EOF':
            return None
        else:
            return self.expr()

    def opt_expr_list(self):
        if self.token in ['(', 'INTEGER', 'SYMBOL']:
            return self.expr_list()
        else:
            return ()

    def expr_list(self):
        return (self.expr(),) + self.expr_tail()

    def expr_tail(self):
        if self.token in ['(', 'INTEGER', 'SYMBOL']:
            return (self.expr(),) + self.expr_tail()
        else:
            return ()

    def expr(self):
        if self.token == '(':
            return self.sexp()
        elif self.token == 'INTEGER':
            return self.match('INTEGER')
        elif self.token == 'SYMBOL':
            return self.match('SYMBOL')

    def sexp(self):
        self.match('(')
        ast = self.opt_expr_list()
        self.match(')')
        return ast

# EOF is the end of the input stream.
# INTEGER is '[0-9]+'
# SYMBOL is '[^0-9][^ ]*'

def scan(sexp):
    """ Yield the tokens in sexp, in the order they are encountered.

    >>> for token in scan('()'):
    ...   print token
    ('(', '(')
    (')', ')')
    ('EOF', 'EOF')

    >>> for token in scan('(+ 5 3)'):
    ...   print token
    ('(', '(')
    ('SYMBOL', '+')
    ('INTEGER', 5)
    ('INTEGER', 3)
    (')', ')')
    ('EOF', 'EOF')

    >>> for token in scan('12a'):
    ...   print token
    Traceback (most recent call last):
        ...
    ParseError: Ill-formed INTEGER

    """
    ix = 0
    length = len(sexp)
    while ix < length:
        if sexp[ix] in '()':
            yield (sexp[ix], sexp[ix])
            ix += 1
        elif sexp[ix] in '0123456789':
            jx = ix+1
            while jx < length and sexp[jx] in '0123456789':
                jx += 1
            if jx < length and sexp[jx] not in ' \t\n()':
                raise ParseError("Ill-formed INTEGER")
            yield ('INTEGER', int(sexp[ix:jx]))
            ix = jx
        elif sexp[ix] not in ' \t\n':
            jx = ix+1
            while jx < length and sexp[jx] not in ' \t\n)(':
                jx += 1
            yield ('SYMBOL', sexp[ix:jx])
            ix = jx
        else:
            ix += 1
    yield ('EOF', 'EOF')

if __name__ == '__main__':
    import doctest
    doctest.testmod()

And finally, the WAE adapter, which is not necessarily required
by the PyParsing solution.

def parse(expr):
    """ Compile a string representing an WAE expression into an object that
    represents the abstract syntax tree of that expression. The AST provides a
    calc method that returns the result of evaluating the expression.

    >>> parse('(+ 4 5)')
    <Add <Num 4> <Num 5>>

    """
    def parse_ast(ast):
        if isinstance(ast, int):
            return Num(ast)
        elif isinstance(ast, str):
            return Id(ast)
        elif isinstance(ast, tuple):
            op = ast[0]
            if op in ['+', '-'] and len(ast) == 3:
                if op == '+':
                    return Add(parse_ast(ast[1]), parse_ast(ast[2]))
                elif op == '-':
                    return Sub(parse_ast(ast[1]), parse_ast(ast[2]))
            elif op == 'with' and len(ast) == 3 and len(ast[1]) == 2:
                return With(ast[1][0], parse_ast(ast[1][1]), parse_ast(ast[2]))
        raise SyntaxError("Ill-formed expression")
    return parse_ast(sexp.read(expr))

The isinstance calls are quite unfortunate compared to type-case.
For example, type-case will signal an error if I haven't handled
all possible cases.

-- 
Neil Cerutti