Audience
The Basics
This section covers writing lexers and grammars, and then producing a parser from
these parts. In PyLR, a lexer is part of a parser, which simplifies the interface to
actually doing the parsing: an 'engine' takes the output of the lexer and drives
the back end of the parse. So we'll start with writing a lexer.
Frequently, lexers will return the matched text as the value in the (token, value) pair. This is the default when you subclass the provided Lexer class. However, there are many different things you may want to happen upon finding a match. For example, sometimes you will want to match something but neither use the match nor pass it on to the parser.
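The idea of (token, value) pairs, with an optional function transforming a match before it reaches the parser, can be sketched outside PyLR using plain re. The names below (patterns, tokenize, intval) are illustrative only and are not part of the PyLR API; note that PyLR passes the match object to the function, which this sketch imitates:

```python
import re

def intval(m):
    # convert the matched digits to a Python integer
    return int(m.group(0))

# Each entry: (compiled regex, token name, optional value function).
# A None token name means "match it, but produce no token" (like SKIPTOK).
patterns = [
    (re.compile(r"[0-9]+"), "INT",  intval),  # value is the converted integer
    (re.compile(r"\+"),     "PLUS", None),    # value is the matched text
    (re.compile(r"\s+"),    None,   None),    # whitespace: skipped entirely
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for pattern, name, func in patterns:
            m = pattern.match(text, pos)
            if m:
                pos = m.end()
                if name is not None:  # skipped matches produce no token
                    value = func(m) if func else m.group(0)
                    tokens.append((name, value))
                break
        else:
            raise ValueError("no match at position %d" % pos)
    return tokens
```

For example, tokenize("12 + 3") yields [("INT", 12), ("PLUS", "+"), ("INT", 3)]: the integers have been converted by the function, the operator keeps its matched text, and the whitespace disappears.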
The base class provides a single method that covers all of these options and more:

.addmatch(compiled_regex, tokenname="", function=None, flags=MAPTOK|EXECFN)

This method requires only a compiled regular expression as its argument, but in
practice a token name should be passed along with the re. This will make your
grammar more readable and easier to write later.
If the function argument is specified, the lexer will call that function with the
resulting match object as its sole argument, and will then return the function's
return value as the value in the (token, value) pair. By default, the lexer just
returns the token and the associated matched text.
The flags argument not only defaults to the reasonable MAPTOK|EXECFN, but also adapts to
the values of the other arguments you pass, so you rarely have to bother with it. The one
common reason to set it explicitly is when you want the lexer to match something but not return anything until
the next match; whitespace is usually treated this way. For this option, you use
.addmatch(re.compile(r"\s+"), "", None, Lexer.SKIPTOK). The example below uses all of these
options.
Finally, please note the call to the .seteof() method at the end of the __init__ method. This is necessary for all subclassed lexers: the parser expects the token value of EOF to be one greater than any other token value, and your lexer will not work with the parser API without this call.
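The EOF convention can be illustrated in a few lines (the token numbers here are made up for the illustration; PyLR assigns its own internally):

```python
# Suppose the lexer has assigned these token numbers via addmatch():
tokens = {"INT": 1, "PLUS": 2, "TIMES": 3, "LPAREN": 4, "RPAREN": 5}

# The parser requires EOF to be one greater than every other token value,
# which is the invariant that calling .seteof() last establishes:
EOF = max(tokens.values()) + 1
```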
Example
from PyLR import Lexer
import re, string

#
# this function will handle matches to an integer.  It passes the
# integer value to the parser and does the conversion here.
#
def intfunc(m):
    return string.atoi(m.group(0))

class mathlex(Lexer.Lexer):
    #
    # define the atomic parts with regular expressions
    #
    INT = re.compile(r"([1-9]([0-9]+)?)|0")   # matches an integer
    LPAREN = re.compile(r"\(")                # matches '('
    RPAREN = re.compile(r"\)")                # matches ')'
    TIMES = re.compile(r"\*")                 # matches '*'
    PLUS = re.compile(r"\+")                  # matches '+'
    WS = re.compile(r"\s+")                   # matches whitespace

    def __init__(self):
        #
        # initialize with the base class
        #
        Lexer.Lexer.__init__(self)
        #
        # addmatch examples
        #
        self.addmatch(self.INT, "INT", intfunc)
        for p, t in ((self.PLUS, "PLUS"),
                     (self.TIMES, "TIMES"),
                     (self.LPAREN, "LPAREN"),
                     (self.RPAREN, "RPAREN")):
            self.addmatch(p, t)
        self.addmatch(self.WS, "", None, Lexer.SKIPTOK)
        self.seteof()

# create the lexer
lexer = mathlex()

# test it with the interactivetest method
lexer.interactivetest()
When you write a grammar, you are specifying a context free grammar in normal form, with a few addons to help generate the parser in Python. In other words, you specify a series of productions. For example, to specify a very simple math grammar that will work with the above lexer, you may state something like this:
expression: expression PLUS term
    | term;
term: term TIMES factor
    | factor;
factor: LPAREN expression RPAREN
    | INT;

The identifiers in all uppercase are, by convention, terminal symbols. These will be identified by the lexer and returned to the parser. The identifiers in all lowercase are the nonterminal symbols. Each nonterminal must appear on the left somewhere. The corresponding right side may contain terminals or nonterminals. You may not have empty (epsilon) right hand sides (yet).
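To see what this grammar means, here is a small hand-written evaluator that follows the same three-level expression/term/factor structure, and therefore the same precedence (TIMES binds tighter than PLUS). This is only an illustration of the grammar; it is not how PyLR works, since PyLR generates an LR parser rather than a recursive-descent one:

```python
import re

def tokenize(text):
    # crude tokenizer: integers and the single-character operators
    return re.findall(r"[0-9]+|[()+*]", text)

def parse_expression(toks, i):
    # expression: term (PLUS term)*  -- iterative form of the left recursion
    value, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == "+":
        right, i = parse_term(toks, i + 1)
        value = value + right
    return value, i

def parse_term(toks, i):
    # term: factor (TIMES factor)*
    value, i = parse_factor(toks, i)
    while i < len(toks) and toks[i] == "*":
        right, i = parse_factor(toks, i + 1)
        value = value * right
    return value, i

def parse_factor(toks, i):
    # factor: LPAREN expression RPAREN | INT
    if toks[i] == "(":
        value, i = parse_expression(toks, i + 1)
        return value, i + 1          # skip the closing ')'
    return int(toks[i]), i + 1

def evaluate(text):
    value, _ = parse_expression(tokenize(text), 0)
    return value
```

With this structure, evaluate("1 + 2 * 3") gives 7, not 9: the term rule consumes 2 * 3 before the expression rule ever sees the addition, which is exactly the precedence the grammar encodes.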
Whenever the parser recognizes a production, it will call a function. You may specify the name of the method of the parser class to be invoked for a production by adding a parenthesized name to the right of the production. The above grammar rewritten with method name specifications looks like this (this part will become clearer after the next step, stay with it!):
expression: expression PLUS term (addfunc)
    | term;
term: term TIMES factor (timesfunc)
    | factor;
factor: LPAREN expression RPAREN (parenfunc)
    | INT;
Those methods must have the name specified in the grammar you wrote. For example, if you built a parser for the above grammar, in order for it to actually add things together, you would have to subclass the class that was produced and then define the methods addfunc, timesfunc, and parenfunc. When each of these methods is called it will be passed the values on the right hand side of the corresponding production as arguments. Those values are either the value returned by the lexer, if the symbol is terminal, or a value returned by one of these special methods, if the symbol is a nonterminal.
In the above example, since each of the remaining productions has only one symbol on the right, it doesn't really matter whether they have methods; the parser just calls a reasonable default.
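The calling convention can be sketched without PyLR at all: when a production with n symbols on its right-hand side is recognized, the corresponding method receives those n values as arguments. The class below and the hand-simulated call sequence are illustrative only, showing the calls an LR parser would make for the input "2 * (3 + 4)" (innermost productions reduce first):

```python
class Actions:
    # one method per named production; arguments are the RHS values
    def addfunc(self, left, plus, right):
        return left + right
    def timesfunc(self, left, times, right):
        return left * right
    def parenfunc(self, lp, expr, rp):
        return expr

a = Actions()
inner = a.addfunc(3, "+", 4)             # expression: expression PLUS term
grouped = a.parenfunc("(", inner, ")")   # factor: LPAREN expression RPAREN
result = a.timesfunc(2, "*", grouped)    # term: term TIMES factor
```

Terminal symbols (PLUS, LPAREN, ...) arrive as the values the lexer returned, while nonterminals arrive as whatever the earlier method calls returned; here result ends up as 14.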
As you can see, we've defined most of what is necessary for building a parser. But the above should tell you that there are a few other things that you may want to define, like the name of the class that is produced, or what lexer is used with the parser. Describing these things along with a grammar like the example above is writing a parser specification for PyLR. A reasonable parser specification for the example we've been following:
_class SimpleMathParser
_lex mathlex.mathlex()
_code from PyLR.Lexers import mathlex
"""
expression: expression PLUS term (addfunc)
    | term;
term: term TIMES factor (timesfunc)
    | factor;
factor: LPAREN expression RPAREN (parenfunc)
    | INT;
"""

The _class keyword defines the name of the class that the parser will take. The _lex keyword defines the code used to initialize that parser's lexer. The _code keyword defines extra code at the top of the output file; multiple instances of this keyword will cause the extra source code (in Python) to be accumulated. The triple quotes delimit the grammar section.
Please note, the above syntax is subject to change as this is an alpha release and I feel that it can be improved upon.
Now you can create a parser. Just use the pgen.py script and it will output your source code:
pgen.py mathparserspec tst.py

chronis 3:34am $ python
Python 1.5b1 (#1, Nov 27 1997, 19:51:47)  [GCC 2.7.2] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import tst
>>> dir(tst)
['PyLR', 'SimpleMathParser', '__builtins__', '__doc__', '__file__',
 '__name__', '_actiontable', '_gototable', '_prodinfo', 'mathlex']
>>> print tst.SimpleMathParser.__doc__
this class was produced automatically by the PyLR parser generator.
It is meant to be subclassed to produce a parser for the grammar

expression -> expression PLUS term (addfunc)
    | term; (unspecified)
term -> term TIMES factor (timesfunc)
    | factor; (unspecified)
factor -> LPAREN expression RPAREN (parenfunc)
    | INT; (unspecified)

While parsing input, if one of the above productions is recognized,
a method of your sub-class (whose name is indicated in parens to the
right) will be invoked. Names marked 'unspecified' will not be invoked.

usage:

class MySimpleMathParser(SimpleMathParser):
    # ...define the methods for the productions...

p = MySimpleMathParser(); p.parse(text)

>>> class MP(tst.SimpleMathParser):
...     def __init__(self):
...         tst.SimpleMathParser.__init__(self)
...     def addfunc(self, left, plus, right):
...         print "%d + %d" % (left, right)
...         return left + right
...     def parenfunc(self, lp, expr, rp):
...         print "handling parens"
...         return expr
...     def timesfunc(self, left, times, right):
...         print "%d * %d" % (left, right)
...         return left * right
...
>>> mp = MP()
>>> mp.parse("4 * (3 + 2 * 5)")
2 * 5
3 + 10
handling parens
4 * 13
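As a sanity check, the printed lines trace the bottom-up order of the reductions, and the value of the whole parse can be confirmed by redoing the arithmetic in the same order:

```python
# The reductions printed in the transcript, in order:
step1 = 2 * 5       # "2 * 5"  -> 10, innermost multiplication
step2 = 3 + step1   # "3 + 10" -> 13, the parenthesized expression
step3 = 4 * step2   # "4 * 13" -> 52, the value of the whole input
```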