Building parser modules with flex and bison

Tommy Marcus McGuire mcguire at cs.utexas.edu
Fri Aug 18 19:36:41 EDT 2000


If you're interested, and I kind of pity you if you are, I've got some
.h files that, when included in flex or bison inputs with some suitable
extra verbiage, create Python modules for scanning and parsing.

The source and an example is at:
 http://www.cs.utexas.edu/users/mcguire/software/fbmodule/

I'll tack the README on the end of this message, in case anyone wants
to know more.

Tommy McGuire
mcguire at cs.utexas.edu

------

FlexBisonModule-1.0

Use Flex and Bison to create Python modules for parsing strings.

http://www.cs.utexas.edu/users/mcguire/software/fbmodule/


RATIONALE:

 - Some people _like_ using flex and bison.  (Yes, I pity them, too.
(Actually, fooling with flex is kind of fun, though.))

 - flex and bison are both reasonably standard, reasonably popular, and
pretty well documented.  (See O'reilly's _lex & yacc_ by John R. Levine,
Tony Mason and Doug Brown (preferrably the second edition, which is much
improved over the first) as well as the flex and bison docs.)

 - flex and bison are pretty fast (in their own way, of course).  If
more speed is needed when using both, it should be possible to merge the
two modules into a single C module without changing any of the
scanning or grammar rules.

 - flex and bison are flexible.  If the speed of the Python program is
not sufficient, you'll have debugged grammar rules which can be used
to rewrite the program entirely in C.


REQUIREMENTS:

This sofware has been tested with:

 - flex 2.5.4
 - bison 1.28
 - gcc 2.95.2
 - Python 1.5.2

under Linux.  FlexModule uses flex's C++ option to create a Python
type.  BisonModule uses GCC's variable argument macros, if nothing
else, and therefore probably won't work with another C compiler.


FILES:

BisonModule.h - C header file which should be included by a Bison
grammar specification.

FlexModule.h - Similarly, a C++ header file used by a Flex scanner
specification.

Makefile.pre.in - Modified version of Makefile.pre.in from Python
1.5.1; changed to get the C++ Flex output to compile and with rules to
run flex and bison to produce .cc and .c files.

Setup.in - Template for a Setup.in file.

Symbols.py - Sample Python code for Symbol (as in a non-terminal Bison
grammar symbol) and Token classes (a subclass of Symbol, for terminal
Flex symbols).


INSTALLATION:

Copy BisonModule.h, FlexModule.h, Symbols.py, and Makefile.pre.in to
the directory where you will build the modules.  Create a Setup.in and
type "make -f Makefile.pre.in boot".


USE:

The two .h files are used similarly, by including them into the bison
or flex input and adding some declarations and a macro call at the
end.


Compiling with FlexModule:

Use "%option c++" to get flex to generate a C++ scanner.  The actual
flex rules for tokens are pretty much as normal for flex; just return a
unique token type integer.  At the end of the flex input file, add two
things: An array of TokenValues, which provide a map between strings and
the numerical token types, and a call to the FLEXMODULEINIT macro.  For
example:

...
%%

TokenValues module_tokens[] = {
        {"NUMBER", NUMBER},
        {"VAR", VAR},
        {0,0}
};

FLEXMODULEINIT(hoclexer, module_tokens);

where NUMBER and VAR are constants and "hoclexer" is the name of the
module.


Using FlexModule in Python:

After importing the module, it gives access to:

 - module.Scanner, a function which returns a scanner object from a 
a function to create tokens and a string.  The function should have
the same arguments as the constructor of Symbols.Token: a numeric
type, the string of the token, and a line number (taken from the flex
scanner, if "%option yylineno" was specified).  The token generated
should have a "type" attibute, which will be used by BisonModule to
get the numeric type of the token.
 - names, a map between numeric types and the string names of the
tokens.  This is created from the TokenValues array.
 - types, a map between string names and numeric types.

Each scanner object retuned by Scanner has three methods:

 - getlineno, which returns the current line number, according to the
scanner.
 - readtoken, which returns the next token from the string, created by
the function passed to the scanner constructor.
 - setstring, which resets the string to be scanned.


Compiling with BisonModule:

The values of the symbols (non-terminal and terminal) are Python
objects; the terminal symbols should have a "type" attribute with
the numerical type of the token, as described above.  These objects
are passed around by the rules to build a parse tree, where each
non-terminal symbol has a list of child symbols.

The grammar file read by bison is fairly normal, but the code associated
with the rules should be fairly limited:

 - The start rule of the grammar should, when reduced, call the
RETURNTREE macro with the value of the top of the tree.  The argument
of RETURNTREE is the symbol object that represents the top of the parse
tree.  For example:

start:          list                    { RETURNTREE($1); }

where start is the initial symbol, sets the parse tree to the list
symbol.

 - A rule can call the REDUCE macro to create a new symbol object.  The
first argument to REDUCE should be a numerical type for the symbol; the
remaining arguments will be added to the new symbol object's children
list.  For example:

list:           /* nothing */           { $$ = REDUCE(LIST); }

(where LIST is an integer constant) creates a LIST symbol with no
children and passes it up the tree.

 - A rule can call the REDUCELEFT macro.  This macro does not create a
new symbol, but adds the second and subsequent arguments to its first
argument's children list.  This macro is useful for left-recursive rules
and for reusing lower level symbols.  For example,

                | list expr '\n'        { REDUCELEFT($1, $2); }

adds an expression to an existing list of expressions.  Also,

expr:           expr '+' expr         { $$ = REDUCELEFT($2, $1, $3); }

reuses the '+' token returned by the scanner, adding the two
subexpressions as its children.

 - There is also a REDUCERIGHT macro, which prepends its arguments,
but it hasn't been tested since I have not used right-recursive rules.

 - A REDUCEERROR macro is available to partially handle syntax errors.
It creates a SYNTAXERROR symbol (whose numerical type is -1), which
can be incorporated into the parse tree like any other symbol.  Python
code can subsequently walk the tree to report syntax errors.  For
example,

                | list error '\n'       { REDUCELEFT($1, REDUCEERROR); }

creates a syntax error symbol and adds it to a list of expressions.

Like FlexModule, a BisonModule needs an array associating numeric
types and strings and a final macro call to set everything up:

static SymbolValues module_symbols[] = {
        {"LIST", LIST},
        {0,0}
};

BISONMODULEINIT(hocgrammar, module_symbols);


Using BisonModule in Python:

Each bison module exports into Python:

 - A ParserError exception object, used when the parser cannot handle
a syntax error in the input.  (In general, for good error handling, I
am given to understand that this should not occur and thus this
exception should not be thrown.)

 - a parse function, which takes a function to create symbols similar
to the Scanner function above and a scanner object.  The function
should match the Symbols.Symbol constructor in takeing a type, a list
of children, and a line number.  The scanner object should have
readtoken and getlineno methods.

 - names and types dictionarys, like FlexModule above.

 - a debug function, which toggles the bison parser's debug flag.


BUGS:

Ok, so I don't really understand Bison error handling.  If someone
could explain it to me (use small words, I'm not too bright) in such a
way as to improve the error handling of BisonModule, I'd appreciate
it.

Neither FlexModule nor BisonModule appears to leak memory when I've
tested it, but the behavior of BisonModule when it throws a
ParserError is not entirely well tested.

FlexModule currently only scans a string passed in to it.  It
should not be difficult to back it out to look at a file, if need be.

FlexModule uses flex's C++ support to create scanner objects.  If only
one scanner is needed, it might be possible to back that out and thus
generate scanners in C.


FURTHER INFORMATION:

For more information, the code, and an example or two, see

http://www.cs.utexas.edu/users/mcguire/software/fbmodule


Tommy M. McGuire
mcguire at cs.utexas.edu



More information about the Python-list mailing list