a little parsing challenge ☺
Terry Reedy
tjreedy at udel.edu
Thu Jul 21 18:37:15 EDT 2011
On 7/21/2011 2:53 PM, Xah Lee wrote:
> had hopes that parser expert would show some proper parser solutions…
> in particular i think such can be expressed in Parsing Expression
> Grammar in just a few lines… but so far no deity came forward to show
> the light. lol
I am not a parser expert, but 20 years ago I wrote a program in C to
analyze C programs for proper fence matching. My motivation was the
often-obscure parser error messages that result from mismatched fences.
I just found the printed copy, along with an article I wrote but never
got published.
Balance.c matches tokens, not characters (and hence can deal with /* and
*/). It properly takes into account allowed nestings. For C, {[]} is
legal, [{}] is not. Ditto for () instead of []. Nothing nests within '',
"", and /* */. (I know some C compilers do nest /* */, but not the ones
I used).
I initially started with a recursive descent parser, but 1) this
hard-coded the rules for one language and made changes difficult, and
2) it made the low-level parsing difficult. So I switched to a
table-driven recursive state/action machine. The tables for your
challenge would be much simpler, as you did not specify any nesting
rules, although they would be needed for html checking.
A key point that simplifies things a bit is that every file is
surrounded by an unwritten BOF-EOF pair. So the machine starts with
having 'seen' BOF and is 'looking' for EOF. So it is always looking to
match *something*.
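As a rough illustration of those two ideas, here is a minimal sketch in
Python (not the original C, and character-level rather than token-level,
so it ignores '', "", and /* */ for brevity): a stack seeded with an
unwritten BOF that expects EOF, plus a small table of which openers are
allowed inside which context, following the C rules stated above ({[]}
legal, [{}] not, ditto for ()). All names here are my own, not from the
original program.

```python
# Which closer each opener expects; the unwritten BOF expects EOF.
CLOSER = {'BOF': 'EOF', '(': ')', '[': ']', '{': '}'}
OPENERS = set(CLOSER) - {'BOF'}
CLOSERS = set(CLOSER.values()) - {'EOF'}

# Allowed nestings: which openers may appear directly inside each
# context.  Per the rules above, [] and () may nest inside {}, but
# {} may not nest inside [] or ().
ALLOWED = {'BOF': {'(', '[', '{'},
           '{':   {'(', '[', '{'},
           '[':   {'(', '['},
           '(':   {'(', '['}}

def check(chars):
    """Yield (kind, left, right) triples for each pairing event:
    'match', 'mismatch', or 'bad-nest'."""
    stack = ['BOF']                      # machine starts having 'seen' BOF
    for ch in list(chars) + ['EOF']:     # ...and is 'looking' for EOF
        if ch in OPENERS:
            top = stack[-1] if stack else 'BOF'
            if ch not in ALLOWED[top]:
                yield 'bad-nest', top, ch
            stack.append(ch)
        elif ch in CLOSERS or ch == 'EOF':
            opener = stack.pop() if stack else 'BOF'
            kind = 'match' if CLOSER[opener] == ch else 'mismatch'
            yield kind, opener, ch
```

Because the machine always has at least BOF on the stack, every closer
(including EOF) has *something* to match against, so there is no
special-casing for stray closers at top level.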
The total program is nearly four pages, but one page is mostly
declarations and command-line processing, and another two pages hold
typedefs, #defines, and tables. The actual main loop is about 25 lines,
and 10 lines of that is error reporting. The output is lines giving the
file name, the rows and columns of the two tokens matched (optional) or
mismatched, and what the two tokens are.
Since this program would be a useful example for my book, both
didactically and practically, I will try to brush up a bit on C and
translate it to Python. I will use the re module for some of the
low-level token parsing, like C multibyte characters. I will then change
to tables for Python and perhaps for your challenge.
The current program assumes ASCII byte input, as it uses an array of
length 128 to classify ASCII characters into 14 classes: 13 special
classes for the matching and 1 'normal' class for everything else. In
Python this could be replaced with a dict 'special' that maps only the
special characters to their token classes, used as
"special.get(char, NORMAL)", so that the thousands of normal characters
are mapped by default to NORMAL without a humongous array.
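Concretely, the dict-based classifier could look like this (the class
names and the exact set of special characters are illustrative guesses,
since the original 13 classes are not listed in the post):

```python
NORMAL = 'NORMAL'

# Only the characters special to fence matching get an entry; every
# other character, including all non-ASCII ones, falls through to
# NORMAL via dict.get's default, with no 128-entry array needed.
special = {'(': 'LPAR', ')': 'RPAR',
           '[': 'LSQB', ']': 'RSQB',
           '{': 'LBRACE', '}': 'RBRACE',
           "'": 'SQUOTE', '"': 'DQUOTE',
           '/': 'SLASH', '*': 'STAR',
           '\\': 'BACKSLASH', '\n': 'NEWLINE'}

def classify(char):
    """Map a character to its token class, defaulting to NORMAL."""
    return special.get(char, NORMAL)
```

This scales to full Unicode input for free, since lookup cost does not
depend on how many "normal" characters exist.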
--
Terry Jan Reedy