a little parsing challenge ☺
Terry Reedy
tjreedy at udel.edu
Thu Jul 21 18:37:15 EDT 2011
On 7/21/2011 2:53 PM, Xah Lee wrote:
> had hopes that parser expert would show some proper parser solutions…
> in particular i think such can be expressed in Parsing Expression
> Grammar in just a few lines… but so far no deity came forward to show
> the light. lol
I am not a parser expert, but 20 years ago I wrote a program in C to
analyze C programs for proper fence matching. My motivation was the
often-obscure parser error messages that result from mismatched fences.
I just found the printed copy, along with an article I wrote but never
got published.
Balance.c matches tokens, not characters (and hence can deal with /* and
*/). It properly takes into account allowed nestings. For C, {[]} is
legal, [{}] is not. Ditto for () instead of []. Nothing nests within '',
"", and /* */. (I know some C compilers do nest /* */, but not the ones
I used).
I initially started with a recursive descent parser, but 1) this
hard-coded the rules for one language and made changes difficult, and
2) it made the low-level parsing difficult. So I switched to a
table-driven recursive state/action machine. The tables for your
challenge would be much simpler, as you did not specify any nesting
rules, although they would be needed for html checking.
A key point that simplifies things a bit is that every file is
surrounded by an unwritten BOF-EOF pair. So the machine starts with
having 'seen' BOF and is 'looking' for EOF. So it is always looking to
match *something*.
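As a rough illustration of those two ideas, here is a minimal sketch in
Python (not the original C, and character-level rather than token-level,
so it ignores '', "", and /* */ for brevity): a stack seeded with an
unwritten BOF that expects EOF, plus a small table of which openers are
allowed inside which context, following the C rules stated above ({[]}
legal, [{}] not, ditto for ()). All names here are my own, not from the
original program.

```python
# Which closer each opener expects; the unwritten BOF expects EOF.
CLOSER = {'BOF': 'EOF', '(': ')', '[': ']', '{': '}'}
OPENERS = set(CLOSER) - {'BOF'}
CLOSERS = set(CLOSER.values()) - {'EOF'}

# Allowed nestings: which openers may appear directly inside each
# context.  Per the rules above, [] and () may nest inside {}, but
# {} may not nest inside [] or ().
ALLOWED = {'BOF': {'(', '[', '{'},
           '{':   {'(', '[', '{'},
           '[':   {'(', '['},
           '(':   {'(', '['}}

def check(chars):
    """Yield (kind, left, right) triples for each pairing event:
    'match', 'mismatch', or 'bad-nest'."""
    stack = ['BOF']                      # machine starts having 'seen' BOF
    for ch in list(chars) + ['EOF']:     # ...and is 'looking' for EOF
        if ch in OPENERS:
            top = stack[-1] if stack else 'BOF'
            if ch not in ALLOWED[top]:
                yield 'bad-nest', top, ch
            stack.append(ch)
        elif ch in CLOSERS or ch == 'EOF':
            opener = stack.pop() if stack else 'BOF'
            kind = 'match' if CLOSER[opener] == ch else 'mismatch'
            yield kind, opener, ch
```

Because the machine always has at least BOF on the stack, every closer
(including EOF) has *something* to match against, so there is no
special-casing for stray closers at top level.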
The total program is nearly four pages, but one page is mostly
declarations and command-line processing, and another two pages hold
typedefs, #defines, and tables. The actual main loop is about 25 lines,
and 10 lines of that is error reporting. The output is lines giving the
file name, the rows and columns of the two tokens matched (optional) or
mismatched, and what the two tokens are.
Since this program would be a useful example for my book, both
didactically and practically, I will try to brush up a bit on C and
translate it to Python. I will use the re module for some of the
low-level token parsing, like C multibyte characters. I will then change
to tables for Python and perhaps for your challenge.
The current program assumes ASCII byte input, as it uses an array of
length 128 to classify ASCII characters into 14 classes: 13 special
classes for the matching and 1 'normal' class for everything else. In
Python this could be replaced with a dict 'special' that maps only the
special characters to their token classes, used as
"special.get(char, NORMAL)", so that the thousands of normal characters
are mapped by default to NORMAL without a humongous array.
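Concretely, the dict-based classifier could look like this (the class
names and the exact set of special characters are illustrative guesses,
since the original 13 classes are not listed in the post):

```python
NORMAL = 'NORMAL'

# Only the characters special to fence matching get an entry; every
# other character, including all non-ASCII ones, falls through to
# NORMAL via dict.get's default, with no 128-entry array needed.
special = {'(': 'LPAR', ')': 'RPAR',
           '[': 'LSQB', ']': 'RSQB',
           '{': 'LBRACE', '}': 'RBRACE',
           "'": 'SQUOTE', '"': 'DQUOTE',
           '/': 'SLASH', '*': 'STAR',
           '\\': 'BACKSLASH', '\n': 'NEWLINE'}

def classify(char):
    """Map a character to its token class, defaulting to NORMAL."""
    return special.get(char, NORMAL)
```

This scales to full Unicode input for free, since lookup cost does not
depend on how many "normal" characters exist.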
--
Terry Jan Reedy