"Intro to Pyparsing" Article at ONLamp

Sun Jan 29 10:23:55 EST 2006

Paul McGuire wrote:

> There are two types of parsers: design-driven and data-driven.  With
> design-driven parsing, you start with a BNF that defines your language or
> data format, and then construct the corresponding grammar parser.  As the
> design evolves and expands (new features, keywords, additional options), the
> parser has to be adjusted to keep up.
>
> With data-driven parsing, you are starting with data to be parsed, and you
> have to discern the patterns that structure this data.  Data-driven parsing
> usually shows this exact phenomenon that you describe, that new structures
> that were not seen or recognized before arrive in new data files, and the
> parser breaks.  There are a number of steps you can take to make your parser
> less fragile in the face of uncertain data inputs:
> - using results names to access parsed tokens, instead of relying on simple
> position within an array of tokens
> - anticipating features that are not shown in the input data, but that are
> known to be supported (for example, the grammar expressions returned by
> pyparsing's makeHTMLTags method support arbitrary HTML attributes - this
> creates a more robust parser than simply coding a parser or regexp to match
> "'<A HREF=' + quotedString")
> - accepting case-insensitive inputs
> - accepting whitespace between adjacent tokens, but not requiring it -
> pyparsing already does this for you
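
Three of the points in the list above (results names, case-insensitive
keywords, optional whitespace) can be sketched in a few lines of pyparsing.
The "SET <name> = <integer>" format and the names `assignment`, `name` and
`value` are made up for illustration only; they are not taken from the
article or from pyparsing's examples directory.

```python
from pyparsing import CaselessKeyword, Suppress, Word, alphas, nums

# Hypothetical mini-format, for illustration only: SET <name> = <integer>
set_kw = CaselessKeyword("set")      # matches SET, set, Set, ...
name = Word(alphas)("name")          # results name "name"
value = Word(nums)("value")          # results name "value"
assignment = Suppress(set_kw) + name + Suppress("=") + value

# Whitespace around "=" is accepted but not required.
for line in ["SET depth=12", "set  depth = 12", "Set depth= 12"]:
    result = assignment.parseString(line)
    # Tokens are looked up by name, not by position in the token list.
    print(result["name"], result["value"])
```

Because the tokens are fetched by name, adding another optional element to
the grammar later should not shift the lookups the way positional indexing
would.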

I'd like to add another parser type; let's call it a natural language
parser. Here we have to adapt quickly to human typing errors or
problems with the transmission channel. I think videotext pages offer
both kinds of challenges, so they could provide good training material. Of
course, in such circumstances it hardly seems possible for a
computer alone to produce a correct parse. Sometimes I even have to
start up a chess program to inspect a game after parsing it into a PGN
file and correct unlikely or impossible move sequences. So since we're
now into human-assisted parsing anyway, would the most gain be made in
further improving the user interface?
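
As a rough sketch of what the machine could contribute to human-assisted
parsing, the snippet below flags tokens that do not even have the shape of
a move in English algebraic notation, so that a person only has to look at
those. It uses the standard `re` module rather than pyparsing, it checks
syntax only (not whether a move is legal in the position), and the names
`SAN_MOVE` and `suspect_tokens` are my own invention.

```python
import re

# Syntactic shape of one move in English algebraic notation (SAN).
SAN_MOVE = re.compile(
    r"(?:O-O(?:-O)?"                                   # castling
    r"|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)"  # piece or pawn move
    r"[+#]?"                                           # check or mate marker
)
MOVE_NUMBER = re.compile(r"\d+\.(?:\.\.)?")            # "15." or "15..."
GAME_RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def suspect_tokens(movetext):
    """Return (position, token) pairs that a human should look at."""
    suspects = []
    for position, token in enumerate(movetext.split()):
        if token in GAME_RESULTS or MOVE_NUMBER.fullmatch(token):
            continue
        if not SAN_MOVE.fullmatch(token):
            suspects.append((position, token))
    return suspects


# "Bd9" and "Hde5" stand in for transmission or typing errors.
print(suspect_tokens("15. Bg5 Rf8 16. a3 Bd9 17. Re1+ Hde5"))
```

Catching unlikely or impossible move sequences would still need a chess
program or a move generator; this only narrows down where to look.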

> > For example, I had this
> > experience when parsing chess games from videotext pages I grab from my
> > videotext enabled TV capture card. Maybe once or twice in a year
> > there's a chess page with games on videotext, but videotext chess
> > display format always changes slightly in the meantime so I have to
> > adapt my script. For such things I've switched back to 'hand' coding
> > because it seems to be more flexible.
> >
>
> Do these chess games display in PGN format (for instance, "15. Bg5 Rf8 16.
> a3 Bd5 17. Re1+ Nde5")? The examples directory that comes with pyparsing
> includes a PGN parser (submitted by Alberto Santini).

Ah, now I remember, I think this was what got me started on pyparsing
some time ago. The Dutch videotext pages are online too (and there's a
game today):

http://teletekst.nos.nl/tekst/683-01.html

But as I said, there can be transmission errors and human errors. Also,
Dutch notation is used: for example, L is B (bishop), P is N (knight),
D is Q (queen), and T is R (rook). I'd be interested in a parser that
could make inferences about chess games and use them to correct these pages!
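
The notation part, at least, is mechanical. Here is a minimal sketch,
assuming the input is bare move text with no comments or player names in
it. In Dutch notation L (loper) is the bishop, P (paard) the knight, D
(dame) the queen and T (toren) the rook; K (koning) is the king in both
notations. The function name `dutch_to_english` is just something I made up.

```python
# Dutch piece letters mapped to their English (PGN) equivalents.
# K is the same in both notations, so it is left alone.
DUTCH_TO_ENGLISH = str.maketrans("LPDT", "BNQR")


def dutch_to_english(movetext):
    """Translate piece letters in bare move text.

    Files (a-h) are lowercase, so within move text the uppercase letters
    L, P, D and T can only be piece letters. Free text such as comments
    or names would be mangled, hence the assumption above.
    """
    return movetext.translate(DUTCH_TO_ENGLISH)


print(dutch_to_english("15. Lg5 Tf8 16. a3 Ld5 17. Te1+ Pde5"))
```

This does nothing about transmission or typing errors; it only turns the
piece letters into the ones a PGN parser expects.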

> > What I would like to see, in order to improve on this situation is a
> > graphical (tkinter) editor-highlighter in which it would be possible to
> > select blocks of text from an (example) page and 'name' this block of
> > text and select a grammar which it complies with, in order to assign a
> > role to it later. That would be the perfect companion to pyparsing.
> >
> > At the moment I don't even know if such a thing would be feasible...
>
> There are some commercial parser generator products that work exactly this
> way, so I'm sure it's feasible.  Yes, this would be a huge enabler for
> creating grammars.

And pave the way for a natural language parser. Maybe there's even some
(sketchy) path now to link computer languages and natural languages. In
my mind Python has always been closer to human languages than other
programming languages. From what I have learned, language
recognition is the easy part; language production is what is hard. But
even the easy part has a long way to go. We are also using a
*visual* interface for something that ultimately originates from sound
sequences (even what I type here is essentially a representation of a
verbal report), so a difficult switch back to auditory
parsing still lies ahead of us.

But in the meantime the tools produced (even if only for text parsing)
are already useful and entertaining. Keep up the good work.

Anton.