"Intro to Pyparsing" Article at ONLamp

Sat Jan 28 10:36:59 EST 2006

"Anton Vredegoor" <anton.vredegoor at gmail.com> wrote in message
news:1138444565.250434.28030 at g47g2000cwa.googlegroups.com...
> I like your article and pyparsing. But since you ask for comments I'll
> give some. For unchanging datafile formats pyparsing seems to be OK.
> But for highly volatile data like videotext pages or maybe some html
> tables one often has the experience of failure after investing some
> time in writing a grammar because the dataformats seem to change
> between the times one uses the script.

There are two types of parsers: design-driven and data-driven.  With
design-driven parsing, you start with a BNF that defines your language or
data format, and then construct the corresponding grammar parser.  As the
design evolves and expands (new features, keywords, additional options), the
parser has to be adjusted to keep up.
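
As a rough sketch of the design-driven case, here is a tiny BNF and the
pyparsing grammar built from it.  The BNF and the names identifier,
assignment, program and parse_program are invented for illustration; they
are not taken from the article.

```python
from pyparsing import Group, Suppress, Word, ZeroOrMore, alphas, alphanums, nums

# BNF (hypothetical, for illustration only):
#   program    ::= assignment (";" assignment)*
#   assignment ::= identifier "=" integer
identifier = Word(alphas, alphanums + "_")
integer = Word(nums)

# Each BNF rule maps to one pyparsing expression.
assignment = Group(identifier + Suppress("=") + integer)
program = assignment + ZeroOrMore(Suppress(";") + assignment)


def parse_program(text):
    """Return a list of [name, value] pairs, failing on trailing garbage."""
    return program.parseString(text, parseAll=True).asList()
```

If the design later adds, say, string values, only the assignment rule
needs to change; the rest of the grammar stays as it is.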

With data-driven parsing, you start with the data to be parsed, and you
have to discern the patterns that structure it.  Data-driven parsing is
where you usually run into exactly the problem you describe: a new data file
arrives containing structures that you had not seen or recognized before,
and the parser breaks.  There are a number of steps you can take to make
your parser less fragile in the face of uncertain data inputs:
- using results names to access parsed tokens, instead of relying on simple
position within an array of tokens
- anticipating features that do not appear in the input data you have, but
that are known to be supported by the format (for example, the grammar
expressions returned by pyparsing's makeHTMLTags helper function accept
arbitrary HTML attributes - this makes for a more robust parser than simply
coding a parser or regexp to match "'<A HREF=' + quotedString")
- accepting case-insensitive inputs
- accepting whitespace between adjacent tokens, but not requiring it -
pyparsing already does this for you
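
The points in the list above can be sketched roughly as follows.  The
"player: name, rating" record format and the helper names parse_record and
find_hrefs are made up for this example; only the pyparsing calls themselves
are real.

```python
from pyparsing import CaselessLiteral, Suppress, Word, alphas, makeHTMLTags, nums

# Hypothetical record format, invented for this sketch: "player: Name, 1234"
record = (
    Suppress(CaselessLiteral("player"))       # accepts "player", "PLAYER", ...
    + Suppress(":")
    + Word(alphas).setResultsName("name")     # access by name, not by position
    + Suppress(",")
    + Word(nums).setResultsName("rating")
)


def parse_record(line):
    """Return (name, rating); whitespace between tokens is skipped by default."""
    result = record.parseString(line)
    return result["name"], int(result["rating"])


# makeHTMLTags returns expressions for the opening and closing tag.  The
# opening-tag expression accepts any attributes, in any order.
a_start, a_end = makeHTMLTags("A")


def find_hrefs(html):
    """Return the href value of every <A> tag that has one."""
    return [tokens["href"] for tokens, start, end in a_start.scanString(html)]
```

Because the tokens are fetched by name, inserting a new field ahead of the
rating would not disturb the code that reads result["rating"].  Likewise,
find_hrefs keeps working when a page starts adding class or target
attributes to its links.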

> For example, I had this
> experience when parsing chess games from videotext pages I grab from my
> videotext enabled TV capture card. Maybe once or twice in a year
> there's a chess page with games on videotext, but videotext chess
> display format always changes slightly in the meantime so I have to
> adapt my script. For such things I've switched back to 'hand' coding
> because it seems to be more flexible.
>

Do these chess games display in PGN format (for instance, "15. Bg5 Rf8 16.
a3 Bd5 17. Re1+ Nde5")? The examples directory that comes with pyparsing
includes a PGN parser (submitted by Alberto Santini).
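
To give an idea of the scale involved, here is a much-reduced sketch for a
move list like the one quoted above.  This is not the PGN parser from the
examples directory; it is a simplified grammar written for this reply, and
it deliberately ignores castling, promotions, comments and game results.

```python
from pyparsing import Group, OneOrMore, Optional, Regex, Suppress, Word, nums

# Move number such as "15." - keep the digits, drop the period.
move_number = Word(nums).setResultsName("number") + Suppress(".")

# Simplified algebraic move: optional piece letter, optional disambiguating
# file or rank, optional capture marker, target square, optional check mark.
san_move = Regex(r"[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8][+#]?")

full_move = Group(
    move_number
    + san_move.setResultsName("white")
    + Optional(san_move.setResultsName("black"))
)
movetext = OneOrMore(full_move)


def parse_moves(text):
    """Return a list of (number, white_move, black_move_or_None) tuples."""
    return [
        (int(m["number"]), m["white"], m.get("black"))
        for m in movetext.parseString(text)
    ]
```

Applied to "15. Bg5 Rf8 16. a3 Bd5 17. Re1+ Nde5", parse_moves gives one
tuple per move number, with the line break in the middle of move 16 making
no difference.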

> What I would like to see, in order to improve on this situation is a
> graphical (tkinter) editor-highlighter in which it would be possible to
> select blocks of text from an (example) page and 'name' this block of
> text and select a grammar which it complies with, in order to assign a
> role to it later. That would be the perfect companion to pyparsing.
>
> At the moment I don't even know if such a thing would be feasible...

There are some commercial parser generator products that work exactly this
way, so I'm sure it's feasible.  I agree that it would be a huge enabler for
creating grammars.

> Thank you for your ONLamp article and for making pyparsing available. I
> had some fun experimenting with it and it gave me some insights in
> parsing grammars.
>

Glad you enjoyed it, thanks for taking the time to reply!

-- Paul