Parsing library for Python?

Harry George harry.g.george at boeing.com
Tue Feb 24 05:53:44 EST 2004


"Edward C. Jones" <edcjones at erols.com> writes:

> Tim Roberts wrote:
> > "Edward C. Jones" <edcjones at erols.com> wrote:
> >
> >> When looking for a parser generator, I think it is important that
> >> full grammars be provided for at least C and Python and preferably
> >> for C++, Java, and FORTRAN.
> > Are you kidding with this?  I can't tell.
> > C, C++, and Fortran are parsing nightmares, where end-of-line and
> > spacing are important sometimes and ignored at other times, and so on.
> > I expect to find the canonical desk calculator example, and perhaps a
> > Pascal-based language, but any more than that is asking a bit much from all
> > but the most mature parser generators.
> 
> Not kidding. Nothing can be parsed without a grammar. I think parsing
> the standard computer languages is a common need. I am sporadically
> developing software to automatically generate Pyrex code for wrapping
> C libraries in Python. I use ANTLR because it comes with a good C
> grammar.
> 
> And then there is HTML. I wonder how Mozilla parses all the ill-formed
> HTML that is on the web.


Yes, things can be parsed without a grammar, or at least without a
conventional CFG.  Ad hoc parsers are messy, of course, so we try to
avoid needing them in modern languages.  But I've parsed textual
documents at times with context-sensitive RR(2) approaches and other
oddities.

The point is that FORTRAN predates a clear understanding of
line-independent lexing and context-free grammars (CFGs).  It uses
constructs which are not handled by the classic
scanner/lexer/parser/AST tools.  I don't know how the pros handle
this, but when I run into a nonstandard grammar, I preprocess the
input to tag it with additional tokens, and then run it through a
standard lexer/parser.  Basically a tree re-writer approach.
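
To make the preprocess-and-tag idea concrete, here is a minimal sketch
in Python.  It is not anyone's production approach, just an
illustration.  It takes the well-known fixed-form FORTRAN ambiguity
between "DO 10 I = 1,10" (a loop) and "DO 10 I = 1.10" (an assignment
to a variable named DO10I, since blanks are insignificant) and
prefixes each statement with an explicit marker so that an ordinary
lexer no longer has to guess.  The tag names "@DO" and "@ASSIGN" and
the function names are made up for this sketch, and it ignores string
literals, Hollerith constants, and continuation lines.

```python
import re

# Hypothetical marker tokens, invented for this sketch.
DO_TAG = "@DO"
ASSIGN_TAG = "@ASSIGN"

# With blanks removed, a DO loop and an assignment to a variable whose
# name starts with "DO<digits>" look identical up to the "=".
_DO_SHAPE = re.compile(r"^DO\d+[A-Z][A-Z0-9]*=", re.IGNORECASE)


def has_top_level_comma(text):
    """Return True if text contains a comma outside any parentheses."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return True
    return False


def tag_statement(stmt):
    """Prefix an ambiguous statement with a tag a standard lexer can key on."""
    squeezed = stmt.replace(" ", "")  # blanks carry no meaning in fixed form
    match = _DO_SHAPE.match(squeezed)
    if match is None:
        return stmt.strip()  # not ambiguous in this sketch; pass through
    if has_top_level_comma(squeezed[match.end():]):
        # A comma at parenthesis depth zero after "=" means loop bounds.
        return f"{DO_TAG} {stmt.strip()}"
    # Otherwise it is an assignment to a variable such as DO10I.
    return f"{ASSIGN_TAG} {squeezed}"


if __name__ == "__main__":
    for line in ("      DO 10 I = 1,10", "      DO 10 I = 1.10"):
        print(tag_statement(line))
```

The tagged output can then go to a conventional lexer that treats
"@DO" and "@ASSIGN" as ordinary keywords, which keeps the
context-sensitive decision out of the grammar itself.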

C++ is (I think) classically lexable, but the semantics are so complex
that parsing (or understanding what to do with the parse) is a pain.
I wasn't in that business, but I understand C compiler vendors bombed
out trying to just upgrade C compilers and had to start fresh with a
much richer type model.  SWIG also ran into this.

For parsing of "bad HTML", see "tidy".  Its lexer/parser is ad hoc
(not generated by a parser toolkit).
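
As a small illustration of the same style from within Python (this is
not tidy, and says nothing about how tidy or Mozilla work internally),
the standard library's html.parser module is also a hand-written
parser that tolerates ill-formed markup.  The sketch below just logs
the events it reports for a fragment with unclosed tags; the class
name EventLogger and the helper parse_events are made up for the
example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tags, end tags, and text in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(text):
    """Feed text to the parser and return the list of recorded events."""
    parser = EventLogger()
    parser.feed(text)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Neither <p> nor <b> is closed properly; the parser does not complain.
    for event in parse_events("<p>one<p>two<b>bold</p>"):
        print(event)
```

Note that the parser only reports what it sees.  Deciding that the
second <p> implicitly closes the first, or that <b> needs closing, is
left to the caller, which is the repair work a tool like tidy does.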


-- 
harry.g.george at boeing.com
6-6M21 BCA CompArch Design Engineering
Phone: (425) 342-0007


