Parsing library for Python?
Harry George
harry.g.george at boeing.com
Tue Feb 24 05:53:44 EST 2004
"Edward C. Jones" <edcjones at erols.com> writes:
> Tim Roberts wrote:
> > "Edward C. Jones" <edcjones at erols.com> wrote:
> >
> >> When looking for a parser generator, I think it is important that
> >> full grammars be provided for at least C and Python and preferably
> >> for C++, Java, and FORTRAN.
> > Are you kidding with this? I can't tell.
> > C, C++, and Fortran are parsing nightmares, where end-of-line and
> > spacing are important sometimes and ignored at other times, and so on.
> > I expect to find the canonical desk calculator example, and perhaps a
> > Pascal-based language, but any more than that is asking a bit much from all
> > but the most mature parser generators.
>
> Not kidding. Nothing can be parsed without a grammar. I think parsing
> the standard computer languages is a common need. I am sporadically
> developing software to automatically generate Pyrex code for wrapping
> C libraries in Python. I use ANTLR because it comes with a good C
> grammar.
>
> And then there is HTML. I wonder how Mozilla parses all the ill-formed
> HTML that is on the web.

Yes, things can be parsed without a grammar, or at least without a
conventional CFG. Ad hoc parsers are messy, of course, so we try to
design modern languages so they aren't needed. But I've parsed textual
documents at times with context-sensitive RR(2) approaches and other
oddities.

The point is that FORTRAN predates a clear understanding of
line-independent lexing and context-free grammars (CFGs). It uses
constructs which are not handled by the classic
scanner/lexer/parser/AST tools. I don't know how the pros handle
this, but when I run into a nonstandard grammar, I preprocess the
input to tag it with additional tokens, and then run it through a
standard lexer/parser. Basically a tree re-writer approach.
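
To make the "preprocess, then use a standard lexer" idea concrete, here
is a minimal sketch in Python. It is only an illustration under
simplifying assumptions, not how any production FORTRAN front end works:
it assumes fixed-form source (comment marker in column 1, continuation
marker in column 6, statement text in columns 7-72), it ignores
statement labels, and it does not deal with FORTRAN's insignificant
blanks. The names EOS, tag_statements and tokenize are made up for this
example.

```python
import re

# Hypothetical end-of-statement marker inserted by the preprocessor.
EOS = "<EOS>"

def tag_statements(source):
    """Join fixed-form continuation lines and tag each statement with EOS."""
    statements = []
    for line in source.splitlines():
        # Skip blank lines and comment lines (C, c or * in column 1).
        if not line.strip() or line[0] in "Cc*":
            continue
        # A character other than blank or 0 in column 6 marks a continuation.
        is_continuation = len(line) > 5 and line[5] not in " 0"
        # Statement text lives in columns 7-72; labels in 1-5 are ignored here.
        body = line[6:72].strip()
        if is_continuation and statements:
            statements[-1] += " " + body
        else:
            statements.append(body)
    return "".join(stmt + " " + EOS + "\n" for stmt in statements)

# After tagging, line structure no longer matters, so an ordinary
# regex-based lexer that treats all whitespace alike is enough.
TOKEN_RE = re.compile(r"<EOS>|\d+\.\d*|\d+|[A-Za-z][A-Za-z0-9]*|[-+*/=(),]")

def tokenize(tagged):
    """Return the flat token list, including the explicit EOS markers."""
    return TOKEN_RE.findall(tagged)

source = (
    "C     compute a sum\n"
    "      X = A + B\n"
    "     &    + C\n"
    "      END\n"
)
tokens = tokenize(tag_statements(source))
```

The design choice is that everything line-sensitive is resolved in the
preprocessing pass, so the token stream handed to the parser generator
carries explicit statement boundaries and the grammar itself can stay
context-free.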

C++ is (I think) classically lexable, but the semantics are so complex
that parsing (or understanding what to do with the parse) is a pain.
I wasn't in that business, but I understand C compiler vendors bombed
out trying to just upgrade their C compilers to handle C++ and had to
start fresh with a much richer type model. SWIG also ran into this.

For parsing of "bad HTML", see "tidy". Its lexer/parser is ad hoc
(not generated by a parser toolkit).

--
harry.g.george at boeing.com
6-6M21 BCA CompArch Design Engineering
Phone: (425) 342-0007