Parsing an SGML document in a Python program

Ilya Shambat ishambat at aol.com
Mon Oct 21 15:41:27 EDT 2002


Eric Brunel <eric.brunel at pragmadev.com> wrote in message news:<aojvaj$348$1 at news-reader10.wanadoo.fr>...
> Ilya Shambat wrote:
> 
> > Hello all,
> > 
> > I need to parse an SGML document in a Python program, and I need to
> > know how to do that. The project involves using a DTD, passed as a
> > command line argument, to read all the SGML files that exist in a
> > directory. Does anybody know how this is done?
> 
> There is an sgmllib module in the standard library, but it's not a full SGML 
> parser. SGML has a lot of funky possibilities that are quite hard to parse 
> and that were apparently not considered in the sgmllib module. I never used 
> it, but as far as I can see from the docs, it doesn't use a DTD, so it's 
> not really an SGML parser (XML seems to live well without a DTD, but doing 
> so in SGML may be considered heretical ;-). It may however be usable if 
> your documents are really simple.
> 
> I once had to do that and I couldn't find a parser directly usable from 
> Python. That may have changed (just check the Vaults of Parnassus for it). 
> The solution I used at the time was to rely on an external parser that gave 
> easy-to-parse results. The one I used was nsgmls, part of James Clark's SP 
> project. You can find it at http://www.jclark.com/sp/ ; just test it and 
> you'll see that its output is really easy to get back into Python.

I have looked at the documentation for nsgmls, and I found it rather
inadequate. Could anyone post an example of how to run it from Python
and read its output? Any help would be greatly appreciated.
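
For reference, here is a minimal sketch of the approach Eric describes:
run nsgmls as an external program and read its line-oriented output back
into Python. It assumes nsgmls is installed and on the PATH, and it
handles only the most common output lines ("(" start tag, ")" end tag,
"-" data, "A" attribute). The function names and the SAMPLE text are
made up for illustration; SAMPLE is hand-written, not captured from a
real run.

```python
import subprocess

OCTAL_DIGITS = "01234567"


def unescape(text):
    """Decode the backslash escapes used in nsgmls data and attribute values."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            octal = text[i + 1:i + 4]
            if nxt == "n":                      # record end
                out.append("\n")
                i += 2
                continue
            if nxt == "\\":                     # literal backslash
                out.append("\\")
                i += 2
                continue
            if nxt == "|":                      # SDATA bracket, dropped in this sketch
                i += 2
                continue
            if len(octal) == 3 and all(c in OCTAL_DIGITS for c in octal):
                out.append(chr(int(octal, 8)))  # \nnn octal character code
                i += 4
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


def parse_esis(lines):
    """Turn nsgmls output lines into a list of start/data/end event tuples."""
    events = []
    pending = {}  # attribute lines come before the start tag they belong to
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            parts = rest.split(" ", 2)
            if len(parts) < 2:
                continue
            value = unescape(parts[2]) if len(parts) > 2 else ""
            pending[parts[0]] = None if parts[1] == "IMPLIED" else value
        elif code == "(":
            events.append(("start", rest, pending))
            pending = {}
        elif code == ")":
            events.append(("end", rest))
        elif code == "-":
            events.append(("data", unescape(rest)))
        # Other line types (processing instructions, the final "C", ...)
        # are ignored in this sketch.
    return events


def run_nsgmls(sysids, executable="nsgmls"):
    """Run nsgmls on the given files and parse whatever it writes to stdout."""
    result = subprocess.run([executable, *sysids],
                            capture_output=True, text=True)
    return parse_esis(result.stdout.splitlines())


# Hand-written illustration of the output format, not real nsgmls output.
SAMPLE = r"""ALABEL CDATA intro
ATYPE IMPLIED
(SECTION
(TITLE
-Hello\nworld
)TITLE
)SECTION
C
"""

events = parse_esis(SAMPLE.splitlines())
```

To cover a whole directory, one could loop over its files and call
run_nsgmls([prolog, path]) for each one, on the assumption that nsgmls
concatenates the files it is given and that the first file (hypothetical
name "prolog" here) supplies the DOCTYPE declaration for the DTD. Error
messages from nsgmls go to stderr and are not inspected in this sketch.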


