[XML-SIG] PyXML Question

Alexandre Fayolle Alexandre.Fayolle@logilab.fr
Fri, 5 Oct 2001 11:32:18 +0200 (CEST)


On Fri, 5 Oct 2001, David Moor wrote:

> Are there any 'Introduction to PyXML' documents, describing the different
> parts and giving examples?  I have looked in the xml-howto.txt in /xmldocs,
> the section I think I need is 4.5 Processing HTML, which contains 'Intro to
> HTML builder' :)

The first thing you may want to note is that it is generally difficult to
map html to xml, and even harder to extract information from the resulting
xml. The reason for this is that html is too often used for presentation,
meaning that you get tons of nested tables in a typical html document,
quite often with badly nested elements, or misquoted attributes. 

This said, let's get into solving your problem:

The official way of creating a DOM tree is by using a reader class, such
as xml.dom.ext.reader.Sax2.Reader. If what you want is to process html,
you'll want to use xml.dom.ext.reader.HtmlLib.Reader.

The first thing you want to do is build a new reader:
from xml.dom.ext.reader.HtmlLib import Reader
r = Reader()

Then you can use the reader to parse the tree. A reader has 3 methods to
achieve this: fromString, fromUri and fromStream (which does the real work
for the other 2). fromString takes a string representation of the
document, fromUri takes a URL or URI string pointing to the document, and
fromStream takes a File-like object. All three methods return a Document.

doc = r.fromUri('http://www.logilab.org/')
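
To make the relationship between the three entry points concrete, here is
a minimal sketch of the delegation described above. It is not PyXML's
code: SketchReader is a hypothetical class, and it assumes the standard
library's xml.dom.minidom as a stand-in parser, which (unlike the HtmlLib
reader) only accepts well-formed markup.

```python
import io
import urllib.request
import xml.dom.minidom


class SketchReader:
    """Hypothetical reader; mimics the fromString/fromUri/fromStream shape only."""

    def fromStream(self, stream):
        # Does the real work: parse a file-like object into a DOM Document.
        return xml.dom.minidom.parse(stream)

    def fromString(self, text):
        # Wrap the string in a file-like object and delegate to fromStream.
        return self.fromStream(io.BytesIO(text.encode("utf-8")))

    def fromUri(self, uri):
        # Open the URI (the response is file-like) and delegate to fromStream.
        with urllib.request.urlopen(uri) as stream:
            return self.fromStream(stream)


r = SketchReader()
doc = r.fromString("<html><body><p>hello</p></body></html>")
print(doc.documentElement.tagName)
```

Whichever method you call, you end up with the same kind of Document
object, so the rest of your code does not need to care where the markup
came from.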

This was the easy part. Now you still have to figure out where the
information you need is. There is no generic method for this; it all
depends on the document you're processing. I suggest you take a
good look at the DOM Traversal API from the W3C site, and at XPath, both
of which can be nice tools for this kind of task.
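
As a rough illustration of what digging information out of a Document
looks like, the sketch below collects every link and its text using only
core DOM calls (getElementsByTagName, getAttribute, childNodes). It
assumes the standard library's xml.dom.minidom rather than the HtmlLib
reader, the sample markup and the text_of helper are made up for the
example, and tag-name case may differ in a tree built by an HTML reader.

```python
import xml.dom.minidom

# Well-formed sample markup, invented for this example.
SAMPLE = """<html><body>
<table><tr><td><a href="http://www.logilab.org/">Logilab</a></td></tr></table>
<p>See also <a href="http://www.w3.org/">W3C</a>.</p>
</body></html>"""


def text_of(node):
    """Concatenate all text nodes found below `node` (hypothetical helper)."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


doc = xml.dom.minidom.parseString(SAMPLE)

# Find every <a> element, however deeply it is nested in tables.
links = [(a.getAttribute("href"), text_of(a))
         for a in doc.getElementsByTagName("a")]
print(links)
```

For real-world pages the hard part is deciding which elements to look
for; the traversal itself stays about this simple.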

Cheers,

Alexandre Fayolle
-- 
LOGILAB, Paris (France).
http://www.logilab.com   http://www.logilab.fr  http://www.logilab.org
Narval, the first software agent available as free software (GPL).