[XML-SIG] Re: minidom w/ HTML

Fredrik Lundh fredrik at pythonware.com
Fri Jun 25 04:50:27 EDT 2004


Fred L. Drake wrote:

> > Is there something that would convert an HTML file into XML that
> > would work with minidom? Or is there something better, like something
> > more geared towards HTML that I should be looking at?
>
> You could run the HTML through HTML Tidy before parsing it as XML.  This could
> be done using the HTML Tidy command line, or I think someone has built a
> Python interface to Tidy.

some alternatives:

    http://effbot.org/zone/element-tidylib.htm
    (note that elementtree also allows you to use command-line
    versions of tidy to turn HTML into nice XHTML)

    http://www.egenix.com/files/python/mxTidy.html

    http://sourceforge.net/projects/utidylib

here's a short example:

    import urllib
    from elementtree.TidyTools import tidy

    def XHTML(tag): # prepend XHTML namespace
        return "{http://www.w3.org/1999/xhtml}" + tag

    # grab a page and store it in a temporary file
    # (urlretrieve returns the local filename and the response headers)
    filename, headers = urllib.urlretrieve("http://www.python.org")

    # parse the page using the tidy command
    page = tidy(filename)

    # find all images on this page
    for image in page.findall(".//" + XHTML("img")):
        print image.get("src")
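
the XHTML helper matters because tidy's output puts every element in the XHTML namespace, so a bare tag name finds nothing. here's a minimal sketch of that, assuming Python 3 and the standard library's xml.etree.ElementTree (the stdlib descendant of elementtree; TidyTools is not part of it). the SAMPLE string is made up and only stands in for what tidy would return:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def XHTML(tag):  # prepend XHTML namespace
    return "{" + XHTML_NS + "}" + tag

# made-up sample standing in for tidy's XHTML output (an assumption)
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<p><img src="logo.png" alt="logo" /></p>
<div><img src="photo.jpg" alt="photo" /></div>
</body>
</html>"""

page = ET.fromstring(SAMPLE)

# a bare tag name doesn't match, since every element is namespaced
unqualified = page.findall(".//img")

# the namespace-qualified name does
sources = [image.get("src") for image in page.findall(".//" + XHTML("img"))]
print(sources)
```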

for more information on element trees, see:

    http://effbot.org/zone/element-index.htm
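
and if you'd rather stay with minidom, as in the original question, well-formed XHTML should parse there too. a rough sketch, again assuming Python 3 and a made-up SAMPLE string in place of real tidy output:

```python
from xml.dom import minidom

# made-up sample standing in for tidy's XHTML output (an assumption)
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><p><img src="logo.png" alt="logo" /><img src="photo.jpg" alt="photo" /></p></body>
</html>"""

document = minidom.parseString(SAMPLE)

# getElementsByTagName matches the tag name as written in the document,
# so no namespace prefix is needed for unprefixed elements
minidom_sources = [img.getAttribute("src")
                   for img in document.getElementsByTagName("img")]
print(minidom_sources)
```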

</F>





