[XML-SIG] Re: minidom w/ HTML
Fredrik Lundh
fredrik at pythonware.com
Fri Jun 25 04:50:27 EDT 2004
Fred L. Drake wrote:
> > Is there something that would convert an HTML file into XML that
> > would work with minidom? Or is there something better, like something
> > more geared towards HTML, that I should be looking at?
>
> You could run the HTML through HTML Tidy before parsing it as XML. This could
> be done using the HTML Tidy command line, or I think someone has built a
> Python interface to Tidy.
some alternatives:
http://effbot.org/zone/element-tidylib.htm
(note that elementtree also allows you to use command-line
versions of tidy to turn HTML into nice XHTML)
http://www.egenix.com/files/python/mxTidy.html
http://sourceforge.net/projects/utidylib
here's a short example:
    import urllib
    from elementtree.TidyTools import tidy

    def XHTML(tag): # prepend XHTML namespace
        return "{http://www.w3.org/1999/xhtml}" + tag

    # grab a page and store it in a temporary file
    file, message = urllib.urlretrieve("http://www.python.org")

    # parse the page using the tidy command
    page = tidy(file)

    # find all images on this page
    for image in page.findall(".//" + XHTML("img")):
        print image.get("src")
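
the same idea can be sketched using only the standard library, for readers without the elementtree package. this is a rough sketch, not the TidyTools implementation: it assumes Python 3, an HTML Tidy executable named "tidy" on the PATH, and the tidy options shown. the names tidy_to_xhtml, image_sources and SAMPLE are made up for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def XHTML(tag):
    # prepend the XHTML namespace, using ElementTree's {uri}tag notation
    return "{%s}%s" % (XHTML_NS, tag)

def tidy_to_xhtml(path):
    # assumption: an executable named "tidy" is installed and on the PATH.
    # -asxhtml asks for XHTML output, -numeric writes entities as numeric
    # references (so the XML parser needs no DTD), -quiet trims the report.
    # tidy exits with a non-zero status on mere warnings, so the return
    # code is deliberately not treated as an error here.
    result = subprocess.run(
        ["tidy", "-asxhtml", "-numeric", "-quiet", "-utf8", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.stdout  # XHTML as bytes

def image_sources(xhtml):
    # parse the XHTML and collect the src attribute of every img element
    root = ET.fromstring(xhtml)
    return [image.get("src") for image in root.iter(XHTML("img"))]

# hypothetical, already well-formed XHTML, so the parsing step can be
# tried without running tidy
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p><img src="logo.png" alt="logo"/></p>'
    '<div><img src="banner.gif" alt="banner"/></div></body></html>'
)

if __name__ == "__main__":
    for src in image_sources(SAMPLE):
        print(src)
```

with tidy installed, image_sources(tidy_to_xhtml("page.html")) would be the equivalent of the example above, where "page.html" stands for any HTML file saved locally.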
for more information on element trees, see:
http://effbot.org/zone/element-index.htm
</F>