[XML-SIG] minidom w/ HTML

Andrew Shearer andrew at shearersoftware.com
Fri Jun 25 00:35:49 EDT 2004


You could use Python's HTMLParser module[1] or my own HTMLFilter 
module[2]. Both present a SAX-like interface that calls back to your 
code as tags fly by, rather than the DOM approach of handing you a 
fully-formed, consistent data structure made from the document.

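The DOM approach is awkward here because typical HTML is not 
well-formed, so a parser has to guess at the intended structure before 
it can build a consistent tree. A SAX-like interface is a more natural 
fit: it reports each tag as it appears and leaves the interpretation to 
your code.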
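
As a rough illustration of the callback style for a table-heavy page, 
here is a minimal sketch using the standard library parser. It assumes 
Python 3, where the HTMLParser module is available as html.parser. The 
names TableCellCollector and parse_table are made up for this example, 
and the rule that a new <tr> or <td> implicitly closes the previous one 
is my own simplification, not something the parser does for you.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of table cells, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Assumption: a new row implicitly ends any unclosed row.
            self._close_row()
            self._row = []
        elif tag in ("td", "th"):
            # Assumption: a new cell implicitly ends any unclosed cell.
            self._close_cell()
            if self._row is None:
                self._row = []   # tolerate a cell with no enclosing <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Text outside of a cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the cell text of every table row found in an HTML string."""
    parser = TableCellCollector()
    parser.feed(html)
    parser.close()        # flush any text still buffered by the parser
    parser._close_row()   # finish a row left open at end of input
    return parser.rows
```

Calling parse_table() on markup with missing </td> and </tr> tags still 
yields one list of cell strings per row, which is the kind of input 
that makes minidom give up. Nested tables and colspan/rowspan are not 
handled in this sketch.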

[1] http://docs.python.org/lib/module-HTMLParser.html
[2] http://www.shearersoftware.com/software/developers/htmlfilter/

> From: jennyw <jennyw at colorfulexpressions.com>
> Message-ID: <cb7co8$2cb$1 at sea.gmane.org>
>
> I have a project where I need to parse html files that are table heavy
> (a calendar, actually), and I thought minidom would be perfect for my
> needs. The problem is that the HTML that I'm trying to parse isn't
> quite valid XML -- mostly minor things, but enough so that minidom
> won't work. Is there a something that would convert an html file into
> XML that would work with minidom? Or is there something better, like
> something more geared towards html that I should be looking at?

--
Andrew Shearer
Senior Analyst, Medical Computing
IS Applications Group
Lifespan



