[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Chuck Bearden cbearden at hal-pc.org
Thu Aug 26 22:00:30 CEST 2004


On Thu, Aug 26, 2004 at 12:38:09PM -0600, Uche Ogbuji wrote:
> On Wed, 2004-08-25 at 14:56, Chuck Bearden wrote:
> > On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote:
> > >
> > > Honestly, I don't think DOM is the way I would personally go about
> > > processing HTML, which is why I was trying to get at whether there was
> > > another way for you to meet your needs.
> > 
> > I think I understand what you are getting at, but personally I have
> > found twisted.web.microdom with 'beExtremelyLenient=True', with perhaps
> > an mx.Tidying stage beforehand, to be invaluable in mining data from
> > database-generated webpages built with crappy HTML.  Consider the pages
> > displaying individual patent records at the USPTO, e.g. [1].  If you 
> > need to treat such pages as if they were XML records to be parsed and
> > loaded into a database, something like twisted.web.microdom is a big 
> > help.
> 
> Is this available without installing all of Twisted?

I confess I just took the easy way out and installed all of Twisted (as
I've mostly done with 4Suite thus far, in order to use the nifty
Domlette :-) ).

I haven't browsed through the dependencies to see which of the other
Twisted pieces microdom requires, so I can't say whether it can be
extricated from the wider framework.

One possibility I didn't try was to use tidy to generate real XHTML from
the crappy HTML.  It might then be possible to use something more
common like the minidom implementation to navigate the HTML.
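
A minimal sketch of that untried route, in case it is useful.  It assumes
the command-line tidy program is on the PATH and that its -asxml and
-numeric options are enough to give minidom well-formed input with no
named entities.  The helper names (tidy_to_xhtml, text_of, table_rows) and
the sample record are made up for illustration; they are not taken from
any real page.

```python
import subprocess
from xml.dom import minidom


def tidy_to_xhtml(html_bytes):
    """Pipe messy HTML through the external `tidy` program (assumed installed)."""
    # -asxml asks tidy for well-formed XHTML; -numeric writes entities as
    # numeric references, since minidom cannot resolve names like &nbsp;
    # without reading the DTD.
    proc = subprocess.run(
        ["tidy", "-q", "-asxml", "-numeric"],
        input=html_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy exits with status 1 when it only issued warnings, so the exit
    # status is deliberately not treated as a failure here.
    return proc.stdout


def text_of(node):
    """Concatenate all text found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def table_rows(xhtml):
    """Return the text of every <td>, grouped by <tr>, from an XHTML document."""
    doc = minidom.parseString(xhtml)
    rows = []
    for tr in doc.getElementsByTagName("tr"):
        rows.append([text_of(td).strip() for td in tr.getElementsByTagName("td")])
    doc.unlink()  # break reference cycles so the tree can be freed promptly
    return rows


# Hypothetical stand-in for what tidy might emit for a record page.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body><table>
<tr><td>Inventor</td><td><b>Doe</b>, Jane</td></tr>
<tr><td>Filed</td><td>January 1, 2000</td></tr>
</table></body></html>"""

if __name__ == "__main__":
    for row in table_rows(SAMPLE):
        print(row)
```

In real use one would call table_rows(tidy_to_xhtml(raw_html)).  Nested
tables would need more careful handling than this, since
getElementsByTagName searches all descendants.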

For me, extracting data from malformed but consistent HTML is a 
necessary task, so I do sometimes have to make some compromises
in my selection and use of tools.

Chuck


