[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Daniel Veillard veillard at redhat.com
Thu Aug 26 23:19:00 CEST 2004


On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
> Chuck Bearden wrote:
> 
> >[...]
> >I haven't browsed through the dependencies to see what of the other
> >Twisted pieces the microdom requires, so I can't say if it is extricable
> >from the wider framework.
> >
> >One possibility I didn't try was to use tidy to generate real XHTML from
> >the crappy HTML.  It might then be possible to use something more
> >common like the minidom implementation to navigate the HTML.
> >
> >For me, extracting data from malformed but consistent HTML is a 
> >necessary task, so I do sometimes have to make some compromises
> >in my selection and use of tools.
> 
> There are already tools that make sense of broken HTML: browsers.
> 
> Is there any way to reuse that functionality from Python? I.e.
> something like:
> 
> >>> import mozilla
> >>> x = mozilla.parse("http://www.python.org")
> 
> I don't care whether I get a DOM or a string parsable by an
> XML parser.

  The libxml2 HTML parser is available through the libxml2 Python bindings:

  import libxml2

  doc = libxml2.htmlParseFile(URI, None)
  
At that point doc is a DOM tree, just like the one you would get from
parsing XML: you can use XPath, navigate, extract and reserialize.
You may get a bunch of errors and warnings, but you will still get a
tree even if the HTML is really bizarre.

    ctxt = doc.xpathNewContext()
    try:
        res = ctxt.xpathEval("//head/title")
        title = res[0].content
    except Exception:
        # no <title> was found, or the XPath evaluation failed
        title = "Page %s" % (resource)

  is the kind of code I use to index HTML pages and feed an
SQL database for searches on xmlsoft.org. I also do

#
# We are not interested in parsing errors here
#
def callback(ctx, msg):
    return

libxml2.registerErrorHandler(callback, None)

  to ignore all errors and warnings, since I run it as cron batches.
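
As a minimal sketch, the pieces above might be combined as follows. The
function names title_or_fallback and page_title are made up for
illustration and are not part of libxml2. The explicit xpathFreeContext()
and freeDoc() calls are an assumption about how one would release the
underlying C structures in a long-running job, not something stated above.

```python
def title_or_fallback(nodes, resource):
    """Return the text of the first node in `nodes`, or "Page <resource>".

    `nodes` is whatever xpathEval() returned; an empty list means the
    page had no <title> element.
    """
    if nodes:
        return nodes[0].content
    return "Page %s" % (resource,)


def page_title(uri):
    """Parse the HTML at `uri` with libxml2 and return its title."""
    # Imported here so that the helper above stays usable on machines
    # where the libxml2 Python bindings are not installed.
    import libxml2

    doc = libxml2.htmlParseFile(uri, None)
    ctxt = doc.xpathNewContext()
    try:
        return title_or_fallback(ctxt.xpathEval("//head/title"), uri)
    finally:
        # The bindings wrap C structures, so release them explicitly.
        ctxt.xpathFreeContext()
        doc.freeDoc()
```

Calling page_title("index.html") would then return the page title, or
"Page index.html" when no <title> element is found.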

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard at redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

