[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Uche Ogbuji uche.ogbuji at fourthought.com
Fri Aug 27 01:30:21 CEST 2004


On Thu, 2004-08-26 at 15:19, Daniel Veillard wrote:
> On Thu, Aug 26, 2004 at 10:24:38PM +0200, Walter Dörwald wrote:
> > Chuck Bearden wrote:
> > 
> > >[...]
> > >I haven't browsed through the dependencies to see what of the other
> > >Twisted pieces the microdom requires, so I can't say if it is extricable
> > >from the wider framework.
> > >
> > >One possibility I didn't try was to use tidy to generate real XHTML from
> > >the crappy HTML.  It might then be posssible to use something more
> > >common like the minidom implementation to navigate the HTML.
> > >
> > >For me, extracting data from malformed but consistent HTML is a 
> > >necessary task, so I do sometimes have to make some compromises
> > >in my selection and use of tools.
> > 
> > There are already tools that make sense of broken HTML: browsers.
> > 
> > Is there any way to reuse that functionality from Python? I.e.
> > something like:
> > 
> > >>> import mozilla
> > >>> x = mozilla.parse("http://www.python.org")
> > 
> > I don't care whether I get a DOM or a string parsable by an
> > XML parser.
> 
>   libxml2 HTML parser is part of libxml2 Python bindings.
> 
>   import libxml2
> 
>   doc = libxml2.htmlParseFile(URI, None)
>   
> at that point doc is a DOM tree, like you would have if you had
> parsed XML, you can use XPath, navigate, extract and reserialize.
> You may have got a bunch of errors and warning, but you will get a
> tree even if the HTML is really bizarre. 
> 
>     ctxt = doc.xpathNewContext()
>     try:
>         res = ctxt.xpathEval("//head/title")
>         title = res[0].content
>     except:
>         title = "Page %s" % (resource)
> 
>   is the kind of code I use to index HTML pages and feed an
> SQL database for searches on xmlsoft.org. I also do
> 
> #
> # We are not interested in parsing errors here
> #
> def callback(ctx, str):
>     return
> libxml2.registerErrorHandler(callback, None)
> 
>   to ignore all error and warning since I run it as cron batches.

Cool,  but since memory leaks are the genesis of this thread (see the
subject line), are you sure your example above takes all necessary
memory management into account?

I've had a few surprises using examples from libxml2/Python as is, and
finding out that they leaked significantly.  It turns out that there are
required memory management steps omitted from the docs.

And more importantly: are you planning to fix it so that manual memory
management is unnecessary when using libxml2/Python?  I know Martijn
Faasen is working on something along those lines in lxml, but his work
isn't really ready for "prime time" yet.

Thanks.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Meet me at XMLOpen Sept 21-23 2004, Cambridge, UK.  http://xmlopen.org

Practical (Python) SAX Notes - http://www.xml.com/pub/a/2004/08/11/py-xml.html
XML circles the globe - http://www.javareport.com/article.asp?id=9797
Element structures for names and addresses - http://www.ibm.com/developerworks/xml/library/x-elemdes.html
Commentary on "Objects. Encapsulation. XML?" - http://www.adtmag.com/article.asp?id=9090
Harold's Effective XML - http://www.ibm.com/developerworks/xml/library/x-think25.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/



More information about the XML-SIG mailing list