[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Mon Aug 23 18:31:11 CEST 2004

On Fri, 2004-08-20 at 00:52, xmlsig at codeweld.com wrote:
> Quoting Uche Ogbuji <uche.ogbuji at fourthought.com>:
> > On Tue, 2004-08-17 at 05:59, xmlsig at codeweld.com wrote:
> > > > I've python 2.3.4 on windows xp with PyXML-0.8.3.win32-py2.3
> > > >
> > > > This code leaks substantially:
> > > >
> > > > from xml.dom.ext.reader.HtmlLib import FromHtml
> > > > import urllib
> > > > from xml.dom import ext
> > > > s = urllib.urlopen( 'http://www.google.com' ).read()
> > > > while True:
> > > >     root = FromHtml( s )
> > > >     ext.ReleaseNode( root )
> > > >
> > > > However, this does not (or only very slightly):
> > > >
> > > > from xml.dom.ext.reader.Sax2 import Reader
> > > > import urllib
> > > > from xml.dom import ext
> > > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read()
> > > > while True:
> > > >     reader = Reader()
> > > >     root = reader.fromString( s )
> > > >     ext.ReleaseNode( root )
> > > >
> > > > Any suggestions?
> > >
> > > Could anybody reproduce the leak?
> > > Any suggestions as to what I am doing wrong?
> >
> > I haven't done much work in HtmlLib since it was rewritten to use
> > sgmlop.  It will take some heavy digging to find the precise memory
> > leak.  What's your overall problem?  Could you use Python 2.3's
> > HTMLParser library instead?
> 
> The overall problem is that the FromHtml call (in this example) allocates some
> 100-200 k per loop that is not freed for the lifetime of the process. The
> leak is bigger when no ReleaseNode call is made.

By "overall problem" I mean: what are you actually trying to achieve?
Since no one has been able to step up to diagnose the memory leak, I'm
looking to see whether there is another solution that would work for
you.
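
For anyone who does want to narrow the leak down, a rough first step is
to count how many objects survive a batch of calls. The sketch below is
only an illustration and makes several assumptions: it uses Python 3
syntax rather than the Python 2.3 of the original report, the name
object_growth is hypothetical, and it only sees container objects
tracked by Python's gc module. Memory held inside a C extension such as
sgmlop would not show up here.

```python
import gc


def object_growth(fn, iterations=50, warmup=5):
    """Return the net change in gc-tracked objects over `iterations` calls.

    `fn` is any zero-argument callable, for example a wrapper around the
    parse-and-release loop body from the original report (hypothetical use).
    """
    # Warm up first so one-time caches are not counted as a leak.
    for _ in range(warmup):
        fn()
    gc.collect()
    before = len(gc.get_objects())

    for _ in range(iterations):
        fn()

    # Collect cycles so only objects that are still reachable are counted.
    gc.collect()
    after = len(gc.get_objects())
    return after - before
```

A count that grows roughly in step with the number of iterations
suggests that Python-level objects are being kept alive somewhere; a
flat count alongside growing process memory would point instead at the
C level.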

> I could of course use other means of extracting information from HTML, but I
> thought there was no need to reinvent the wheel if somebody has already
> written an HTML parser that spits out a DOM.

Honestly, I don't think DOM is the way I would personally go about
processing HTML, which is why I was trying to get at whether there was
another way for you to meet your needs.
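
As an illustration of the non-DOM route, here is a minimal sketch of an
event-driven parser built on the standard library. It assumes Python 3,
where the module is html.parser (in Python 2.3 it is named HTMLParser).
The class LinkCollector and the function extract_links are hypothetical
names, and collecting links is just a stand-in task, since the actual
goal has not been stated in this thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags without building a DOM tree."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href of every anchor in `html_text`, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Because the handler keeps only the values it needs, nothing like a node
tree is retained between documents, so there is no equivalent of
ReleaseNode to call.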

I'm sorry that my workload is so heavy that there is no chance I could
work on figuring out a 4DOM memory leak right now.

Best of luck.

-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com