[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Fri Aug 20 08:52:47 CEST 2004

Quoting Uche Ogbuji <uche.ogbuji at fourthought.com>:
> On Tue, 2004-08-17 at 05:59, xmlsig at codeweld.com wrote:
> > > I have Python 2.3.4 on Windows XP with PyXML-0.8.3.win32-py2.3.
> > >
> > > This code leaks substantially:
> > >
> > > from xml.dom.ext.reader.HtmlLib import FromHtml
> > > import urllib
> > > from xml.dom import ext
> > > s = urllib.urlopen( 'http://www.google.com' ).read()
> > > while True:
> > >     root = FromHtml( s )
> > >     ext.ReleaseNode( root )
> > >
> > > However, this does not (or only very slightly):
> > >
> > > from xml.dom.ext.reader.Sax2 import Reader
> > > import urllib
> > > from xml.dom import ext
> > > s = urllib.urlopen( 'http://www.infoworld.com/rss/reviews.xml' ).read()
> > > while True:
> > >     reader = Reader()
> > >     root = reader.fromString( s )
> > >     ext.ReleaseNode( root )
> > >
> > > Any suggestions?
> >
> > Has anybody been able to reproduce the leak?
> > Any suggestions as to what I am doing wrong?
>
> I haven't done much work in HtmlLib since it was rewritten to use
> sgmlop.  It will take some heavy digging to find the precise memory
> leak.  What's your overall problem?  Could you use Python 2.3's
> HTMLParser library instead?

The overall problem is that the FromHtml call (in this example) allocates some
100-200 KB per loop iteration that is not freed for the lifetime of the
process. The leak is even bigger when ReleaseNode is not called.

I could of course extract the information from the HTML by other means, but I
thought there was no need to reinvent the wheel if somebody has already
written an HTML parser that produces a DOM.
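
For comparison, the HTMLParser route suggested above could look roughly like
the sketch below. It is an assumption-laden illustration, not a drop-in
replacement: it uses the Python 3 module name html.parser (on Python 2.3 the
module is called HTMLParser), the LinkCollector and extract_links names are
made up for the example, and it collects link targets through callbacks
instead of building a DOM tree.

```python
from html.parser import HTMLParser  # Python 3 name; "HTMLParser" module on Python 2.x


class LinkCollector(HTMLParser):
    """Hypothetical example class: gather href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Parse html_text once and return the href values in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Because nothing is kept beyond the list of strings, there is no node tree to
release afterwards; the trade-off is that any DOM-style navigation has to be
rebuilt by hand in the callbacks.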