[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?

Fri Aug 27 09:03:53 CEST 2004

On Thu, Aug 26, 2004 at 05:30:21PM -0600, Uche Ogbuji wrote:
> On Thu, 2004-08-26 at 15:19, Daniel Veillard wrote:
> > > I don't care whether I get a DOM or a string parsable by an
> > > XML parser.
> > 
> >   libxml2 HTML parser is part of libxml2 Python bindings.
> > 
> >   import libxml2
> > 
> >   doc = libxml2.htmlParseFile(URI, None)
> >   
> > at that point doc is a DOM tree, like you would have if you had
> > parsed XML, you can use XPath, navigate, extract and reserialize.
> > You may have got a bunch of errors and warnings, but you will get a
> > tree even if the HTML is really bizarre. 
> > 
> >     ctxt = doc.xpathNewContext()
> >     try:
> >         res = ctxt.xpathEval("//head/title")
> >         title = res[0].content
> >     except:
> >         title = "Page %s" % (resource)
> > 
> >   is the kind of code I use to index HTML pages and feed an
> > SQL database for searches on xmlsoft.org. I also do
> > 
> > #
> > # We are not interested in parsing errors here
> > #
> > def callback(ctx, str):
> >     return
> > libxml2.registerErrorHandler(callback, None)
> > 
> >   to ignore all errors and warnings, since I run it as cron batches.
> 
> Cool,  but since memory leaks are the genesis of this thread (see the
> subject line), are you sure your example above takes all necessary
> memory management into account?

  In libxml2, memory management is done at the document level: once you
are done with a document, free it with doc.freeDoc().
All the examples in the libxml2-python package do this. They also do

import libxml2

# Memory debug specific
libxml2.debugMemory(1)

at startup and

# Memory debug specific
libxml2.cleanupParser()
# debugMemory(1) returns the number of bytes still allocated
leaked = libxml2.debugMemory(1)
if leaked == 0:
    print("OK")
else:
    print("Memory leak %d bytes" % leaked)
    libxml2.dumpMemory()

at the end, to 1/ show that the example does not leak and 2/ show how to
debug leaks.
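
For what it's worth, the "free the document when you are done" rule can be
wrapped so that it also holds when an exception is raised. This is only a
minimal sketch: the helper name freeing and the file name page.html are
made up for illustration and are not part of the bindings; the only thing
the helper relies on is the document's freeDoc() method.

```python
from contextlib import contextmanager


@contextmanager
def freeing(doc):
    """Hand out a libxml2 document and free it when the block exits.

    `freeing` is a hypothetical helper name, not part of the bindings;
    it only assumes that `doc` has a freeDoc() method.
    """
    try:
        yield doc
    finally:
        # One free at the document level releases the whole tree.
        doc.freeDoc()


# Intended use with the real bindings (needs libxml2 installed,
# "page.html" is a placeholder file name):
#
#     import libxml2
#     with freeing(libxml2.htmlParseFile("page.html", None)) as doc:
#         print(doc.serialize())
```

Anything taken from the tree should be copied out as a plain Python value
inside the with block, since the tree is gone once the block exits.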

> I've had a few surprises using examples from libxml2/Python as is, and
> finding out that they leaked significantly.  It turns out that there are
> required memory management steps omitted from the docs.

Usually this just means calling doc.freeDoc() when you are done with the
document.
  We take documentation patches. The fact that allocation is done at
the document level, and that every document needs to be freed, either at
the C or Python level, has been written on the list, in the docs and in
the examples over and over again. Are you subscribed to the mailing list?

> And more importantly: are you planning to fix it so that manual memory
> management is unnecessary when using libxml2/Python?  I know Martijn

  Me? No. Doing reference counting over a document each time you expose
a node, through an XPath query result for example, is just the best way to
*have* memory leaks. I trust a clear general principle far more:
    "allocation is done at the document level"
 and then you have to keep track of the lifetime of your document.
I prefer that to relying on ref counts being kept by every possible
interface that accesses a document and may or may not keep a link to one
of its structures.
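
In practice this means copying what you need out of an XPath result before
the document goes away, instead of holding on to node objects. A rough
sketch of that, where page_title and fallback are names invented for the
illustration and doc is assumed to be a document returned by the libxml2
bindings that the caller still owns and will free:

```python
def page_title(doc, fallback):
    """Return the text of //head/title as a plain string, or `fallback`.

    `page_title` and `fallback` are hypothetical names. `doc` is assumed
    to be a libxml2 document; the caller keeps ownership and is still
    responsible for calling doc.freeDoc() afterwards.
    """
    ctxt = doc.xpathNewContext()
    try:
        res = ctxt.xpathEval("//head/title")
        # Copy the text out now: the node wrapper must not be used
        # after the document has been freed.
        return res[0].content if res else fallback
    finally:
        # The XPath context is allocated separately from the document
        # and has to be freed separately.
        ctxt.xpathFreeContext()
```

The function returns a string rather than a node, so nothing it hands back
depends on the lifetime of the document.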

> Faassen is working on something along those lines in lxml, but his work
> isn't really ready for "prime time" yet.

  That requires a lot of work on top of libxml2 itself. My goal is to
provide Python APIs for the library, not to transmute the library calls
into something they aren't. The library does not refcount, so my Python
binding won't refcount (at least for the C internal objects); the library
uses UTF-8 for all document content, so my Python binding will also use
UTF-8 for all document content. If Martijn wants to write a layer on top,
fine by me, but he will also have to maintain it.

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard at redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/