4DOM eating all my memory

John J. Lee jjl at pobox.com
Sun Feb 1 19:27:09 EST 2004


ewan <frimn at hotmail.com> writes:

> I'm looping over a set of urls pulled from a database, fetching the
> corresponding webpage, and building a DOM tree for it using
> xml.dom.ext.reader.HtmlLib (then trying to match titles in a web library
> catalogue).

Hmm, if this is open-source and it's more than a quick hack, let me
know when you have it working; I maintain a page on open-source stuff
of this nature (bibliographic and cataloguing).


>  all the trees seem to be kept in memory,
> 
> however, when I get through fifty or so iterations the program has used
> about half my memory and slowed the system to a crawl.
> 
> I tried turning on all the gc debugging flags.  They produce lots of
> output, but it all says 'collectable' - sounds fine to me.

I've never had to resort to this... does it tell you what types /
classes are involved?  IIRC, there was some code posted to python-dev
to give hints about this (though I guess that was mostly/always for
debugging leaks at the C level).
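
FWIW, you can get a crude version of that without the python-dev code,
using nothing but the stdlib gc module: take a census of live objects
by type before and after one iteration and print the difference.  A
quick sketch (untested):

  import gc

  def type_census():
      # Count live objects by type name.  gc.get_objects() only
      # reports container objects the cyclic collector tracks, but
      # class instances (DOM nodes included) are among them.  With
      # old-style classes everything shows up as 'instance'; use
      # obj.__class__.__name__ if you need to split those out.
      counts = {}
      for obj in gc.get_objects():
          name = type(obj).__name__
          counts[name] = counts.get(name, 0) + 1
      return counts

  before = type_census()
  # ... run one iteration of the URL loop here ...
  after = type_census()
  for name in after:
      delta = after[name] - before.get(name, 0)
      if delta:
          print("%s %+d" % (name, delta))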


> I even tried doing gc.collect() at the end of every iteration.  Nothing.
> Everything seems to be getting collected.  So why does each iteration
> increase the memory usage by several megabytes?
> 
> below is some code (and by the way, do I have those 'global's in the right
> places?)

Yes, they're in the right places.  Not sure a global is really needed,
though...
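
If nothing outside the method needs the tree, something like this
(just a sketch) would do, with root as a plain local:

  def find(self, title, uri):
      reader = HtmlLib.Reader()
      root = reader.fromUri(uri)  # a local binding is enough here
      # ... match titles against the tree ...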


> any suggestions would be appreciated immeasurably...
[...]
>   def find(self, title, uri):
>     global root
>     
>     reader = HtmlLib.Reader()
>     root = reader.fromUri(uri)
> 
>     # find what we're looking for
>     ...

+     reader.releaseNode(root)

?
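
IIUC, a 4DOM tree is full of parent<->child reference cycles, so the
nodes only go away when the cyclic collector gets around to them;
releaseNode() breaks the cycles explicitly, so plain reference
counting can reclaim the tree at once.  Untested, but I'd do the
release in a try/finally, so it happens even if the matching code
raises:

  from xml.dom.ext.reader import HtmlLib

  def find(self, title, uri):
      reader = HtmlLib.Reader()
      root = reader.fromUri(uri)
      try:
          # ... match titles against the tree ...
          pass
      finally:
          # break the parent/child cycles so the tree
          # can be freed straight away
          reader.releaseNode(root)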


John


