Help with libxml2 (memory leak while processing HTML)

Fri Jan 5 11:38:07 CET 2007

Happy New Year, everyone!

I have a small problem with libxml2. Maybe someone here has run into the
same thing before (I have also posted it to the libxml2 mailing list).

One of our applications is leaking memory. To check whether libxml2 was the
culprit, I modified one of the examples from the libxml2 web page so that it
processes a significant number of HTML files, and I watched the memory
consumption with the top command while it ran.
And... yes! With the isolated example I can see memory consumption grow
without stopping.
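
Rather than watching top by hand, the growth can also be logged from inside the script. The following is only a sketch: it assumes a Unix system (the standard resource module is not available on Windows), and parse_all_files is a hypothetical stand-in for one pass over the HTML directory. Note that ru_maxrss is the peak resident set size, reported in kilobytes on Linux and in bytes on macOS, so it can show growth but never a decrease.

```python
import resource


def peak_rss():
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def log_memory_growth(work, passes):
    """Run work() `passes` times and return the peak RSS seen after each pass."""
    samples = []
    for i in range(passes):
        work()
        samples.append(peak_rss())
        print("pass %d: peak RSS %d" % (i + 1, samples[-1]))
    return samples


def parse_all_files():
    # Hypothetical placeholder: the real script would parse every HTML file here.
    pass


if __name__ == "__main__":
    log_memory_growth(parse_all_files, 5)
```

If the peak keeps climbing from pass to pass while the same files are parsed again, that is consistent with what top shows.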

That is all. I hope someone can give me a hand. If not, keep an eye on
libxml2 and its memory consumption.
If I find the solution myself, I will post it.

Regards, César

Note 1: The callbacks do nothing.
Note 2: I have already tried calling the cleanup functions inside the loop.
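
One way to tell whether moving the cleanup calls into the loop changes anything is to record an allocation counter after every iteration and look at the deltas. The helper below is a sketch, not part of libxml2: track_growth, measure and work are names made up for illustration. In the script, measure could be lambda: libxml2.debugMemory(1), which the code below already prints as a byte count, and work would parse one file.

```python
def track_growth(measure, work, iterations):
    """Call work() repeatedly and return the change in measure() per iteration."""
    deltas = []
    previous = measure()
    for _ in range(iterations):
        work()
        current = measure()
        deltas.append(current - previous)
        previous = current
    return deltas


if __name__ == "__main__":
    # Fake workload that "leaks" 10 items per call, standing in for one parse.
    leaked = []
    deltas = track_growth(lambda: len(leaked), lambda: leaked.extend(range(10)), 3)
    print(deltas)
```

Deltas that stay positive on every iteration would suggest memory is retained per parsed file; deltas that drop to zero after the first iteration would point to one-time setup cost instead.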

[****************************************] The code [****************************************]

#!/usr/bin/python -u
import libxml2

#------------------------------------------------------------------------------

# Memory debug specific
libxml2.debugMemory(1)

#------------------------------------------------------------------------------

# SAX handler: every callback is a no-op except startDocument,
# which prints a progress dot.
class callback:
    def startDocument(self):
        print(".")

    def endDocument(self):
        pass

    def startElement(self, tag, attrs):
        pass

    def endElement(self, tag):
        pass

    def characters(self, data):
        pass

    def warning(self, msg):
        pass

    def error(self, msg):
        pass

    def fatalError(self, msg):
        pass

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

import os
import sys

programName = os.path.basename(sys.argv[0])

if len(sys.argv) != 2:
  print("Use: %s <dir html files>" % programName)
  sys.exit(1)

inputPath = sys.argv[1]

# isdir also rejects a path that exists but is a regular file
if not os.path.isdir(inputPath):
  print("Error: directory does not exist")
  sys.exit(1)

inputFileNames = []
for fichero in os.listdir(inputPath):
  # Keep only names whose extension is exactly .htm or .html
  # (the rfind comparison also accepted names such as "x.htmx")
  if os.path.splitext(fichero)[1] in (".htm", ".html"):
      inputFileNames.append(fichero)

if len(inputFileNames) == 0:
  print("Error: no input files")
  sys.exit(1)

handler = callback()
NUM_ITERS = 5
for i in range(NUM_ITERS):
  for inputFileName in inputFileNames:
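    print(inputFileName)
    # os.path.join supplies the separator that plain concatenation omits
    # when the directory argument has no trailing slash
    inputFilePath = os.path.join(inputPath, inputFileName)
    f = open(inputFilePath)
    data = f.read()
    f.close()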

    ctxt = libxml2.htmlCreatePushParser(handler, "", 0, inputFileName)

    ctxt.htmlParseChunk(data, len(data), 1)
    ctxt = None

# Memory debug specific
libxml2.cleanupParser()
if libxml2.debugMemory(1) == 0:
    print("OK")
else:
    print("Memory leak %d bytes" % (libxml2.debugMemory(1)))
    libxml2.dumpMemory()

# Other cleanup functions
#libxml2.cleanupCharEncodingHandlers()
#libxml2.cleanupEncodingAliases()
#libxml2.cleanupGlobals()
#libxml2.cleanupInputCallbacks()
#libxml2.cleanupOutputCallbacks()
#libxml2.cleanupPredefinedEntities()
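
The file-selection step of the script can also be written with os.path.splitext and os.path.join, which avoids both the rfind arithmetic and a missing path separator when the directory is given without a trailing slash. This is only a sketch: find_html_files is a hypothetical helper name, and unlike the script above it compares extensions case-insensitively and skips directories.

```python
import os


def find_html_files(directory):
    """Return sorted full paths of the .htm/.html files directly inside directory."""
    paths = []
    for name in os.listdir(directory):
        extension = os.path.splitext(name)[1].lower()
        full_path = os.path.join(directory, name)
        # Accept only regular files with an exact .htm or .html extension.
        if extension in (".htm", ".html") and os.path.isfile(full_path):
            paths.append(full_path)
    return sorted(paths)
```

The parsing loop could then iterate over the returned paths directly instead of rebuilding each path from the directory and the file name.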