xpathEval fails for large files

Stefan Behnel stefan_ml at behnel.de
Tue Jul 22 17:03:09 EDT 2008


Fredrik Lundh wrote:
> Kanchana wrote:
> 
>> I tried to extract some data with xpathEval. The path matches more
>> than 100,000 elements.
>>
>> doc = libxml2.parseFile("test.xml")
>> ctxt = doc.xpathNewContext()
>> result = ctxt.xpathEval('//src_ref/@editions')
>> doc.freeDoc()
>> ctxt.xpathFreeContext()
>>
>> this gets stuck on the following line and causes very high CPU
>> usage:
>> result = ctxt.xpathEval('//src_ref/@editions')
>>
>> Any suggestions on how to resolve this?
> 
> what happens if you just search for "//src_ref"?  what happens if you
> use libxml's command line tools to do the same search?
> 
>> Is there any better alternative to handle large documents?
> 
> the raw libxml2 API is pretty hopeless; there's a much nicer binding
> called lxml:
> 
>     http://codespeak.net/lxml/
> 
> that won't help if the problem is with libxml2 itself, though

It may still help a bit, as lxml's setup of libxml2 is pretty memory-friendly
and hand-tuned in a lot of places. But it's definitely worth trying both
cElementTree and lxml to see what works better for you. Depending on your
data, this may be fastest in lxml 2.1:

    import lxml.etree

    doc = lxml.etree.parse("test.xml")
    for el in doc.iter("src_ref"):
        attrval = el.get("editions")
        if attrval is not None:
            pass  # do something with the value
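
If the bottleneck really is libxml2's in-memory tree, a streaming approach with
cElementTree's iterparse() keeps memory flat instead. A minimal sketch, with the
file name, element name and attribute name taken from the original post (in
modern Python the same code works with plain xml.etree.ElementTree):

```python
# Streaming sketch with (c)ElementTree's iterparse: elements are
# processed and cleared as the parse proceeds, so memory use stays
# roughly constant even for very large documents.
# "src_ref" and "editions" are the names from the original post.
import xml.etree.ElementTree as ET  # cElementTree in Python 2.x

def editions(source):
    values = []
    for event, el in ET.iterparse(source):
        if el.tag == "src_ref":
            val = el.get("editions")
            if val is not None:
                values.append(val)
            el.clear()  # drop the element's children/text once handled
    return values
```

Whether this beats lxml depends on the document, so it is worth timing both
against the real file.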

Stefan



More information about the Python-list mailing list