[XML-SIG] lxml iterparse and comments

Stuart McGraw smcg4191 at frii.com
Tue Mar 25 05:19:15 CET 2008


Hello Stefan,

Thanks for your response.

> Stuart McGraw wrote:
> > I am probably missing something elementary (I am new
> > to both xml and lxml), but I am having problems figuring
> > out how to get comments when using lxml's iterparse().
> > When I parse xml with parse() and iterate through the
> > result, I get the comments.  But when I try to do the
> > same thing (approximately, I think) with iterparse,
> > I don't see any comments.
>
> While the comments end up in the tree that iterparse generates, 
> they do not show up in the events. Now that you mention it, I
> actually think that should change. There should be events
> "comment" and "pi" that yield them if requested.

That would be ideal, from my perspective.  It also seems
more consistent with the other interfaces (parse, parse target,
etc.).
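
As a rough sketch of what such "comment" events could look like:
the standard library's xml.etree.ElementTree.iterparse in Python 3.8
and later accepts a "comment" event name, and later lxml releases are
understood to accept the same name.  The sample data and the collect()
helper below are made up for illustration and are not from this thread.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely modelled on the test file in this post.
XML = b"""<Test>
<!-- first listing -->
<entry>text 1</entry>
<!-- Deleted: A1500477 -->
<entry>text 2</entry>
</Test>"""


def collect(source):
    """Stream through source and return (comments, entry_texts)."""
    comments, entries = [], []
    for event, elem in ET.iterparse(source, events=("end", "comment")):
        if event == "comment":
            # For a comment event, elem.text holds the comment's text.
            comments.append(elem.text.strip())
        elif elem.tag == "entry":
            entries.append(elem.text)
            elem.clear()  # drop the entry's content once it has been read
    return comments, entries


comments, entries = collect(io.BytesIO(XML))
print(comments)
print(entries)
```

Handling each entry on its "end" event and then clearing it is one
common way to keep memory use down on a large file, though how much it
helps depends on the parser and on the shape of the document.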

> > I was using the standard Python ElementTree but my
> > understanding is that it doesn't save comments at all.
>
> ElementTree strips comments in the parser, that's right.
>
> > The real file is ~50MB and has about 1M nodes under the
> > root so I have to use iterparse and I also have to process
> > comments, so I would really appreciate a clue about how
> > to do it.  Thanks.
>
> Have you tried the parser target interface? It's a SAX-like
> interface that uses callbacks.
>
> http://codespeak.net/lxml/parsing.html#the-target-parser-interface
>
> http://effbot.org/elementtree/elementtree-xmlparser.htm#the-target-interface

Thanks for pointing that out.  I'd seen it in the docs but
hadn't appreciated that it was relevant.  However, I am
having trouble getting it to work.  Specifically, the test
code below produces the output I expected when run with
cElementTree, but with lxml it is missing the "end" callbacks,
the second "start(entry)" callback, and the resolved entity
text.  Am I doing something wrong?

Test code:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#import xml.etree.cElementTree as ET
import lxml.etree as ET
from cStringIO import StringIO

# XML data...
#=============================================
xmltxt = \
'''<?xml version="1.0" encoding="UTF-8"?>
<!-- Rev 1.06
-->
<!DOCTYPE Test [
<!ELEMENT Test (entry*)>
<!ELEMENT entry (#PCDATA)>
	<!-- Description of <entry> element.
	-->
<!ENTITY ex "an existential entity">
]>
<!-- File created: 2008-02-27 -->
<Test>
<!--  Chronosynclastic Infindibulum Listing -->
<entry>text 1 is &ex;</entry>
<!-- Deleted:  A1500477 -->
<entry>text 2</entry>
</Test>'''
#=============================================

print '\nTargetParser:\n-------------'

try:                   XMLParser = ET.XMLParser
except AttributeError: XMLParser = ET.XMLTreeBuilder

class EchoTarget:
    def comment(self, text):
        print "comment", text
    def start(self, tag, attrib):
        print "start", tag, attrib
    def end(self, tag):
        print "end", tag
    def data(self, data):
        print "data", repr(data)
    def close(self):
        print "close"
        return "closed!"

parser = XMLParser( target = EchoTarget())
result = ET.parse( StringIO (xmltxt), parser)
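
For comparison, here is a Python 3 sketch of the same kind of target
that records callbacks in a list instead of printing them.  It assumes
the standard library's xml.etree.ElementTree, feeds a shortened copy of
the test data as bytes, and RecordingTarget is a name invented for this
illustration.  It only shows the callbacks the test above expected to
see; it does not explain the lxml behaviour reported here.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the test data, kept as bytes so the parser decodes it.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Test [
<!ELEMENT Test (entry*)>
<!ELEMENT entry (#PCDATA)>
<!ENTITY ex "an existential entity">
]>
<Test>
<!-- Deleted:  A1500477 -->
<entry>text 1 is &ex;</entry>
<entry>text 2</entry>
</Test>"""


class RecordingTarget:
    """Parser target that stores every callback as a tuple."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag, dict(attrib)))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, data):
        # Text may arrive in several pieces, e.g. around a resolved entity.
        self.events.append(("data", data))

    def comment(self, text):
        self.events.append(("comment", text))

    def close(self):
        # Whatever close() returns is handed back by parser.close().
        return self.events


parser = ET.XMLParser(target=RecordingTarget())
parser.feed(XML)
events = parser.close()

starts = [e[1] for e in events if e[0] == "start"]
ends = [e[1] for e in events if e[0] == "end"]
text = "".join(e[1] for e in events if e[0] == "data")
comment_texts = [e[1].strip() for e in events if e[0] == "comment"]
```

Collecting tuples rather than printing makes it easier to diff the
callback sequence from one parser against another.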

