[XML-SIG] lxml iterparse and comments
Stuart McGraw
smcg4191 at frii.com
Tue Mar 25 05:19:15 CET 2008
Hello Stefan,
Thanks for your response.
> Stuart McGraw wrote:
> > I am probably missing something elementary (I am new
> > to both xml and lxml), but I am having problems figuring
> > out how to get comments when using lxml's iterparse().
> > When I parse xml with parse() and iterate through the
> > result, I get the comments. But when I try to do the
> > same thing (approximately I think) with iterparse,
> > I don't see any comments.
>
> While the comments end up in the tree that iterparse generates,
> they do not show up in the events. Now that you mention it, I
> actually think that should change. There should be events
> "comment" and "pi" that yield them if requested.
That would be ideal, from my perspective. It also seems
more consistent with the other interfaces (parse, parse target,
etc.).
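A minimal sketch of what requesting comment events could look like from the caller's side. It is written for Python 3 and uses the standard-library xml.etree.ElementTree, whose iterparse() accepts a "comment" event from Python 3.8 on; whether a given lxml release accepts the same event name is an assumption to check against its documentation. The sample data and the helper name iter_comments_and_entries are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib parser, Python 3.8+

# Illustrative data only: a cut-down version of the kind of file discussed here.
SAMPLE = b"""<Test>
<!-- first listing -->
<entry>text 1</entry>
<!-- Deleted: A1500477 -->
<entry>text 2</entry>
</Test>"""


def iter_comments_and_entries(source):
    """Yield ("comment", text) and ("entry", text) pairs in document order."""
    for event, elem in ET.iterparse(source, events=("end", "comment")):
        if event == "comment":
            # A comment event carries a Comment element; its text is the comment body.
            yield ("comment", elem.text.strip())
        elif elem.tag == "entry":
            yield ("entry", elem.text)
            # Release the entry's content once handled, to keep memory use down.
            elem.clear()


events = list(iter_comments_and_entries(io.BytesIO(SAMPLE)))
for kind, text in events:
    print(kind, text)
```

Clearing each entry after it has been handled is what keeps this workable on a large file; the comments themselves are small and are left in place.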
> > I was using the standard Python ElementTree but my
> > understanding is that it doesn't save comments at all.
>
> ElementTree strips comments in the parser, that's right.
>
> > The real file is ~50MB and has about 1M nodes under the
> > root so I have to use iterparse and I also have to process
> > comments, so I would really appreciate a clue about how
> > to do it. Thanks.
>
> Have you tried the parser target interface? It's a SAX-like
> interface that uses callbacks.
>
> http://codespeak.net/lxml/parsing.html#the-target-parser-interface
>
> http://effbot.org/elementtree/elementtree-xmlparser.htm#the-target-interface
Thanks for pointing that out. I'd seen it in the docs but
hadn't appreciated that it was relevant. However, I am
having trouble getting it to work. Specifically, the test
code below produces the output I expected when run with
cElementTree, but with lxml, it is missing "end" callbacks,
the second "start(entry)" callback, and the resolved entity
text. Am I doing something wrong?
Test code:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#import xml.etree.cElementTree as ET
import lxml.etree as ET
from cStringIO import StringIO
# XML data...
#=============================================
xmltxt = \
'''<?xml version="1.0" encoding="UTF-8"?>
<!-- Rev 1.06
-->
<!DOCTYPE Test [
<!ELEMENT Test (entry*)>
<!ELEMENT entry (#PCDATA)>
<!-- Description of <entry> element.
-->
<!ENTITY ex "an existential entity">
]>
<!-- File created: 2008-02-27 -->
<Test>
<!-- Chronosynclastic Infindibulum Listing -->
<entry>text 1 is &ex;</entry>
<!-- Deleted: A1500477 -->
<entry>text 2</entry>
</Test>'''
#=============================================
print '\nTargetParser:\n-------------'
try: XMLParser = ET.XMLParser
except AttributeError: XMLParser = ET.XMLTreeBuilder
class EchoTarget:
    def comment(self, tag):
        print "comment", tag
    def start(self, tag, attrib):
        print "start", tag, attrib
    def end(self, tag):
        print "end", tag
    def data(self, data):
        print "data", repr(data)
    def close(self):
        print "close"
        return "closed!"
parser = XMLParser( target = EchoTarget())
result = ET.parse( StringIO (xmltxt), parser)
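
To compare the callback sequences of two parser implementations, it may be easier to record the calls than to print them. The following is only a sketch under stated assumptions: it is written for Python 3 rather than the Python 2 of the test code, it runs against the standard-library xml.etree.ElementTree, RecordingTarget is a hypothetical name, and the XML is a shortened version of the test data. Swapping the import for lxml.etree is an assumption that would need checking, not something this sketch demonstrates.

```python
import xml.etree.ElementTree as ET  # assumption: stdlib parser, Python 3

# Shortened version of the test data: one entity, one comment, two entries.
XML_BYTES = b"""<!DOCTYPE Test [
<!ENTITY ex "an existential entity">
]>
<Test>
<!-- Deleted: A1500477 -->
<entry>text 1 is &ex;</entry>
<entry>text 2</entry>
</Test>"""


class RecordingTarget:
    """Hypothetical target that stores each callback as a (name, value) pair."""

    def __init__(self):
        self.calls = []

    def start(self, tag, attrib):
        self.calls.append(("start", tag))

    def end(self, tag):
        self.calls.append(("end", tag))

    def data(self, data):
        # Text may arrive in several pieces, so callers should join the chunks.
        self.calls.append(("data", data))

    def comment(self, text):
        self.calls.append(("comment", text))

    def close(self):
        # Whatever close() returns is what parser.close() hands back.
        return self.calls


parser = ET.XMLParser(target=RecordingTarget())
parser.feed(XML_BYTES)
calls = parser.close()

starts = [value for name, value in calls if name == "start"]
ends = [value for name, value in calls if name == "end"]
text = "".join(value for name, value in calls if name == "data")
print(starts, ends)
```

Running the same script once per library and diffing the two `calls` lists would show exactly which callbacks differ, instead of comparing printed output by eye.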