XML/XHTML/HTML differences, bugs... and howto

Thu Jan 24 01:42:22 EST 2013

Andrew Robinson, 23.01.2013 16:22:
> Good day :),
> 
> I've been exploring XML parsers in python; particularly:
> xml.etree.cElementTree; and I'm trying to figure out how to do it
> incrementally, for very large XML files -- although I don't think the
> problems are restricted to incremental parsing.
> 
> First problem:
> I've come across an issue where etree silently drops text without telling
> me; and separate.
> 
> I am under the impression that XHTML is a subset of XML (eg:defined tags),
> and that once an HTML file is converted to XHTML, the body of the document
> can be handled entirely as XML.
> 
> If I convert a (partial/contrived) html file like:
> 
> <html>
>     <div>
>         <p> This is example <b>bold</b> text.
>     </div>
> </html>
> 
> to XHTML, I might do --right or wrong-- (1):
> 
> <html>
>     <div>
>         <p /> This is example <b>bold</b> text.
>     </div>
> </html>
> 
> or, alternate difference: (2): "<p> This is example <b>bold</b> text. </p>"
> 
> But, when I parse with etree,  in example (1) both "This is an example" and
> "text." are dropped;
> The missing text is part of the start, or end event tags, in the
> incrementally parsed method.
> 
> Likewise: In example (2), only "text" gets dropped.

Nope, you should read the manual on this. Here's a tutorial:

http://lxml.de/tutorial.html#elements-contain-text

This is using lxml.etree, which is the Python XML library most people use
these days. It's ElementTree compatible, so the tutorial also works for ET
(unless stated otherwise).

> Isn't XML supposed to error out when invalid xml is parsed?

It does.

> I have an XML file which will grow larger than memory on a target machine,
> so here's what I want to do:
> 
> Given a source XML file, and a destination file:
> 1) iteratively scan part of the source tree.
> 2) Optionally Modify some of scanned tree.
> 3) Write partial scan/tree out to the destination file.
> 4) Free memory of no-longer needed (partial) source XML.
> 5) continue scanning a new section of the source file... eg: goto step 1
> until source file is exhausted.
> 
> But, I don't see a way to write portions of an XML tree, or iteratively
> write a tree to disk.
> How can this be done?

There are several ways to do it. Python has a couple of external libraries
available that are made specifically for generating markup incrementally.

lxml also gained that feature recently. It's not documented yet, but here
are usage examples:

https://github.com/lxml/lxml/blob/master/src/lxml/tests/test_incremental_xmlfile.py

Stefan