Elementtree and CDATA handling
and-google at doxdesk.com
Wed Jun 1 14:36:43 EDT 2005
Alain <alainpoint at yahoo.fr> wrote:
> I would expect a piece of XML to be read, parsed and written back
> without corruption [...]. It isn't however the case when it comes
> to CDATA handling.
This is not corruption, exactly. For most intents and purposes, CDATA
sections should behave identically to normal character data. In a real
XML-based browser (such as Mozilla in application/xhtml+xml mode), this
line of script would actually work fine:
> if (a < b && a > 0) {
The problem is you're (presumably) producing output that you want to be
understood by things that are not XML parsers, namely legacy-HTML web
browsers, which have special exceptions-to-the-rule like "<script>
elements don't contain markup" that are not present in XML.
ElementTree is a data binding that strives to simplify the XML
processing experience, and as such it folds CDATA sections down to
plain characters - this is usually easier for programmers to deal with.
Such a feature is considered normal in XML processing, and is the
default for, e.g., DOM Level 3 implementations.
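To illustrate the folding, here is a minimal sketch. It assumes the present-day standard-library module xml.etree.ElementTree (the standalone package of the time was imported as elementtree.ElementTree), and the sample document is made up for the example. The CDATA section arrives as ordinary text and is written back out escaped:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the script line from above, wrapped in a CDATA section.
source = "<script><![CDATA[if (a < b && a > 0) {]]></script>"

elem = ET.fromstring(source)

# The parser hands back plain character data; the CDATA markers are gone.
print(elem.text)  # if (a < b && a > 0) {

# On output the same characters are escaped instead of being wrapped in CDATA.
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)  # <script>if (a &lt; b &amp;&amp; a &gt; 0) {</script>
```

To an XML parser the two forms carry the same content; only a non-XML consumer sees a difference.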
If, instead, you want to keep track of where the CDATA sections are,
and output them again without change, you'll need to use an
XML-handling interface that supports this feature. Typically, DOM
implementations do - the default Python minidom does, as does pxdom.
DOM is a more comprehensive but less friendly/Python-like interface for
XML processing.
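For comparison, a small sketch with the standard-library minidom, using the same made-up input as an assumption. The CDATA section is kept as its own node type and is written out again as a CDATA section:

```python
from xml.dom import minidom

# Hypothetical input: one script element holding a CDATA section.
source = "<script><![CDATA[if (a < b && a > 0) {]]></script>"

doc = minidom.parseString(source)
node = doc.documentElement.firstChild

# minidom records the section as a CDATASection node rather than plain Text.
is_cdata = node.nodeType == node.CDATA_SECTION_NODE
print(is_cdata)
print(node.data)

# Serialising the element writes the section back with its CDATA markers.
output = doc.documentElement.toxml()
print(output)
```

The price is the more verbose DOM API: you walk childNodes and check nodeType instead of reading a single .text attribute.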
There are a few other obstacles you may meet if you are outputting XML
for use by a non-XML parser (legacy browsers):
- entity references - &eacute; etc. The HTML entities are not
built into XML so to read them at all you'll need a parser that
reads the external DTD subset (and a suitable !DOCTYPE). Even then
they'll be converted to text, if that matters. (pxdom, optionally,
can keep them as entity references regardless of whether their
content is known);
- empty elements - <img/> etc. An XML serialiser won't know how to
output this in a browser-compatible way. (The next release of pxdom
has an option to do so.)
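Both obstacles are easy to reproduce. The sketch below assumes the standard-library xml.etree.ElementTree and two made-up fragments; it is only meant to show the symptoms, not a fix:

```python
import xml.etree.ElementTree as ET

# &eacute; is defined by the HTML DTDs, not by XML itself, so a parser
# that does not read a DTD rejects it as an undefined entity.
try:
    ET.fromstring("<p>caf&eacute;</p>")
    entity_error = None
except ET.ParseError as exc:
    entity_error = exc
print("parse failed:", entity_error)

# An element with no content is written in XML's self-closing form,
# which an XML parser accepts but a legacy HTML parser may not expect.
empty = ET.tostring(ET.Element("img"), encoding="unicode")
print(empty)  # <img />
```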
If you're generating output for legacy browsers, you might want to just
use a 'real' HTML serialiser.
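One possible way to do that today, offered as an assumption rather than anything available when this was written: current versions of the standard-library ElementTree accept method="html" when serialising. A minimal sketch on a made-up document:

```python
import xml.etree.ElementTree as ET

# Hypothetical document combining the script line and an empty element.
root = ET.fromstring(
    "<body><script><![CDATA[if (a < b && a > 0) {]]></script><img/></body>"
)

# method="html" leaves <script> content unescaped and writes empty
# elements such as <img> without a closing slash or end tag.
html = ET.tostring(root, encoding="unicode", method="html")
print(html)
# <body><script>if (a < b && a > 0) {</script><img></body>
```

The result is no longer well-formed XML, so it only suits output aimed at HTML parsers.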
--
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/