Elementtree and CDATA handling

Wed Jun 1 14:36:43 EDT 2005

Alain <alainpoint at yahoo.fr> wrote:

> I would expect a piece of XML to be read, parsed and written back
> without corruption [...]. It isn't however the case when it comes
> to CDATA handling.

This is not corruption, exactly. For most intents and purposes, CDATA
sections should behave identically to normal character data. In a real
XML-based browser (such as Mozilla in application/xhtml+xml mode), this
line of script would actually work fine:

> if (a &lt; b &amp;&amp; a &gt; 0) {

The problem is you're (presumably) producing output that you want to be
understood by things that are not XML parsers, namely legacy-HTML web
browsers, which have special exceptions-to-the-rule like "<script>
elements don't contain markup" that are not present in XML.

ElementTree is a data binding that strives to simplify the XML
processing experience, and as such it folds CDATA sections down to
plain characters - this is usually easier for programmers to deal with.
Such a feature is considered normal in XML processing, and is the
default for, e.g., DOM Level 3 implementations.
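
To see the folding for yourself, here is a minimal sketch. It assumes
Python 3 and the xml.etree.ElementTree module from the standard library;
the sample script line and the variable names are made up for
illustration.

```python
import xml.etree.ElementTree as ET

# A script element whose content is wrapped in a CDATA section.
source = "<script><![CDATA[if (a < b && a > 0) { run(); }]]></script>"

elem = ET.fromstring(source)

# The parser hands back plain character data; the CDATA markers are gone.
text = elem.text

# On output the same characters are written with ordinary escaping.
serialised = ET.tostring(elem, encoding="unicode")

# An XML parser reading the escaped form sees exactly the same text.
round_tripped = ET.fromstring(serialised).text

print(text)
print(serialised)
```

The two forms are the same document as far as an XML parser is
concerned; only the spelling of the character data differs.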

If, instead, you want to keep track of where the CDATA sections are,
and output them again without change, you'll need to use an
XML-handling interface that supports this feature. Typically, DOM
implementations do - the default Python minidom does, as does pxdom.
DOM is a more comprehensive but less friendly/Python-like interface for
XML processing.
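
As a rough sketch of the DOM route, assuming Python 3 and the standard
xml.dom.minidom module (the sample input is again made up):

```python
from xml.dom import minidom

source = "<script><![CDATA[if (a < b && a > 0) { run(); }]]></script>"

doc = minidom.parseString(source)
script = doc.documentElement

# minidom keeps the section as a distinct CDATASection node
# rather than folding it into an ordinary Text node.
node = script.firstChild
is_cdata = node.nodeType == minidom.Node.CDATA_SECTION_NODE

# Serialising the element writes the CDATA section back out.
out = script.toxml()

print(is_cdata)
print(out)
```

The price is the more verbose DOM API: you walk nodes and check
nodeType instead of reading a .text attribute.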

There are a few other obstacles you may meet if you are outputting XML
for use by a non-XML parser (legacy browsers):

  - entity references - &eacute; etc. The HTML entities are not
    built into XML so to read them at all you'll need a parser that
    reads the external DTD subset (and a suitable !DOCTYPE). Even then
    they'll be converted to text, if that matters. (pxdom, optionally,
    can keep them as entity references regardless of whether their
    content is known);

  - empty elements - <img/> etc. An XML serialiser won't know how to
    output this in a browser-compatible way. (The next release of pxdom
    has an option to do so.)
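
The entity obstacle is easy to reproduce. A small sketch, assuming
Python 3 and the standard xml.etree.ElementTree module with no DOCTYPE
in the input; the sample strings are illustrative only:

```python
import xml.etree.ElementTree as ET

# An HTML entity name is not defined in plain XML, so parsing fails.
error = None
try:
    ET.fromstring("<p>caf&eacute;</p>")
except ET.ParseError as exc:
    error = exc

# A numeric character reference needs no DTD, but it is converted
# to text on the way in and is not preserved as a reference.
elem = ET.fromstring("<p>caf&#233;</p>")
text = elem.text

print(error)
print(text)
```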

If you're generating output for legacy browsers, you might want to just
use a 'real' HTML serialiser.
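
One possible way to do that, as a sketch: the xml.etree.ElementTree
module in current Python standard libraries accepts method="html" when
serialising. This assumes Python 3; older ElementTree releases may not
have the option, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

source = (
    '<body><img src="a.png"/>'
    "<script><![CDATA[if (a < b && a > 0) { run(); }]]></script>"
    "</body>"
)
root = ET.fromstring(source)

# XML rules: script text is escaped, the empty element is self-closed.
xml_out = ET.tostring(root, encoding="unicode", method="xml")

# HTML rules: script text is written raw, <img> gets no end tag.
html_out = ET.tostring(root, encoding="unicode", method="html")

print(xml_out)
print(html_out)
```

The HTML form is what a legacy browser expects, but it is no longer
well-formed XML, so don't feed it back to an XML parser.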

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/