[XML-SIG] parsing CDATA section

Fri Dec 30 01:25:10 CET 2005

Ankit Rastogi wrote:
> I am having a problem parsing a CDATA section. I am using PyXML and minidom to parse the XML document.
>   My goal is to get the data back as one string, in the same format as it is written in the XML file. Here is the sample:
>   --
>   <StateChg>
>       <![CDATA[
>                 check.. its cdata section
>   
>                  all data is printed in it s format
>                  ]]>
> </StateChg>
>   --
>   but when I print all its children using:
>   ---
>    print StChg.childNodes  #StChg. is instance to <StateChg> element
>   --
>   It gives following output
>   --
>   [<DOM Text node "\n">, <DOM Text node "\t\t\t\t \t">, <DOM Text node "\n">, <DOM Text node "\t\t\t\t\t chec...">, <DOM Text node "\n">, <DOM Text node "\t\t\t\t\t all ...">, <DOM Text node "\n">, <DOM Text node "\t\t\t\t\t ">, <DOM Text node "\n">, <DOM Text node "\t\t\t\t\t">]
> --
>    
>   The output shows Text nodes, but we declared the content as a CDATA section (CDATA_SECTION_NODE).
>    
>   Also, the output is not what I want (the formatting and some of the data are lost).
>    
>   Why is this happening? What do I have to do to get the same output as in the XML, with its formatting and indentation?
>    
>   Please correct me where I am wrong.

In XML, you have the logical constructs: elements, attributes, character data,
processing instructions, and comments. You use markup to represent these
constructs. And you are currently using the DOM API to access an abstract
representation of them -- an implicit tree of nodes.

At the markup level, a span of character data can be written using either (1)
literal characters, numeric character references, and entity references, or
(2) a CDATA section, consisting of literal characters only, bounded by start
and end markers. There is no semantic difference between the two; they are
just two different ways of writing the same thing. Thus "1 &amp; 2 are &lt; 3"
in regular markup means exactly the same as "1 & 2 are < 3" in a CDATA
section.
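
To make the equivalence concrete, here is a minimal sketch that parses both
spellings and compares the resulting character data. It assumes a current
Python 3 with the standard-library minidom (the sessions later in this message
were run on Python 2), and text_of() is a hypothetical helper name, not part
of minidom.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def text_of(xml):
    """Join the data of every Text/CDATASection child of the root element."""
    root = parseString(xml).documentElement
    return "".join(
        child.data
        for child in root.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )

# Same character data, written with references and with a CDATA section.
escaped = text_of("<test>1 &amp; 2 are &lt; 3</test>")
cdata = text_of("<test><![CDATA[1 & 2 are < 3]]></test>")

print(escaped == cdata)  # True
```

Joining the children rather than reading a single node keeps the helper
correct whether or not the parser reports the data in several chunks.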

It is common for a parser to report each span of character data separately. It
may say "this character data was written using a CDATA section" and "this
character data was written with regular markup"; or it may just say "I saw
this character data, and then I saw this other character data". It is also
possible that character and entity references in the markup will be treated as
separate spans of character data. Very long spans might be split as well. 

Consequently, in both the SAX and DOM APIs, these separate reports from the
parser *may* manifest as separate, consecutive 'characters' events (in SAX) or
as separate, adjacent Text nodes (in DOM). You must be prepared to see the
data arrive in chunks.
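
On the SAX side, being prepared for chunks usually just means accumulating
every 'characters' event and joining the pieces at the end. A rough sketch,
assuming Python 3 and the standard xml.sax module; TextCollector and
collect_text() are names made up for this example:

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Accumulate character data, however many events it arrives in."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # May be called several times for one logical span of text.
        self.chunks.append(content)

def collect_text(xml_bytes):
    """Parse the document and return all of its character data as one string."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)

text = collect_text(
    b"<test>1 &amp; 2 are &lt; 3 ... <![CDATA[1 & 2 are < 3]]></test>"
)
print(text)
```

The handler never assumes how many events it will receive, so it behaves the
same no matter where the parser decides to split the data.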

You must also realize that it is not incorrect to see CDATA sections as Text
nodes in an implementation that only supports the Core Interfaces of DOM.

DOM does have a CDATASection node, which is a subclass of Text, but it is only
in the Extended Interfaces, which are optional. So if an implementation
chooses to support DOM's Extended Interfaces, then CDATA will manifest as
CDATASection instead of Text.

CDATASection nodes are in fact supported in newer versions of minidom, despite
the docs at http://python.org/doc/2.4.2/lib/minidom-and-dom.html which say
otherwise.  These nodes and some of the other extended interfaces blur the
distinction between lexical markup and the logical constructs that the markup
is intended to represent, so they typically make things harder for users,
which is why they're optional.

As Dieter Maurer pointed out, you can merge adjacent Text nodes by calling the
normalize() method on any ancestor of the nodes. However, by design, this only
works on Text nodes, not CDATASection nodes, as per DOM requirements.

Python 2.2.3:
>>> from xml.dom.minidom import getDOMImplementation, parseString
>>> impl = getDOMImplementation()
>>> impl.hasFeature('Core', '2.0') # core interfaces?
1
>>> impl.hasFeature('XML', '2.0')  # extended interfaces?
0
>>> doc = parseString('<test>1 &amp; 2 are &lt; 3 ... <![CDATA[1 & 2 are < 3]]></test>')
>>> doc.childNodes[0].childNodes
[<DOM Text node "1 ">, <DOM Text node "&">, <DOM Text node " 2 are ">, <DOM Text node "<">, <DOM Text node " 3 ... ">, <DOM Text node "1 & 2 are ...">]
>>> doc.normalize()
>>> doc.childNodes[0].childNodes
[<DOM Text node "1 & 2 are ...">]
>>> doc.childNodes[0].childNodes[0].data
u'1 & 2 are < 3 ... 1 & 2 are < 3'

Python 2.4.2:
>>> from xml.dom.minidom import getDOMImplementation, parseString
>>> impl = getDOMImplementation()
>>> impl.hasFeature('Core', '2.0') # core interfaces?
True
>>> impl.hasFeature('XML', '2.0')  # extended interfaces?
True
>>> doc = parseString('<test>1 &amp; 2 are &lt; 3 ... <![CDATA[1 & 2 are < 3]]></test>')
>>> doc.childNodes[0].childNodes
[<DOM Text node "1 & 2 are ...">, <DOM CDATASection node "1 & 2 are ...">]
>>> doc.normalize()
>>> doc.childNodes[0].childNodes
[<DOM Text node "1 & 2 are ...">, <DOM CDATASection node "1 & 2 are ...">]
>>> doc.childNodes[0].childNodes[0].data
u'1 & 2 are < 3 ... '
>>> doc.childNodes[0].childNodes[1].data
u'1 & 2 are < 3'

If you need to merge adjacent CDATASection nodes and/or mixed Text and
CDATASection nodes, there is no built-in function to do that; you'll have to
roll your own. There's also no way to disable the creation of CDATASection
nodes in minidom.
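
One way to roll your own might look like the following sketch. It assumes
Python 3 and the standard-library minidom; merge_character_data() and
_collapse() are hypothetical names, not minidom functions. Note that it
replaces every run of Text/CDATASection nodes with a plain Text node, so the
fact that something was originally written as a CDATA section is deliberately
discarded.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

TEXTUAL = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)

def _collapse(doc, parent, run):
    """Replace a run of adjacent textual nodes with a single Text node."""
    if not run:
        return
    merged = doc.createTextNode("".join(node.data for node in run))
    parent.insertBefore(merged, run[0])
    for node in run:
        parent.removeChild(node)

def merge_character_data(element):
    """Merge adjacent Text/CDATASection children of element, recursively."""
    doc = element.ownerDocument
    run = []
    # Iterate over a copy, because the child list is modified along the way.
    for child in list(element.childNodes):
        if child.nodeType in TEXTUAL:
            run.append(child)
            continue
        _collapse(doc, element, run)
        run = []
        if child.nodeType == Node.ELEMENT_NODE:
            merge_character_data(child)
    _collapse(doc, element, run)

doc = parseString(
    "<test>1 &amp; 2 are &lt; 3 ... <![CDATA[1 & 2 are < 3]]></test>"
)
merge_character_data(doc.documentElement)
print(doc.documentElement.childNodes[0].data)
# 1 & 2 are < 3 ... 1 & 2 are < 3
```

After the call, each element has at most one Text node between any two
non-text children, which is what normalize() would give you if the document
had contained no CDATASection nodes.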

Anyway, you should not expect to be able to precisely reproduce or even know
exactly what was in the lexical markup in your original, unparsed document
when you run it through a parser, and especially not after you access the
parser's reports through a relatively abstract API like DOM or SAX or the
XPath tree model. You can produce XML that *means* the same thing as the
original, but you're not going to get XML that *looks* exactly like the
original. If you expect to do that, then you shouldn't be running your XML
through a real XML parser at all.

Mike