[XML-SIG] CDATA sections still not handled

Wed, 17 Jan 2001 22:14:49 +1300

OK, so now I am getting somewhere in understanding this; more comments below.

On Wed, 17 Jan 2001, Martin v. Loewis wrote:
> > Since one has NO interest in parsing the content, rendering, or
> > interpreting it, but does have an interest in locating a particular
> > node and adding a new fragment to it, then saving the modified
> > document, via ext.PrettyPrint(which I am using), to file again,
> 
> I understand you are not interested in parsing the document; if you
> build a DOM tree, parsing of the document will happen as a side
> effect. You cannot avoid this: this is the only way to get a DOM tree
> from a document. So while you are not interested in the parsing, you
> should accept that it is done.

This is where I see the extra step that is necessary, so tell me if I am on the
right track.  A CDATA section that contains XML will be turned by the parser
into a text node, and when that node is written out again it is still valid
because references are put in place of characters such as "<", i.e. &lt;.  For
example, if someone wrote some naff XML in an input, e.g. "&&<name><<", and
this was escaped in the original document by CDATA, it would come out as a text
node written "&amp;&amp;&lt;name>&lt;&lt;".  Now if that CDATA was supposed to
be XML as well, but was necessarily hidden for a while so that validation could
be performed further along a processing chain, then I also need to write a
processor to replace the references again.  Possibly I could define <!ENTITY>s
for such a translation, so that the parser would see "<" instead of "&lt;".
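
To make the round trip concrete, here is a minimal sketch using Python's
standard xml.etree.ElementTree and xml.sax.saxutils rather than the
ext.PrettyPrint tools discussed in this thread. The element name "note" is made
up for illustration. Note that this serializer also escapes ">" as &gt;, which
is equivalent to leaving it bare:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# Hypothetical input: bad XML hidden inside a CDATA section.
source = "<note><![CDATA[&&<name><<]]></note>"
elem = ET.fromstring(source)

# After parsing, the text holds the literal characters; the CDATA markers are gone.
text = elem.text

# Escaping only happens when the tree is written out again.
serialized = ET.tostring(elem, encoding="unicode")

# Undoing the escaping on the written-out content restores the original text.
inner = serialized[len("<note>"):-len("</note>")]
restored = unescape(inner)

print(text)        # &&<name><<
print(serialized)  # <note>&amp;&amp;&lt;name&gt;&lt;&lt;</note>
print(restored)    # &&<name><<
```

So the references only exist in the serialized file; a parser that reads the
file back sees the literal characters again.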

> 
> > then one obviously does not want CDATA markers to be removed,
> > because, 1) they may have not written the first document, and 2)
> > they are not trying to interpret it,
> 
> Who is "they" here? The CDATA markers? or the users of your tool?
> 

The many people who pick up a document, modify it, and put it back.

> So somebody has not written the document, and that same
> person/entity/whatever is not trying to interpret it. Why does it
> follow that this person/entity does not want the CDATA markers to be
> removed? If that person does not even look at the document, why is
> there any harm done by removing the CDATA markers. They have *no*
> meaning in the document.

Just the above: one wants to take the CDATA at some point and treat it as
either an XML document on its own, or just part of the current XML document.
The CDATA is simply being used to escape sections that could break validation
at earlier points, e.g. on a server, where there may be no chance of handling
bad XML sections.  At a later point, e.g. in some client application, an
exception can be handled nicely, and the CDATA section can then be safely
interpreted.  This is where I see I need reverse translation, and simply cannot
directly parse what used to be a CDATA section.
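
A minimal sketch of that later client-side step, again using Python's standard
xml.etree.ElementTree rather than the tools discussed in this thread. The
function name parse_embedded and the msg/body element names are made up for
illustration. It assumes the content is read back from a parsed tree, where the
parser has already undone the escaping, so the text can be handed to a second
parse and a failure handled as an exception:

```python
import xml.etree.ElementTree as ET

def parse_embedded(outer_xml, tag):
    """Parse the text content of <tag> as XML; return None if it is not well-formed."""
    outer = ET.fromstring(outer_xml)
    # The parser returns literal characters, whether the source used CDATA or &lt;.
    payload = outer.find(tag).text
    try:
        return ET.fromstring(payload)
    except ET.ParseError:
        # The bad section is handled here instead of breaking the outer document.
        return None

good = "<msg><body><![CDATA[<name>Ann</name>]]></body></msg>"
bad = "<msg><body><![CDATA[&&<name><<]]></body></msg>"

print(parse_embedded(good, "body").tag)  # name
print(parse_embedded(bad, "body"))       # None
```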

> 
> > You missed the point entirely in that I don't care where they are in
> > the document.
> 
> I assume "they" is the CDATA markers, here. If you don't care where
> they are in the document, why is it a problem if there is no CDATA
> marker in the output of PrettyPrint?

As above.

> 
> > maybe the following will explain why it is useful ..... which is the
> > hack I use to get CDATA back into the file again.  Presumably you
> > would think that if you opened an xml file into a DOM tree, then
> > saved it again, then it would still be the same "kind" of document,
> 
> That I would think. It should still be the same "kind" of document,
> i.e. have the same elements, the elements should have the same
> attributes, and elements containing text should still contain the same
> text.
> 
> > i.e. CDATA nodes would STILL be CDATA nodes.
> 
> No, I would not think that. Changing CDATA nodes to text does not
> change the document; it is still the same one. Replacing CDATA
> fragments with text is the same kind of transformation as replacing
> &lt; with &#60; - this does not change the document.
> 
> > Yes I assume 1) the node name is unique and 2) that its first child is a
> > text node ......
> > 
> > def convertTextNodeToCDataNodeByName(doc,name):
> >     node_list = doc.getElementsByTagNameNS('',name)
> >     text_node = node_list[0].firstChild
> >     text_data = retPrettyPrint(text_node)
> >     new_cdata_node = makeCDataSection(doc,text_data)
> >     text_node.parentNode.replaceChild(new_cdata_node,text_node)
> 
> That means you know in advance that you only have a single CDATA
> fragment in the original document, you want to produce one in the
> output in the same location (i.e. inside the same element as it was in
> the original input).
> 
> What if there is more than one CDATA section in the original document?
> What if there was none?
> 

I already check that it is a text node, and the node names that are searched
for are guaranteed to be unique, each with a single child node.
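
For reference, the same conversion with those checks made explicit might look
like the sketch below. It uses Python's standard xml.dom.minidom instead of the
retPrettyPrint and makeCDataSection helpers in my hack, and the function name
text_to_cdata and the msg/body element names are made up for illustration:

```python
from xml.dom import minidom, Node

def text_to_cdata(doc, name):
    """Replace the single text child of the unique <name> element with a CDATA section.

    Returns True if the replacement was made, False if any assumption fails.
    """
    nodes = doc.getElementsByTagName(name)
    if len(nodes) != 1:
        return False  # element missing or not unique
    child = nodes[0].firstChild
    if child is None or child.nextSibling is not None:
        return False  # not exactly one child node
    if child.nodeType != Node.TEXT_NODE:
        return False  # child is not a plain text node
    cdata = doc.createCDATASection(child.data)
    nodes[0].replaceChild(cdata, child)
    return True

doc = minidom.parseString("<msg><body>&amp;&amp;&lt;name&gt;</body></msg>")
text_to_cdata(doc, "body")
print(doc.documentElement.toxml())
# <msg><body><![CDATA[&&<name>]]></body></msg>
```

If there is no such element, or more than one, the function returns False and
leaves the document untouched.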

> Regards,
> Martin
--