Re: [XML-SIG] CDATA sections still not handled

matt matt@virtualspectator.com
Fri, 19 Jan 2001 09:17:46 +1300


On Fri, 19 Jan 2001, Mike Orr wrote:
> On Thu, Jan 18, 2001 at 10:27:46PM +1300, matt wrote:
> > For example, say one wants to transport html. 
> > Now html is usually really ugly in that it is hardly ever well formed xml. 
> > Escaping it with CDATA is an easy way to hide that, and giving that data to an
> > html renderer some time later would be fine.  Being in CDATA, it is never
> > parsed for "well formedness".
> 
> I was just about to suggest looking at it this way.  If you have a set
> of records and a certain tag contains HTML, which you don't want to 
> un-CDATA-ize because the (human) editor doesn't want to see or type
> &lt;H1&gt;.

Exactly.


> 
> Three other questions.  Are there certain tags that will always be CDATA,
> or does it differ randomly from document to document?  Do you care
> whether your application changes the whitespace outside that CDATA
> section, making an "equivalent" document?  Or do you want the
> indentation and all to remain exactly as it is?

Hmm, no. In my most common case (e.g. HTML being transported), whitespace is
not an issue. In some instances, though, keeping the exact whitespace within
messages may be useful, e.g. when the message is program code, where it could be
a) critical to preserving scope, or b) again a matter of human readability. In
any case the message sits between message tags, e.g. <message id='5335HJSK3'>,
so it doesn't matter if there are numerous CDATA sections within it, which
would be the case if one were to append more data to the message instead of
doing a node replace.
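
To illustrate the append case, here is a minimal sketch using Python's
standard xml.dom.minidom. The helper name append_cdata and the sample
payloads are made up for the example, and it assumes the appended text
never contains the "]]>" terminator (minidom refuses to serialize that).

```python
from xml.dom import minidom


def append_cdata(element, text):
    """Append text to element as a new CDATA section (hypothetical helper)."""
    doc = element.ownerDocument
    element.appendChild(doc.createCDATASection(text))


# Start from an empty message element and append two chunks of HTML.
doc = minidom.parseString("<message id='5335HJSK3'/>")
message = doc.documentElement
append_cdata(message, "<H1>first</H1>")
append_cdata(message, "<p>second, never closed")

# The element now holds two adjacent CDATA sections.
serialized = message.toxml()

# A consumer simply concatenates the character data of the children.
payload = "".join(node.data for node in message.childNodes)
```

Each append adds one more section, and the reader of the message only
ever sees the concatenated character data.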

> 
> If you know that a certain tag should always be CDATA, and you're
> willing to settle for an "equivalent" document otherwise, then maybe
> it doesn't matter that the parser normalizes CDATA on input, 
> because you can write it out manually and convert that tag body to CDATA.

That is what I currently do, and it works really well, and preserves my sanity
server side.
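
Roughly, that approach can be sketched as below with xml.dom.minidom.
This is an illustration, not my actual server code: the function name
force_cdata is invented, and it assumes the chosen tag holds only
character data (no child elements). A literal "]]>" in the text is split
across two sections so the output stays well formed.

```python
from xml.dom import minidom


def force_cdata(doc, tag_name):
    """Rewrite the body of every tag_name element as CDATA (hypothetical helper)."""
    for element in doc.getElementsByTagName(tag_name):
        # Collect the character data, however the parser delivered it.
        text = "".join(
            child.data
            for child in element.childNodes
            if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
        )
        # Assumption: the element holds only text, so all children can go.
        while element.firstChild:
            element.removeChild(element.firstChild)
        # Split on "]]>" so that no single section contains the terminator.
        parts = text.split("]]>")
        for i, part in enumerate(parts):
            chunk = (">" if i > 0 else "") + part
            if i < len(parts) - 1:
                chunk += "]]"
            if chunk:
                element.appendChild(doc.createCDATASection(chunk))


source = "<doc><message id='5335HJSK3'>&lt;H1&gt;Hi&lt;/H1&gt; a ]]&gt; b</message></doc>"
doc = minidom.parseString(source)
force_cdata(doc, "message")
output = doc.documentElement.toxml()

# Round trip: the character data is unchanged after re-parsing the output.
reparsed = minidom.parseString(output).getElementsByTagName("message")[0]
round_trip = "".join(node.data for node in reparsed.childNodes)
```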

> 
> If the CDATA sections will be coming in at random and you must leave
> the document formatted exactly as it is (minus whatever changes your
> application is supposed to be making to it), then perhaps you need a
> lower-level parser than full XML.  Perhaps then you'll want to consider
> modifying one of the existing XML parser classes or the sgmllib parser
> to fit your needs.

That would defeat my intention of using XML because it is a standard. What you
raise is interesting, though. If I go full circle and revisit my original
complaint that "CDATA sections are still not handled": since one gets CDATA
begin and end events while parsing a document that contains a CDATA section,
why couldn't the DOM document still represent it internally as a CDATA section,
as it was when first created? Furthermore, a parser such as expat will preserve
the original form of the characters that have been escaped, and even convert
them if they happen to be in entity references. It seems to me that CDATA is
handled at the level of its base class, the text node, and that CDATA
sections are only used to say "don't validate the following, it is ALL
character data".
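
For reference, the begin and end events I mean can be observed with
Python's xml.parsers.expat. The sketch below is only an illustration
(the function name cdata_spans and the sample document are made up): it
labels each run of character data according to whether it arrived
inside a CDATA section, which is the information a DOM builder would
need in order to create a CDATA section node instead of a plain text
node.

```python
import xml.parsers.expat


def cdata_spans(xml_text):
    """Return (kind, data) runs, where kind is "text" or "cdata"."""
    runs = []
    state = {"in_cdata": False}

    def start_cdata():
        state["in_cdata"] = True

    def end_cdata():
        state["in_cdata"] = False

    def characters(data):
        kind = "cdata" if state["in_cdata"] else "text"
        # expat may deliver one run in several pieces, so merge neighbours.
        if runs and runs[-1][0] == kind:
            runs[-1] = (kind, runs[-1][1] + data)
        else:
            runs.append((kind, data))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartCdataSectionHandler = start_cdata
    parser.EndCdataSectionHandler = end_cdata
    parser.CharacterDataHandler = characters
    parser.Parse(xml_text, True)
    return runs


runs = cdata_spans("<m>a &amp; b<![CDATA[<H1>x & y]]>c</m>")
```

Entity references in ordinary text arrive already converted, while the
CDATA run arrives exactly as written, with the section boundaries still
known to the application.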

> 
> -- 
> -Mike (Iron) Orr, iron@mso.oz.net  (if mail problems: mso@jimpick.com)
>    http://mso.oz.net/     English * Esperanto * Russkiy * Deutsch * Espan~ol
-- 

regards
Matt