Re: [XML-SIG] CDATA sections still not handled

Sat, 20 Jan 2001 09:19:33 +1300

Sorry to keep this thread going, but now it's getting really interesting ...
and useful.

On Fri, 19 Jan 2001, Lars Marius Garshol wrote:
> * matt@virtualspectator.com
> | 
> | [...] since one gets CDATA begin and end events while parsing a
> | document that contains CDATA section, then why couldn't the DOM
> | document still represent it as a CDATA section internally?  
> 
> Because it would be a real pain, and would most likely break lots of
> applications. If text nodes can suddenly be represented as both text
> and cdata nodes, applications that only test for text nodes (and I
> assume this is the majority) will be silently losing data.

That would make either the implementation of CDATA wrong, or the way you use
it.  Text nodes are the base class of CDATA nodes, so a process that works on
text nodes will implicitly work on CDATA nodes ... which, fortunately, it
does.  Even if you try a type cast to assert this, you should get a valid base
class pointer back ... not that Python, on the face of it, worries too much
about that.
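
To make the two readings concrete, here is a minimal sketch.  It assumes
the xml.dom.minidom module from Python's standard library, which may differ
from the DOM implementation under discussion; the two helper names are
hypothetical.  A class-based test accepts a CDATA node as text, while a test
on nodeType does not.

```python
from xml.dom import Node, minidom

doc = minidom.Document()
text = doc.createTextNode("a < b")
cdata = doc.createCDATASection("a < b")


def is_text_by_class(node):
    """Class-based test: any subclass of minidom.Text counts as text."""
    return isinstance(node, minidom.Text)


def is_text_by_node_type(node):
    """nodeType-based test: only plain text nodes (TEXT_NODE) count."""
    return node.nodeType == Node.TEXT_NODE


for node in (text, cdata):
    print(node.nodeName, is_text_by_class(node), is_text_by_node_type(node))
```

In minidom, an application that checks the class sees both nodes as text,
whereas one that compares nodeType with TEXT_NODE skips the CDATA node.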

Otherwise I am confused as to what you mean.  It seems to me anyway that
everyone has been trying to make the argument that they are one and the same,
which they are in the interpretation sense.  A parser such as expat handles the
inheritance perfectly, since for a CDATA section it will give you CDATA begin
and end events while passing the data itself into the character data handlers.
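
The event order described above can be observed with a short sketch.  It
assumes the xml.parsers.expat binding from Python's standard library; the
function name cdata_events and the event labels are hypothetical.

```python
import xml.parsers.expat


def cdata_events(xml_text):
    """Return the CDATA and character data events expat reports, in order."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # Coalesce adjacent character data so each run arrives as one chunk.
    parser.buffer_text = True
    parser.StartCdataSectionHandler = lambda: events.append(("cdata-start",))
    parser.EndCdataSectionHandler = lambda: events.append(("cdata-end",))
    parser.CharacterDataHandler = lambda data: events.append(("chars", data))
    parser.Parse(xml_text, True)
    return events


for event in cdata_events("<a>x<![CDATA[<y>]]>z</a>"):
    print(event)
```

The content of the CDATA section reaches the same character data handler
as ordinary text; only the begin and end events mark where it came from.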

I don't see things breaking anywhere.

> 
> Furthermore, the normalize method, which many applications use to
> ensure that there are no adjacent text nodes in the DOM tree stops
> working in the presence of cdata nodes, since these are not
> normalized. 

Perhaps the specification for normalize on a node's subtree is wrong, or you
expect it to always give you a nice single replacement node.  I think it is
equally wrong to flatly remove all CDATA nodes without giving the user a handle
to keep them.  They serve a useful purpose, and it seems bizarre that the DOM
document builder just throws away the events that tell us we have come across a
CDATA section.  Perhaps it should sit at the level of normalize itself: pass
an extra optional argument that translates CDATA nodes and therefore includes
them in the merge?
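
A rough sketch of that idea, written as a separate helper rather than as a
new argument to normalize.  It assumes Python's standard xml.dom.minidom;
the name normalize_including_cdata is hypothetical and not part of any DOM
API.

```python
from xml.dom import Node, minidom


def normalize_including_cdata(node):
    """Turn CDATA children into plain text nodes, then normalize the subtree."""
    for child in list(node.childNodes):
        if child.nodeType == Node.CDATA_SECTION_NODE:
            # Replace the CDATA node with an equivalent plain text node.
            replacement = node.ownerDocument.createTextNode(child.data)
            node.replaceChild(replacement, child)
        elif child.nodeType == Node.ELEMENT_NODE:
            normalize_including_cdata(child)
    # Adjacent text nodes, including the converted ones, are merged here.
    node.normalize()


doc = minidom.Document()
root = doc.appendChild(doc.createElement("a"))
root.appendChild(doc.createTextNode("x"))
root.appendChild(doc.createCDATASection("<y>"))
root.appendChild(doc.createTextNode("z"))

root.normalize()
before = len(root.childNodes)   # the CDATA node keeps the text nodes apart
normalize_including_cdata(root)
after = len(root.childNodes)
print(before, after, repr(root.firstChild.data))
```

Callers who want to keep their CDATA nodes would simply call the ordinary
normalize, so the default behaviour stays as it is.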

> 
> | Furthermore, a parser such as expat will preserve the original form
> | of the characters that have been escaped, and even convert them if
> | they happened to be in entity references.  
> 
> What are you trying to say here?

That it doesn't matter which way you represent any "hidden" markup, e.g. as
&lt; or as < within a CDATA section: expat will give '<' to the character data
handler.  Which is useful.
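
A small check of this, assuming Python's standard xml.parsers.expat binding
(the function name character_data is hypothetical):

```python
import xml.parsers.expat


def character_data(xml_text):
    """Collect everything expat passes to the character data handler."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # expat may deliver the data in several pieces, so gather and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_text, True)
    return "".join(chunks)


escaped = character_data("<a>1 &lt; 2</a>")
in_cdata = character_data("<a><![CDATA[1 < 2]]></a>")
print(repr(escaped), repr(in_cdata), escaped == in_cdata)
```

Both spellings produce the same character data, so a handler that only
looks at the data cannot tell which form the document used.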

> 
> | It seems to me that the handling of CDATA sits at the level of its
> | base class which is a text node and that the CDATA sections are only
> | used to say "don't validate the following, it is ALL character
> | data"..
> 
> CDATA sections and ordinary 'text'[1] are just two ways to represent
> the same thing, and applications should not care which of the two ways
> have been used. The distinction between these two ways of representing
> character data is information about how the document was put together,
> as opposed to information about what is in the document. 
> 
> In other words, this issue is really the same as the issues 'white
> space in tags is lost', 'I can't tell what character data came from
> numeric character references' and so on.
> 
> I think your current way of handling it, to control what is
> represented as CDATA in the serializer, is the correct way to do it.
> One should consider very carefully before adding information of this
> sort to the document tree (or event stream), because there is such an
> unbelievably awful lot of it that it needs to be handled with the
> greatest of care.
> 

But when you build a CDATA section in a DOM document you get a CDATA section
object, which, I assume, should inherit from the Text node class.

> I have been thinking lately that it would be an interesting experiment
> to make an XML parser with an interface specialized for representing
> ALL the lexical information about a document. I guess this could be
> done by passing along with every event the list of tokens that made up
> that event.

What sort of representation?

> 
> --Lars M.
> 
> [1] Correct terminology is really to call it character data. Text, as
>     defined by XML, is both markup and character data.
>

Yes ... but since Text nodes inherit from character data, I just left that
alone.

regards
Matt
