Re: [XML-SIG] CDATA sections still not handled

22 Jan 2001 10:27:06 +0100

* Lars Marius Garshol
|
| Because it would be a real pain, and would most likely break lots of
| applications. If text nodes can suddenly be represented as both text
| and cdata nodes, applications that only test for text nodes (and I
| assume this is the majority) will be silently losing data.

* matt@virtualspectator.com
| 
| That would make either the implementation of CDATA wrong, or the way
| you use it. 

Well, both, actually. I think the way CDATA is handled by the DOM is
wrong, in that it pushes lexical information[1] into your face and
forces you to deal with it when in 99% of the cases you do not care at
all.

SAX and expat handle this much better, by telling you about the CDATA
without forcing you to care.  xmllib gets it very wrong.

And since no Python DOMs currently create CDATA nodes, and handling
them requires some extra thought, I suspect that the great majority of
DOM applications have no code to handle CDATA nodes appearing instead
of Text nodes.

| Text nodes are base classes of CDATA, so a process that works on text
| nodes will implicitly work on CDATA nodes .... 

Nope, because in most cases you will test for the type of node through
the nodeType attribute, and that has different values for CDATA and
Text.

You can also test via the isinstance function, but that would tie your
application to a specific implementation and would be a very bad idea.
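
To make the nodeType point concrete, here is a minimal sketch using
xml.dom.minidom from the standard library. The tree is built by hand,
so it does not depend on what any particular parser creates, and the
helper names (build_sample, text_only, text_and_cdata) are made up for
this illustration.

```python
from xml.dom import Node
from xml.dom.minidom import Document


def build_sample():
    """Build <doc>a<![CDATA[b]]>c</doc> by hand (hypothetical sample)."""
    doc = Document()
    root = doc.createElement("doc")
    doc.appendChild(root)
    root.appendChild(doc.createTextNode("a"))
    root.appendChild(doc.createCDATASection("b"))
    root.appendChild(doc.createTextNode("c"))
    return root


def text_only(element):
    """Collect character data the way many applications do: Text nodes only."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == Node.TEXT_NODE
    )


def text_and_cdata(element):
    """Collect character data from both Text and CDATA section nodes."""
    wanted = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    return "".join(
        child.data for child in element.childNodes if child.nodeType in wanted
    )


root = build_sample()
print(repr(text_only(root)))       # the CDATA content is silently skipped
print(repr(text_and_cdata(root)))  # all three pieces are collected
```

The first function is the kind of code that would silently lose data.
An isinstance test against minidom's Text class would happen to accept
the CDATA node too, but only because of how that one implementation
lays out its classes.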

| Even if you try a type cast to assert this you should get a valid
| base class pointer back .... not that python on its face worries
| too much about that.

:-)

| Otherwise I am confused as to what you mean.  It seems to me anyway
| that everyone has been trying to make the argument that they are one
| and the same, which they are in the interpretation sense.  

Exactly.

| A parser such as expat handles the inheritance perfectly since for a
| CDATA section it will give you CDATA begin and end events while
| passing the data itself into character data handlers.

This is the way to handle it, yes.
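
A small sketch of that behaviour with the pyexpat bindings
(xml.parsers.expat in today's standard library). The function names
trace_events and text_of and the event tuples are invented for the
example; only the handler attributes belong to the parser object.

```python
from xml.parsers import expat


def trace_events(xml_text):
    """Parse xml_text and record CDATA boundary and character data events."""
    events = []
    parser = expat.ParserCreate()
    # Ask pyexpat to coalesce adjacent character data before reporting it.
    parser.buffer_text = True
    parser.StartCdataSectionHandler = lambda: events.append(("cdata-start",))
    parser.EndCdataSectionHandler = lambda: events.append(("cdata-end",))
    parser.CharacterDataHandler = lambda data: events.append(("chars", data))
    parser.Parse(xml_text, True)
    return events


def text_of(events):
    """An application that ignores the CDATA markers still sees all the text."""
    return "".join(event[1] for event in events if event[0] == "chars")


events = trace_events("<doc>a<![CDATA[b]]>c</doc>")
for event in events:
    print(event)
print(repr(text_of(events)))
```

An application that does not care about CDATA simply never installs
the two boundary handlers and loses nothing.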

| I don't see things breaking anywhere.

Not with expat, but with the DOM and xmllib, chances are that
applications written by people who do not fully know XML and the API
they are using will break when CDATA starts appearing.

* Lars Marius Garshol
|
| Furthermore, the normalize method, which many applications use to
| ensure that there are no adjacent text nodes in the DOM tree, stops
| working in the presence of cdata nodes, since these are not
| normalized. 

* matt@virtualspectator.com
|
| Perhaps the specification for normalize on a node's sub-tree is
| wrong, or, you expect it to always give you a nice single
| replacement node.  I think it is equally wrong to flatly remove all
| CDATA nodes without giving the user a handle to keep them.  

Well, I think the whole cake should have been cut up differently.
Text nodes should have a method isCDATA that could be used to check
whether it originally was a CDATA section or not. (Note that this
requires CDATA sections to give rise to separate DOM nodes, but they
tend to do that anyway.)

Normalize would then collapse both text and CDATA, which IMHO is the
only reasonable behaviour for it anyway. It is only useful to simplify
traversal of the tree, but it doesn't achieve that if CDATA nodes are
not normalized.
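
As a rough sketch of what such a collapsing normalize could look like
on top of an existing DOM, here is a version for xml.dom.minidom. It
is only one possible approach: it first turns every CDATA section into
an ordinary text node and then calls the stock normalize(). The names
cdata_to_text and normalize_all are hypothetical, and the CDATA/text
distinction is thrown away in the process.

```python
from xml.dom import Node
from xml.dom.minidom import Document


def cdata_to_text(node):
    """Replace every CDATA section below node with an equivalent Text node."""
    doc = node.ownerDocument
    for child in list(node.childNodes):  # copy: the child list is modified
        if child.nodeType == Node.CDATA_SECTION_NODE:
            node.replaceChild(doc.createTextNode(child.data), child)
        elif child.nodeType == Node.ELEMENT_NODE:
            cdata_to_text(child)


def normalize_all(element):
    """Collapse adjacent text and CDATA nodes into single Text nodes."""
    cdata_to_text(element)
    element.normalize()


# Hand-built sample: text, CDATA section, text as adjacent siblings.
doc = Document()
root = doc.createElement("doc")
doc.appendChild(root)
root.appendChild(doc.createTextNode("a"))
root.appendChild(doc.createCDATASection("b"))
root.appendChild(doc.createTextNode("c"))

root.normalize()
print(len(root.childNodes))   # the CDATA node keeps the text nodes apart

normalize_all(root)
print(len(root.childNodes), repr(root.firstChild.data))
```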

Any user that cares about the CDATA/text distinction will then have to
do without normalize(), but I doubt that they will care much, and in
any case they are a very small minority.

| They serve a useful purpose, and it seems bizarre that the DOM
| document builder just throws away the events that tell us we have
| come across a CDATA node.  Perhaps it should sit at the level of
| normalize itself .... pass an extra optional argument that
| translates CDATA nodes and therefore includes them in the merge?

That is an option, but I don't really like it.  If you keep the CDATA
interface you should really have HexNumericCharacterReference and
DecimalNumericCharacterReference interfaces as well.

| That it doesn't matter which way you represent any "hidden" markup
| eg as &lt; or as < within a CDATA section, expat will give '<' to
| the character data handler.  Which is useful.

Uh, no, it's actually wrong, and that's probably why expat doesn't do
it either. :-)

* Lars Marius Garshol
|
| I have been thinking lately that it would be an interesting experiment
| to make an XML parser with an interface specialized for representing
| ALL the lexical information about a document. I guess this could be
| done by passing along with every event the list of tokens that made up
| that event.

* matt@virtualspectator.com
|
| What sort of representation?

Well, say that you have an event-based interface like SAX, pyexpat or
something else, and that the event for character data is

  character_data(data, raw)

where raw is a list of tokens.  So for the document

  <doc>
  &#65;
  <![CDATA[ wheee! ]]>
  Testing testing.
  </doc>

you get these calls

  character_data('\012  ', ['\012  '])
  character_data('A', ['&#65;'])
  character_data(' wheee! ', ['<![CDATA[...'])
  character_data('\012  Test...', ['\012  Test...'])

while for start tags you might get something like

  start_element('doc', {...}, ['<doc', ' ', 'version', '=', '"1.0"', '>'])
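
To give a feel for the character data side of this, here is a toy
sketch, not a real parser. It handles only plain text, numeric
character references and CDATA sections inside an element's content,
and yields (data, raw) pairs in the shape described above. The
function name character_events and the regular expression are
assumptions made for the example. Unlike the abbreviated call list
above, it also reports the whitespace between the character reference
and the CDATA section.

```python
import re

# One alternative per kind of lexical token this sketch understands.
_TOKEN = re.compile(
    r"<!\[CDATA\[(?P<cdata>.*?)\]\]>"   # CDATA section
    r"|&#x(?P<hex>[0-9A-Fa-f]+);"       # hexadecimal character reference
    r"|&#(?P<dec>[0-9]+);"              # decimal character reference
    r"|(?P<text>[^<&]+)",               # ordinary character data
    re.DOTALL,
)


def character_events(content):
    """Yield (data, raw) pairs for element content without child elements."""
    pos = 0
    while pos < len(content):
        match = _TOKEN.match(content, pos)
        if match is None:
            raise ValueError("unsupported markup at offset %d" % pos)
        raw = match.group(0)
        if match.group("cdata") is not None:
            data = match.group("cdata")
        elif match.group("hex") is not None:
            data = chr(int(match.group("hex"), 16))
        elif match.group("dec") is not None:
            data = chr(int(match.group("dec")))
        else:
            data = raw
        yield data, [raw]
        pos = match.end()


content = "\n  &#65;\n  <![CDATA[ wheee! ]]>\n  Testing testing.\n"
for data, raw in character_events(content):
    print("character_data(%r, %r)" % (data, raw))
```

Since the raw token is kept next to the decoded data, a client can
tell &#65; from &#x41; or from a literal A without any separate
interfaces for each kind of reference.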

--Lars M.