[XML-SIG] sax and entities

Lars Marius Garshol larsga@garshol.priv.no
08 Feb 2001 14:31:05 +0100


* Marsiske Stefan
| 
| i got a little problem. when i want to load an xml file using sax2,
| i loose entities.

You are quite right that you lose information about which character
data came from character entities, and that this information is not
passed on to the DOM.

The reason is that this information is hardly ever wanted, and
preserving all information of this kind would make the SAX API a lot
more complicated.

| in one file (which is actually almost html) i have a "&nbsp;"
| entity, but once loaded that entity in the dom tree is already
| converted to a space. that is quite unfortunate. because i want to
| write this dom tree back after a few changes, but then this &nbsp;
| is lost...

Well, first of all, it should not be converted to a space, but to the
NBSP character, ISO Latin-1 character 160, U+00A0.  If it is converted
to an NBSP character, you still have it, and it will still be there
when you write your DOM tree back, although in a different form.
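
To illustrate, here is a minimal sketch using Python's standard xml.sax
module. The sample document, its internal DTD subset declaring nbsp, and
the TextCollector class are assumptions made up for the illustration;
they are not taken from the original file.

```python
import xml.sax

# Hypothetical sample document; the internal subset declares nbsp so the
# parser can resolve the reference without an external DTD.
DOC = b'<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'


class TextCollector(xml.sax.ContentHandler):
    """Collects all character data reported by the SAX parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver the text in several pieces.
        self.chunks.append(content)


collector = TextCollector()
xml.sax.parseString(DOC, collector)
text = "".join(collector.chunks)

# The entity reference arrives as the NBSP character, not as a space.
print(repr(text), ord(text[1]))
```

The handler never sees the entity name, only the character U+00A0 it
stands for.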

If you really want to have it as an '&nbsp;' in your output XML rather
than as an NBSP character you should do something like

  string.replace(text, "\240", "&nbsp;")

when you write the DOM tree out.  Exactly how to do this will depend
on your DOM implementation.
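
As one possible way of doing this, here is a rough sketch assuming the
DOM is an xml.dom.minidom document and using the str.replace method of
current Python. The function name serialize_with_nbsp and the sample
document are hypothetical. Note that '&nbsp;' is only well-formed in
the output if a DTD declares it; '&#160;' works without any declaration.

```python
from xml.dom import minidom

# Hypothetical sample document with nbsp declared in the internal subset.
DOC = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'


def serialize_with_nbsp(document):
    """Serialize a minidom document, writing NBSP characters as &nbsp;."""
    xml_text = document.toxml()
    # "\xa0" is the same character as "\240" (octal), i.e. U+00A0.
    return xml_text.replace("\xa0", "&nbsp;")


dom = minidom.parseString(DOC)
output = serialize_with_nbsp(dom)
print(output)
```

Replacing on the serialized string is the simplest approach, but it also
touches attribute values and anything else that contains the character.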


I think it would make very good sense, BTW, for the DOM serializers to
provide some mechanism for doing escapings of this kind when
serializing the DOM.  It might be that you pass a dictionary like

  {"\240" : "&nbsp;"}

or perhaps a function. What say ye, DOM implementors?
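
To make the dictionary idea concrete, here is a small sketch of what
such a serializer could look like. It is not the API of any existing
DOM implementation: the serialize function and its escapes parameter
are hypothetical, and only elements, attributes and text are handled.
It relies on escape and quoteattr from xml.sax.saxutils, which accept
an extra replacement dictionary.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, quoteattr


def serialize(node, escapes):
    """Serialize an element or text node, applying `escapes` to its text.

    `escapes` maps characters to the strings written in their place,
    for example {"\\240": "&nbsp;"}.
    """
    if node.nodeType == node.TEXT_NODE:
        # escape() handles &, < and > first, then applies the mapping.
        return escape(node.data, escapes)
    if node.nodeType == node.ELEMENT_NODE:
        attrs = "".join(
            " %s=%s" % (name, quoteattr(value, escapes))
            for name, value in node.attributes.items()
        )
        children = "".join(serialize(child, escapes) for child in node.childNodes)
        return "<%s%s>%s</%s>" % (node.tagName, attrs, children, node.tagName)
    return ""  # comments, processing instructions etc. are skipped here


dom = minidom.parseString('<p title="x&#160;y">a&#160;b &amp; c</p>')
result = serialize(dom.documentElement, {"\240": "&nbsp;"})
print(result)
```

A function instead of a dictionary would work the same way: the
serializer would call it on each piece of character data before
writing it out.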

--Lars M.