[XML-SIG] entity munching monster tracked down!

Dierk Höppner D.Hoeppner@tu-bs.de
Fri, 13 Aug 1999 08:53:26 +0200


Dear SIGgers,

While playing around with the xml package I sent an ordinary HTML 
file through a slightly modified xml/demo/dom/html2html.py. The 
output was HTML, too. Well, almost: every entity except '<', '&' and 
'>' vanished :-(( You can see the same thing in the output of the 
original html2html. The data contains the word 'trouv&eacute;s', 
which becomes 'trouvs' in the HTML output.

My solution (the experts among you will have to decide whether it is all right): 

xml.dom.writer.HtmlWriter derives from xml.dom.writer.XmlWriter 
which has a method doText. The last line says 

self.stream.write(escape(data))

xml.utils.escape() escapes just the three characters mentioned above, 
but it can be called with an extra table of entities to convert. 
I modified XmlWriter a little: I added

self.escapes={}

to __init__()

and in doText the last line now is

self.stream.write(escape(data, self.escapes))
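
A minimal sketch of this change, written for a current Python. The 
escape() function and the SketchWriter class below are hypothetical 
stand-ins, not the real xml.utils.escape or XmlWriter; they only assume 
that escape() handles &, < and > first and then applies the optional 
table.

```python
import io


def escape(data, entities=None):
    """Stand-in for xml.utils.escape() (an assumption, not the real code).

    Always escapes &, < and >, then applies an optional
    {character: replacement} table.
    """
    data = data.replace("&", "&amp;")
    data = data.replace("<", "&lt;")
    data = data.replace(">", "&gt;")
    for char, replacement in (entities or {}).items():
        data = data.replace(char, replacement)
    return data


class SketchWriter:
    """Hypothetical miniature of XmlWriter, just enough to show the change."""

    def __init__(self, stream):
        self.stream = stream
        self.escapes = {}  # the added attribute: empty table by default

    def doText(self, data):
        # The modified last line: pass the table on to escape().
        self.stream.write(escape(data, self.escapes))


out = io.StringIO()
w = SketchWriter(out)
w.escapes = {"\xe9": "&eacute;"}  # e-acute maps back to its entity
w.doText("trouv\xe9s < 10")
result = out.getvalue()
```

Because the '&' replacement runs before the table is applied, the '&' 
in '&eacute;' is not escaped a second time.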

In html2html I now build an almost inverse version of 
htmlentitydefs.entitydefs that leaves out <, &, and > (my routine 
MakeEscapes()). The lines

w = HtmlWriter()
w.write(b.document)

became

w = HtmlWriter()
w.escapes = MakeEscapes()
w.write(b.document)
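
A routine like MakeEscapes() could look roughly as follows. This is a 
sketch, not the original code: it assumes a current Python, where 
htmlentitydefs has become html.entities, and it uses 
xml.sax.saxutils.escape from the standard library as a stand-in for 
xml.utils.escape.

```python
from html.entities import entitydefs
from xml.sax.saxutils import escape


def MakeEscapes():
    """Build a {character: '&name;'} table from entitydefs.

    This is the "almost inverse" mapping: amp, lt and gt are left out
    because escape() already takes care of those three.
    """
    skip = {"amp", "lt", "gt"}
    return {
        char: "&%s;" % name
        for name, char in entitydefs.items()
        if name not in skip
    }


escapes = MakeEscapes()
text = escape("trouv\xe9s & N\xe4chster", escapes)
```

Here text comes out as 'trouv&eacute;s &amp; N&auml;chster', so the 
accented characters are written as entities again instead of vanishing.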

It works, but not perfectly. In another text I had an image

<IMG ... ALT="N&auml;chster" ...>

which becomes

<IMG ... ALT="N&amp;auml;chster" ...>

I haven't found a solution to this problem yet :-(

Greetings

Dierk Hoeppner

Universitaetsbibliothek
Pockelsstr. 13
D-38106 Braunschweig
Germany
Tel: +49-531-391-5066 Fax: -5836
E-Mail: d.hoeppner@tu-bs.de