[XML-SIG] entity munching monster tracked down!
Dierk Höppner
D.Hoeppner@tu-bs.de
Fri, 13 Aug 1999 08:53:26 +0200
Dear SIGgers,
while playing around with the xml package I sent an ordinary HTML
file through a slightly modified xml/demo/dom/html2html.py. The
output was HTML, too. Almost, that is: apart from '<', '&' and '>',
all entities vanished :-(( You can see it in the output of the
original html2html. The data contains the word 'trouvés',
which in the HTML output becomes 'trouvs'.
My solution (the experts among you will have to decide whether it is all right):
xml.dom.writer.HtmlWriter derives from xml.dom.writer.XmlWriter
which has a method doText. The last line says
self.stream.write(escape(data))
xml.utils.escape() only escapes the three characters mentioned above,
but it can be called with an extra table of entities to be converted.
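A minimal sketch of that idea, assuming nothing about the real
xml.utils.escape() beyond what is described above: the three markup
characters are always escaped, and an optional table maps further
characters to entity references. The name escape_text and its signature
are hypothetical, and the code is written for present-day Python 3.

```python
def escape_text(data, extra=None):
    """Escape '&', '<' and '>', then apply an optional extra table.

    `extra` maps a single character to its replacement string,
    e.g. {"\u00e9": "&eacute;"}. Hypothetical helper, not the
    escape() function shipped with the xml package.
    """
    # '&' is handled first so that the ampersands introduced by the
    # later replacements are not escaped a second time.
    data = data.replace("&", "&amp;")
    data = data.replace("<", "&lt;")
    data = data.replace(">", "&gt;")
    # The extra table must not contain '&', '<' or '>' themselves.
    for char, replacement in (extra or {}).items():
        data = data.replace(char, replacement)
    return data
```

Without a table the accented character passes through unchanged; with
{"\u00e9": "&eacute;"} it is written as an entity reference.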
I modified XmlWriter a little: I added
self.escapes={}
to __init__()
and in doText the last line now is
self.stream.write(escape(data, self.escapes))
In html2html I now build an almost inverse version of
htmlentitydefs.entitydefs, leaving out <, &, and > (my routine
MakeEscapes()). The lines
w = HtmlWriter()
w.write(b.document)
became
w = HtmlWriter()
w.escapes = MakeEscapes()
w.write(b.document)
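MakeEscapes() itself is not shown here. The following is only a guess at
how such a table could be built, not the routine actually used. It
assumes present-day Python 3, where html.entities takes the place of
htmlentitydefs; the name make_escapes is hypothetical.

```python
from html.entities import name2codepoint


def make_escapes():
    """Return a {character: "&name;"} table for the HTML entities.

    '<', '&' and '>' are left out because the basic escaping
    already takes care of them.
    """
    skip = {"lt", "amp", "gt"}
    table = {}
    for name, codepoint in sorted(name2codepoint.items()):
        if name in skip:
            continue
        # setdefault keeps the first name should two names ever
        # refer to the same character.
        table.setdefault(chr(codepoint), "&%s;" % name)
    return table


escapes = make_escapes()
```

The result maps each character to its entity reference, for example
"\u00e9" to "&eacute;", which is the direction a writer needs on output.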
It works, but not perfectly. In another text I had an image
<IMG ... ALT="Nächster" ...>
which becomes
<IMG ... ALT="N&auml;chster" ...>
I haven't found a solution for this problem yet :-(
Greetings
Dierk Hoeppner
Universitaetsbibliothek
Pockelsstr. 13
D-38106 Braunschweig
Germany
Tel: +49-531-391-5066 Fax: -5836
E-Mail: d.hoeppner@tu-bs.de