[XML-SIG] losing entities when parsing then texting

Greg Wilson gvwilson at cs.utoronto.ca
Thu Jun 30 18:19:14 CEST 2005


This one must have come up several times before, but neither Google nor 
the Cookbook has given me an answer.  I'm doing this:

import sys
import xml.dom.minidom

data = sys.stdin.read()
doc = xml.dom.minidom.parseString(data)
root = doc.documentElement
# ...add and modify some nodes...
sys.stdout.write(root.toxml('utf-8'))

A typical input looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE lec SYSTEM "swc.dtd">
<lec title="Introduction">
   <topic title="Motivation" summary="motivation for course">
     <slide>
       <b1>blah
         <b2>blah &amp; blah</b2>
         <b2>blah&emdash;blah</b2>
       </b1>
     </slide>
   </topic>
</lec>

and my DTD, in its entirety, is:

<!ENTITY emdash "&#x2014;">	<!-- em dash -->
<!ENTITY lceil  "&#x2308;">	<!-- left ceiling -->
<!ENTITY ldots  "&#x2026;">	<!-- horizontal ellipsis -->
<!ENTITY lfloor "&#x230A;">	<!-- left floor -->
<!ENTITY lquot  "&#x201C;">	<!-- left double quotes -->
<!ENTITY plusmn "&#x00B1;">	<!-- plus or minus -->
<!ENTITY nbsp   "&#x00A0;">	<!-- non-breaking space -->
<!ENTITY rceil  "&#x2309;">	<!-- right ceiling -->
<!ENTITY rfloor "&#x230B;">	<!-- right floor -->
<!ENTITY rquot  "&#x201D;">	<!-- right double quotes -->
<!ENTITY space  "&#x0020;">	<!-- normal space -->
<!ENTITY squot  "&#x0022;">	<!-- straight double quotes -->
<!ENTITY times  "&#x00D7;">	<!-- multiplication sign -->
<!ENTITY vdots  "&#x22EE;">	<!-- vertical ellipsis -->

Problem is, all of the character entities are missing from my output: 
&amp; and &emdash; disappear.  Hunting around the web, it appears that 
I'm supposed to mess with ExternalEntityRefHandler, but I can't find any 
examples of how the pieces fit together.  If anyone has one, I'd be 
grateful for a pointer...

Thanks,
Greg (gvwilson _a_t_ cs _dot_ utoronto _dot_ ca)
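
For reference, here is a minimal sketch of how ExternalEntityRefHandler
fits together with the rest of the expat API.  It uses xml.parsers.expat
directly rather than minidom, and it only collects character data instead
of building a DOM.  The names DTD_SOURCES, DOC and parse_text are
hypothetical, the DTD is cut down to a single entity (U+2014 is the em
dash code point), and the input is a trimmed copy of the sample above.
Treat it as an illustration under those assumptions, not a drop-in fix.

```python
import xml.parsers.expat as expat

# Hypothetical stand-ins: a one-entity DTD and a trimmed copy of the input.
DTD_SOURCES = {"swc.dtd": b'<!ENTITY emdash "&#x2014;">'}
DOC = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
       b'<!DOCTYPE lec SYSTEM "swc.dtd">\n'
       b'<lec><b2>blah&emdash;blah</b2></lec>')


def parse_text(doc, dtd_sources):
    """Return the character data of doc with DTD-defined entities expanded."""
    chunks = []
    parser = expat.ParserCreate()
    # Without this, expat never asks for the external DTD subset.
    parser.SetParamEntityParsing(
        expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)

    def external_entity_ref(context, base, system_id, public_id):
        dtd = dtd_sources.get(system_id)
        if dtd is None:
            return 0  # tells expat the entity could not be handled
        # Parse the DTD in a sub-parser tied to the parent parser, so the
        # entity declarations become visible to the main document.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(dtd, True)
        return 1

    parser.ExternalEntityRefHandler = external_entity_ref
    parser.CharacterDataHandler = chunks.append
    parser.Parse(doc, True)
    return "".join(chunks)


text = parse_text(DOC, DTD_SOURCES)
```

Note that the entity is expanded to the actual character during parsing,
so anything serialized afterwards contains the em dash itself rather than
the original &emdash; reference.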


