[XML-SIG] Handling of character entity references

pyxml@wonderclown.com pyxml@wonderclown.com
Sun, 25 May 2003 09:47:24 -0500


I am trying to produce XHTML files from input XML files which contain
a mixture of XHTML and custom markup. (I am essentially building a
template system, where the content is written devoid of any page
layout, and I then use python/PyXML to parse the content, add in
banners and navigation menus and such, and write out XHTML.) I'm
having a problem, though, getting character entity references in the
source document to pass through to the output. Things like &amp;,
&lt;, and &gt; work fine, but &eacute; does not.

I do not have a complete DTD for my custom markup, as I don't
particularly care to validate it. However, the parser seems unwilling
to leave entities alone, so I have tried adding the following to my
source document:

<!DOCTYPE gallery [
    <!ENTITY % HTMLlat1 PUBLIC
       "-//W3C//ENTITIES Latin 1 for XHTML//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
    %HTMLlat1;
]>

This brings in the XHTML Latin-1 entities, which seems to work well
enough to get the parser to accept the source, but then &eacute; gets
translated to the following two-byte sequence on output: 0xC3
0xA9. Curiously enough, I have also tried to output what the parser is
giving me by printing the nodeValue of the text node containing this
entity, and I get an exception:

  File "./Gallery.py", line 39, in generateContent
    print child.nodeValue
UnicodeError: ASCII encoding error: ordinal not in range(128)

I'm not sure what to make of that; my knowledge of how Python handles
Unicode is limited.
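
For reference, both symptoms above are consistent with the parser
handing back the single Unicode character U+00E9: 0xC3 0xA9 is that
character encoded as UTF-8, and the print fails when the character is
implicitly encoded as ASCII. A minimal sketch of this, assuming
current Python 3 syntax rather than the Python 2 of the traceback (the
variable names are illustrative only):

```python
# U+00E9 is the character that &eacute; stands for.
text = "\u00e9"

# Encoding it as UTF-8 yields the two bytes seen in the output file.
utf8_bytes = text.encode("utf-8")

# Encoding it as ASCII cannot succeed, because the code point is
# above 127. Python 3 raises UnicodeEncodeError, which is a subclass
# of the UnicodeError shown in the traceback.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes.hex(), ascii_ok)
```

So the two-byte sequence is probably not corruption; it is the same
character written in an encoding that needs two bytes for it.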

So essentially what I'm asking is how do I get PyXML to preserve
"&eacute;" as-is and output it in the same manner when I PrettyPrint()
it? (Or, equivalently, convert it to its Unicode representation on
input and back to an entity reference on output.)
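
The second option (Unicode on input, entity reference on output) can
be sketched roughly as below. This is an assumption-laden
illustration, not PyXML's own mechanism: it uses the standard
library's xml.dom.minidom and html.entities in Python 3, declares
eacute inline instead of fetching xhtml-lat1.ent, and to_entity_refs
is a hypothetical helper name.

```python
from html.entities import codepoint2name
from xml.dom import minidom


def to_entity_refs(serialized):
    """Replace each non-ASCII character in already-serialized markup
    with a named reference if HTML defines one, else a numeric one."""
    pieces = []
    for ch in serialized:
        cp = ord(ch)
        if cp < 128:
            pieces.append(ch)
        elif cp in codepoint2name:
            pieces.append("&%s;" % codepoint2name[cp])
        else:
            pieces.append("&#%d;" % cp)
    return "".join(pieces)


# Stand-in for the real source document: the entity is declared in the
# internal subset so that no network access is needed.
SOURCE = ('<!DOCTYPE gallery [<!ENTITY eacute "&#233;">]>'
          '<gallery>caf&eacute;</gallery>')

doc = minidom.parseString(SOURCE)             # parser expands &eacute;
serialized = doc.documentElement.toxml()      # text now holds U+00E9
output = to_entity_refs(serialized)           # back to a reference
print(output)
```

The conversion is applied to the serialized string, not to the text
nodes, since a serializer would escape any "&" placed in a text node.
Named references in the result are only valid if the output document's
DTD declares them, as the XHTML DTDs do.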

Thanks for any help,

Randall Nortman