[XML-SIG] Handling of character entity references

Randall Nortman randall@wonderclown.com
Mon, 26 May 2003 10:14:58 -0500


On Sun, May 25, 2003 at 01:17:32PM -0400, Thomas B. Passin wrote:
> [<pyxml@wonderclown.com>]
> 
> > I am trying to produce XHTML files from input XML files which contain
> > a mixture of XHTML and custom markup.
> >...
> > I'm
> > having a problem, though, getting character entity references in the
> > source document to pass through to the output. Things like &amp;,
> > &lt;, and &gt; work fine, but &eacute; does not.
> >
> 
> This sounds like a fine job for XSLT, rather than custom code...
[...]

I have done similar projects using XSLT in the past, since that is
clearly the "right" tool for this job. However, I find XSLT's syntax
to be so abhorrent to good taste as to be nausea-inducing. I think its
processing model is really quite good for most applications, but I
cannot stomach actually reading and writing code in it. I have
considered designing my own syntax based on the same processing model
and a compiler to translate it to XSLT, but I haven't had the time.

So that's why I decided to try Python/PyXML for this project, to see
how it works. I have to say that I like it quite a lot, at least when
combined with the XPath module, which brings in the majority of the
power of XSLT without the nightmarish syntax. (XPath is the most
elegant and useful aspect of XSLT IMO.) Having the full power of a
real programming language (notably being able to use complex data
structures outside of the DOM itself) is also very handy when doing
more complex work (e.g., automatically building site navigation
menus). Python has also been faster than XSLT in my experience.


> 3) If you know the actual output encoding, put a meta element in the head of
> the document that specifies the correct encoding.  You can find out the
> syntax through Google.  This is not really an independent alternative as it
> is good practice anyway.
[...]

I was already specifying the encoding as utf-8 on the output, but I
customized PrintVisitor to remove the <?xml version='1.0'
encoding='utf-8'?> prolog because it was confusing some older
browsers.  So, per your suggestion, I added a <meta> in the <head>
section to specify the character set, which seems to work in most
browsers. However, I found that some browsers (notably w3m, a
text-only browser) do not support utf-8 without patches, and so I
switched to iso-8859-1, which seems to work just about anywhere.
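
To illustrate the switch, here is a minimal sketch of writing a page
as iso-8859-1 with a matching <meta> element. It assumes Python 3 and
plain string handling rather than PyXML's PrintVisitor; the function
name render_page and the sample text are hypothetical. Characters that
iso-8859-1 cannot represent are written as numeric character
references, so nothing is silently lost.

```python
from html import escape


def render_page(title, body_text, encoding="iso-8859-1"):
    """Return a small XHTML page as bytes in the given encoding."""
    page = (
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head>\n"
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={encoding}" />\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        f"<body><p>{escape(body_text)}</p></body>\n"
        "</html>\n"
    )
    # Anything outside the target encoding becomes a &#NNNN; reference.
    return page.encode(encoding, "xmlcharrefreplace")


# U+00E9 (e-acute) fits in iso-8859-1; U+20AC (euro sign) does not.
output = render_page("Caf\u00e9", "caf\u00e9 menu, 5 \u20ac")
```

The charset named in the <meta> element and the encoding passed to
encode() come from the same argument, so the two cannot drift apart.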

Or at least, that works for "&eacute;". "&nbsp;" is still not
working. When I use it in my source, the output just has nothing where
there should be a "&#160;". This is true whether I use iso-8859-1 or
utf-8. Any ideas on that?
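
For reference, a small sketch of what "&nbsp;" ought to turn into in
each encoding, assuming Python 3 and its standard html.entities table
(not the PyXML pipeline itself). It also shows that Python's string
methods treat U+00A0 as whitespace, so any step that strips or
normalizes whitespace would drop it. Whether that is what happens
here is only a guess, not something I have confirmed.

```python
from html.entities import name2codepoint

# &nbsp; is U+00A0, NO-BREAK SPACE.
nbsp = chr(name2codepoint["nbsp"])

# Expected bytes on output for the two encodings discussed above.
latin1_bytes = nbsp.encode("iso-8859-1")
utf8_bytes = nbsp.encode("utf-8")

# U+00A0 counts as whitespace for str methods, so these calls lose it.
counts_as_space = nbsp.isspace()
after_strip = ("word" + nbsp).strip()
after_split = ("a" + nbsp + "b").split()
```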


> This question comes up a lot.  Look at the various xslt FAQs and try Google
> for more discussion.  Look in the archive of this list, too.
[...]

I hate asking FAQs on mailing lists, so I'm sorry that I apparently
ended up doing just that. I tried to find the answer via Google
without luck, and alas the archives of this list are not searchable
online as far as I could find. I browsed through the subject headings
for the past couple of months but didn't see any discussion. (I
suppose I could have downloaded the 28 MB archive and grep'ed through
it myself.)


> > This brings in the XHTML Latin-1 entities, which seems to work well
> > enough to get the parser to accept the source, but then &eacute; gets
> > translated to the following two-byte sequence on output: 0xC3
> > 0xA9.
> 
> Well, that is interesting because it is the utf-8 encoding of the value E9,
> which is the latin-1 encoding of eacute.  However, the unicode character for
> &eacute; (you can see this in your DTD) is U+00C9, which would be encoded in
> utf-8 as C3 89.  Therefore your code is not decoding and encoding the input
> correctly.  You seem to be taking the sequence of bytes of a latin-1 source
> and encoding it into utf-8 as if the source were really in unicode instead
> of latin-1.
[...]

Are you sure that it should be C3 89 instead of C3 A9? The latter
seems to work, so long as I direct the browser to expect utf-8.
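
A quick way to check the byte values, as a sketch assuming Python 3
and its standard html.entities table (the helper name utf8_hex is
hypothetical):

```python
from html.entities import name2codepoint


def utf8_hex(entity_name):
    """Return the UTF-8 bytes of a named entity as spaced hex."""
    ch = chr(name2codepoint[entity_name])
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


lower_code = name2codepoint["eacute"]   # lowercase e with acute
upper_code = name2codepoint["Eacute"]   # uppercase E with acute
lower_utf8 = utf8_hex("eacute")
upper_utf8 = utf8_hex("Eacute")
```

By this table, C3 A9 is the utf-8 form of the lowercase character
(U+00E9), and C3 89 is the utf-8 form of the uppercase one (U+00C9).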

Thanks for your suggestions,

Randall Nortman