[XML-SIG] Handling of character entity references

Sun, 25 May 2003 13:17:32 -0400

[<pyxml@wonderclown.com>]

> I am trying to produce XHTML files from input XML files which contain
> a mixture of XHTML and custom markup.
>...
> I'm
> having a problem, though, getting character entity references in the
> source document to pass through to the output. Things like &amp;,
> &lt;, and &gt; work fine, but &eacute; does not.
>

This sounds like a fine job for XSLT, rather than custom code...

You need to realize that by the time the parser has done its job, there is
no such thing as "&eacute;" any more.  All entity and character references
will have been replaced by their actual characters.  This is how XML is
supposed to work - it is spelled out in the XML Recommendation.
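
To see this for yourself, here is a minimal sketch using the standard
library's ElementTree in present-day Python 3 (my choice for illustration,
not the original poster's code; the one-entity DTD is made up so the example
does not depend on the XHTML entity files).  After parsing, the text holds
the character itself and no trace of the entity reference remains.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: the internal DTD subset declares a single
# entity so the example does not need to fetch the XHTML entity files.
SOURCE = """<!DOCTYPE p [
  <!ENTITY eacute "&#233;">
]>
<p>caf&eacute;</p>"""

root = ET.fromstring(SOURCE)
text = root.text

# The parser has already replaced the entity reference with U+00E9,
# so there is no "&" left in the text it hands to the application.
print(repr(text))
print("&" in text)
```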

When it comes time to output the results, the serializer has no way to check
back to see that some particular character originally was written as some
particular entity.  What comes out is up to the serializer.  It also depends
on the output encoding, since the serializer cannot write a character
directly if it does not exist in the chosen encoding.
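
As an illustration of how the output encoding drives what the serializer
writes, here is a small sketch with ElementTree in Python 3 (an assumption on
my part; other serializers may make different choices).  The same tree is
written three times.  Note that this particular serializer falls back to a
numeric character reference, not to a named entity, when the encoding cannot
hold the character.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("p")
elem.text = "caf\u00e9"  # the parsed character, U+00E9

# Same tree, three output encodings; each result is a bytes object.
as_ascii = ET.tostring(elem, encoding="us-ascii")
as_latin1 = ET.tostring(elem, encoding="iso-8859-1")
as_utf8 = ET.tostring(elem, encoding="utf-8")

for label, data in [("us-ascii", as_ascii),
                    ("iso-8859-1", as_latin1),
                    ("utf-8", as_utf8)]:
    print(label, data)
```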

The question is, why do you care whether or not an "&eacute;" entity is
used?  A browser will not care.  There is no reason to go to any trouble to
get those entities inserted.  The real reason you might think you need them
is that your output is in a different encoding from what the browser
expects.  If you are getting a two-byte output for that character, then you
are probably putting out utf-8, whereas most browsers are set for either
latin-1 (iso-8859-1) or to some Windows encoding.

Also, even if the right character is there in the right encoding, if you
read it with some editors and other applications, it may __look__ wrong
because they think the encoding is different from utf-8. A browser would
display it correctly as long as it knew what the encoding was.

If you are going to produce html for a browser to display, then you have
several choices:

1) Use xslt, specify the html output method, and let the serializer handle
the issue.  It will put out entities and character references where needed.

2) Change the encoding of your output to something your browser expects.
This is not a good solution, though, because you cannot know what encoding a
random browser is set up for.

3) If you know the actual output encoding, put a meta element in the head of
the document that specifies the correct encoding.  You can find out the
syntax through Google.  This is not really an independent alternative as it
is good practice anyway.

4) Write your own serializer that converts specific characters into entities
or character references.  This is the least desirable because it is more work
and more prone to errors.  It is also not really necessary, as you can see
from the above.
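
For what it is worth, the character-conversion step in choice 4 could look
something like the sketch below.  It is written for present-day Python 3 and
the function name to_entities is made up for the example.  It uses the
standard library's table of HTML entity names and falls back to a numeric
character reference when a character has no name.  It does not escape markup
characters such as "&" or "<"; that would still be the job of the rest of the
serializer.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace each non-ASCII character with an entity or character reference."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # Plain ASCII passes through untouched.
            pieces.append(ch)
        elif code in codepoint2name:
            # A named HTML entity exists, e.g. 233 -> "eacute".
            pieces.append("&%s;" % codepoint2name[code])
        else:
            # No name available: use a decimal character reference.
            pieces.append("&#%d;" % code)
    return "".join(pieces)


print(to_entities("caf\u00e9"))
print(to_entities("\u4e2d"))
```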

Really, the only time it is necessary (or useful) to preserve entities like
eacute is for editing applications, and that is a very specialized area.

This question comes up a lot.  Look at the various xslt FAQs and try Google
for more discussion.  Look in the archive of this list, too.

> ... I have tried adding the following to my
> source document:
>
> <!DOCTYPE gallery [
>     <!ENTITY % HTMLlat1 PUBLIC
>        "-//W3C//ENTITIES Latin 1 for XHTML//EN"
>        "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
>     %HTMLlat1;
> ]>
>
> This brings in the XHTML Latin-1 entities, which seems to work well
> enough to get the parser to accept the source, but then &eacute; gets
> translated to the following two-byte sequence on output: 0xC3
> 0xA9.

Well, that is exactly what it should be.  The unicode character for &eacute;
(you can see this in your DTD) is U+00E9, and U+00E9 is encoded in utf-8 as
the two bytes C3 A9.  (E9 is also the single-byte latin-1 encoding of eacute,
because the first 256 unicode code points coincide with latin-1.  U+00C9,
utf-8 C3 89, is the capital letter, Eacute.)  So the entity is being decoded
and the output encoded correctly; the output is simply in utf-8.  If it looks
wrong, it is because whatever displays it is reading those utf-8 bytes as if
they were latin-1, which turns them into two characters.
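
The byte values are easy to check from Python rather than by hand.  A small
sketch (Python 3, standard library only; the variable names are mine) that
prints the code point, its unicode name, and its latin-1 and utf-8 bytes for
both the small and the capital letter, and then shows what happens when utf-8
bytes are read as latin-1:

```python
import unicodedata

small = "\u00e9"
capital = "\u00c9"

for ch in (small, capital):
    print("U+%04X" % ord(ch),
          unicodedata.name(ch),
          "latin-1:", ch.encode("latin-1").hex(),
          "utf-8:", ch.encode("utf-8").hex())

# Reading the utf-8 bytes as if they were latin-1 gives two characters
# instead of one, which is the usual "garbage" seen in a viewer that
# assumes the wrong encoding.
misread = small.encode("utf-8").decode("latin-1")
print(repr(misread))
```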

This is another reason to use xslt for this job.  Then your encoding issues
will get handled correctly.  If you still want to do it with your own code,
then you have to make sure that

1) you are correctly informing the parser which encoding is in use, and
2) any character strings that you generate yourself are generated with the
right encoding.  Or, to be more precise, make sure that all the code agrees
that it is using the Python internal unicode representation, and encodes to
bytes only when writing the output.  If you are using SAX, check your SAX
handler, which might be using byte strings with the default encoding instead
of unicode strings.
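
A rough sketch of both points, again assuming ElementTree and Python 3 (the
sample bytes are invented for the example).  The encoding declaration is how
the parser is told which encoding the source uses; once parsed, the text is a
unicode string and is encoded only on the way out.

```python
import xml.etree.ElementTree as ET

# The same latin-1 bytes, with and without an encoding declaration.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>'
undeclared = b'<p>caf\xe9</p>'

# With the declaration, the parser decodes E9 to U+00E9.
text = ET.fromstring(declared).text
print(repr(text))

# Without a declaration the parser assumes utf-8, and a lone E9 byte
# is not valid utf-8, so the parse fails.
try:
    ET.fromstring(undeclared)
    failed = False
except ET.ParseError as exc:
    failed = True
    print("parse error:", exc)

# Work with the unicode string internally; encode only when writing.
output_bytes = text.encode("utf-8")
print(output_bytes)
```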

Here is a useful resource about utf-8 and unicode -

http://www.cl.cam.ac.uk/~mgk25/unicode.html

> Curiously enough, I have also tried to output what the parser is
> giving me by printing the nodeValue of the text node containing this
> entity, and I get an exception:
>
>   File "./Gallery.py", line 39, in generateContent
>     print child.nodeValue
> UnicodeError: ASCII encoding error: ordinal not in range(128)
>

The print statement has to turn the unicode string into bytes, and it does
so with the platform default encoding (ASCII here), which cannot represent
eacute.  You can avoid this by encoding the string yourself, or by using a
writer from the codecs module (part of the standard library).  Read the
module docs to see how.
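
As a starting point, here is a minimal sketch of the writer approach.  It is
written for present-day Python 3 and uses an in-memory byte stream as a
stand-in for the real output; both of those are my assumptions, so adjust for
your own setup.

```python
import codecs
import io

text = "caf\u00e9"

# Encoding to ASCII is what fails behind the scenes.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Wrap a byte stream in a utf-8 writer: unicode goes in, encoded bytes
# come out.  The BytesIO object stands in for a file or for stdout.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write(text + "\n")

print(ascii_ok, raw.getvalue())
```

In a real script the same writer would presumably wrap the actual output
stream instead of the BytesIO object.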

Cheers,

Tom P