[XML-SIG] Handling of character entity references

Mon, 26 May 2003 02:12:24 -0400

On Sun, May 25, 2003 at 05:42:11PM -0600, Mike Brown wrote:
> From: Mike Brown <mike@skew.org>
> Subject: Re: [XML-SIG] Handling of character entity references
> To: Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp>
> Cc: xml-sig@python.org
> Date: Sun, 25 May 2003 17:42:11 -0600 (MDT)
> 
> Tamito KAJIYAMA wrote:
> > "Thomas B. Passin" <tpassin@comcast.net> writes:
> > |
> > | [<pyxml@wonderclown.com>]
> > | 
> > | > I am trying to produce XHTML files from input XML files which contain
> > | > a mixture of XHTML and custom markup.
> > | >...
> > | > I'm
> > | > having a problem, though, getting character entity references in the
> > | > source document to pass through to the output. Things like &amp;,
> > | > &lt;, and &gt; work fine, but &eacute; does not.
> > | >
> > | 
> > | This sounds like a fine job for XSLT, rather than custom code...
> > 
> > I had a problem similar to Randall's a few years ago, so
> > I'd like to describe my problem and a solution to it (just FYI:
> > I totally agree with the suggestion about XSLT).
> > 
> > I've used a SAX-based Python script for years to convert a set
> > of XML files into an HTML file.  The file encodings of the input
> > and output files are EUC-JP and ISO-2022-JP, respectively.
> > I also had a need to use Latin-1 characters in the input and
> > output files.  However, because of the Japanese file encodings,
> > raw character codes (say, 0xe9 in ISO-8859-1 for &eacute;) were
> > not acceptable.  Therefore, I needed a way to represent Latin-1
> > characters in the input XML files and to produce character
> > references in the output HTML file.
> 
> This wouldn't be needed today, since Python is now Unicode-friendly. You have
> Unicode strings being passed to your SAX methods, and on the output side, the
> EUC-JP or ISO-2022-JP codec used by the XML serializer will convert to bytes
> all the characters supported by those encodings. The non-ASCII range of
> ISO-8859-1 (\u00A0-\u00FF) would not be handled by the codecs, but the XML
> serializer will simply deal with that by emitting numeric character references
> automatically.
> 
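
A minimal sketch of the fallback described above, assuming a current
Python 3 interpreter rather than the 2003 tools discussed in this thread.
The helper name encode_with_charrefs is made up for illustration; the
"xmlcharrefreplace" error handler is a standard codec option, and an XML
serializer may or may not use it internally.

```python
def encode_with_charrefs(text, encoding):
    """Encode text; characters the codec cannot represent become
    numeric character references such as &#233;."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "caf\u00e9"  # "cafe" with e-acute (U+00E9)

# ASCII cannot represent U+00E9, so a reference is emitted instead.
ascii_bytes = encode_with_charrefs(sample, "ascii")
print(ascii_bytes)  # b'caf&#233;'

# Same call with a Japanese codec; whether a given character falls back
# to a reference depends on the character sets that codec covers.
jis_bytes = encode_with_charrefs(sample, "iso2022_jp")
print(jis_bytes)
```

Which characters get replaced depends on the codec, so it is worth
checking the actual target encoding rather than assuming.
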
> > So, I decided to use a special markup to represent Latin-1
> > characters in the input XML files, as illustrated below:
> > 
> > <char name="eacute" />
> 
> Similar project:
> http://xmlchar.sourceforge.net/
> 
> Seems like it's past the point of diminishing returns, to me.

Interestingly enough, I saw the following emails on an XSL mailing
list covering this very topic:

Paul

*****

> How can you make special characters (like &eacute;) appear in a text
> output file in XSL encoding?

I thought I'd take this opportunity to describe one of the new
features in XSLT 2.0 -- the ability to map characters in text nodes
and attribute values during output onto arbitrary strings. This is
done through a "character map".

To say that é should be output as &eacute;, for example, you can
create a character map as follows:

<xsl:character-map name="latin-1">
  <xsl:output-character character="&#xE9;" string="&amp;eacute;" />
</xsl:character-map>

and reference this character map from your output definition:

<xsl:output use-character-maps="latin-1" />

When the result tree is output, every occurrence of é, in text or in
attribute values, will be replaced by the string &eacute;.

Note that this will work for é characters that get into the output
from being part of the source document as well as the é characters
that you use in your stylesheet. It also does the replacement in
attribute values as well as text nodes. In both ways it's more
powerful than disable-output-escaping (d-o-e).

Of course that might mean that your output is not well-formed, because
there's no guarantee that the output has an entity declaration for the
&eacute; entity, so you should usually specify a doctype-system so
that the output includes a DOCTYPE declaration that contains the
relevant entity declaration.
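
For readers on the Python side, the same idea can be roughly imitated
outside XSLT. This is only a sketch, not how any XSLT processor
implements character maps: the names LATIN1_MAP and apply_character_map
are invented for illustration, and the entity names come from the
standard html.entities module. It does not escape markup characters
such as & or <, which a real serializer would handle.

```python
from html.entities import codepoint2name

# Hypothetical character map: each Latin-1 character in U+00A0..U+00FF
# is mapped to its named entity reference, e.g. U+00E9 -> "&eacute;".
LATIN1_MAP = {
    chr(codepoint): "&%s;" % name
    for codepoint, name in codepoint2name.items()
    if 0xA0 <= codepoint <= 0xFF
}


def apply_character_map(text, char_map):
    """Replace every mapped character in text with its mapped string."""
    return "".join(char_map.get(ch, ch) for ch in text)


print(apply_character_map("caf\u00e9", LATIN1_MAP))  # caf&eacute;
```

As with the character map above, the result is only well-formed XML if
the entities it names are declared in the output's DTD.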

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

>I thought I'd take this opportunity to describe one of the new
>features in XSLT 2.0 -- the ability to map characters in text nodes
>and attribute values during output onto arbitrary strings. This is
>done through a "character map".

Wow! That's something we in the document world have been hacking in one
way or another for ages. People who aren't in our world often tell us
we don't need it ("just do everything in Unicode" they say, but that
is often impractical).

This is wonderful. (I haven't been following XSLT 2.0 in detail, being
under the impression that it was totally focused on data and we
document types were going to be able to ignore it.  Perhaps I'd
better start reading.)

-- Tommie

-- 
************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************