writing Unicode objects to XML

Mon May 5 15:27:16 EDT 2003

Quoth Alex Martelli:
  [...]
> There is no way, in XML, to specify which characters will be encoded in the
> native encoding (e.g. '\xc3\xa8' in utf-8 in this case) and which ones will
> be encoded using character references instead.

A nit: whether this is true is a property of one's XML tools, not
a property of XML itself.  It is easy to imagine XML writers with
all sorts of policies about character encoding.  (See below.)

  [...]
> But, that is not the issue.  You *ARE* getting "the original XML", but
> you seem to labor under the false assumption that "the original XML"
> somehow implies (or at least implies given an encoding) "the same string
> of bytes".  It doesn't, of course.  There are MANY streams of bytes,
> even given an encoding, that could represent exactly the same XML.  Besides
> the issue of character references, think for example how ANY piece of text
> MIGHT indifferently be represented as CDATA... or MIGHT NOT, in a way that
> XML *defines* to be totally identical, indifferent, interchangeable.

Same nit as before.  An XML parser could provide that information
if desired.

It is indeed true that SAX (and DOM, I think) provides no way to
distinguish a numeric character entity reference from the
character as directly encoded.  But this is a property of these
APIs, not a property of XML proper.  (As was pointed out to me
here a little while back, DOM does tell you whether characters
occur in a CDATA section or not.)
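
To make the API point concrete, here is a small sketch using
Python's standard xml.sax and xml.dom.minidom modules.  The names
TextCollector and sax_text are made up for this illustration.  It
assumes the default expat-based parsers: SAX reports the same
character data for a directly encoded character and for a numeric
reference, while minidom keeps a CDATA section as a distinct node
type.

```python
import xml.sax
from xml.dom import minidom


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: collect all character data SAX reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so accumulate.
        self.chunks.append(content)


def sax_text(xml_bytes):
    """Return the concatenated character data of a small document."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


direct = "<p>\u00e8</p>".encode("utf-8")   # e-grave as the bytes C3 A8
by_reference = b"<p>&#232;</p>"            # same character, as a reference

# SAX gives no hint about which spelling the source used.
same_via_sax = sax_text(direct) == sax_text(by_reference)

# minidom, by contrast, keeps CDATA sections as their own node type.
cdata_doc = minidom.parseString(b"<p><![CDATA[x]]></p>")
plain_doc = minidom.parseString(b"<p>x</p>")
cdata_node_type = type(cdata_doc.documentElement.firstChild).__name__
plain_node_type = type(plain_doc.documentElement.firstChild).__name__
```

Nothing here says a parser *couldn't* report the difference; it
only shows that these two APIs, as shipped, don't.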

There are many kinds of equivalence between XML documents.  Since
XML is a serialization syntax, it is reasonable to speak of
byte-by-byte equivalence; one might wish to do so in the context
of digital signatures, for example (and equally one might not).
Since it is text, we may also speak of the weaker condition of
character-by-character equivalence, disregarding encoding.  Then
there are weaker equivalences, under which (e.g.) different
sequences of whitespace in certain places might be equivalent, or
entity references might be equivalent to "inlined" forms.  Then
there's equivalence in the sense "generates the same sequence of
SAX events", or "generates data structures which are
indistinguishable via DOM".  Etc., etc., until your head explodes.
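
A rough sketch of the first few levels, in Python.  The helper
names are invented for this example, and "same parsed text" via
xml.etree.ElementTree is only a stand-in for the API-level
equivalences mentioned above, not a definition of any of them.

```python
import xml.etree.ElementTree as ET

# Three serializations of one element containing U+00E8 (e-grave).
doc_utf8 = "<p>\u00e8</p>".encode("utf-8")
doc_utf16 = "<p>\u00e8</p>".encode("utf-16")   # includes a byte order mark
doc_charref = b"<p>&#232;</p>"


def same_bytes(a, b):
    """Strongest condition: identical byte streams."""
    return a == b


def same_characters(a, enc_a, b, enc_b):
    """Weaker: identical character sequences, disregarding encoding."""
    return a.decode(enc_a) == b.decode(enc_b)


def same_parsed_text(a, b):
    """Weaker still: the parser reports the same element text."""
    return ET.fromstring(a).text == ET.fromstring(b).text


# UTF-8 vs. UTF-16: different bytes, same characters.
utf_bytes_equal = same_bytes(doc_utf8, doc_utf16)
utf_chars_equal = same_characters(doc_utf8, "utf-8", doc_utf16, "utf-16")

# Direct character vs. reference: different characters, same parse.
ref_chars_equal = same_characters(doc_utf8, "utf-8", doc_charref, "ascii")
ref_parsed_equal = same_parsed_text(doc_utf8, doc_charref)
```

Which of these comparisons is the right one depends entirely on
the application; a digital signature would want the first, most
data processing the last.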

The XML recommendation itself does not give any special status to
any particular equivalence; in particular, it does not ever
require XML processors to discard information about the source
bytes.  (I'm not up on the XML Infoset stuff, but ultimately
that's just a specific kind of equivalence, which might or might
not be suitable for a given application.)

  [...]
> Maybe you can get away with something much simpler, such as, e.g., "even
> though the encoding chosen would be perfectly able to represent directly
> all Unicode characters, nevertheless, in order to satisfy a PHB who gives
> what he THINKS are XML-related specs but has never read one line of the
> XML standards, still we have to represent all characters outside of the
> ASCII range as character references" (or, "all characters whose Unicode
> code is even" -- just about as meaningful).

Not *quite* as meaningful, imho.

Consider writing XHTML.  Software which processes XML must (by
spec) support UTF-8, but need not support (for example)
ISO-8859-1.  So, for interoperability, you decide to encode in
UTF-8, and declare encoding='utf-8' in the XML declaration.  Now
consider software which understands (older) HTML but not XML; it
might well ignore the XML declaration, and might or might not
support UTF-8.  For interoperability with such software, you
decide not to encode directly any character outside US-ASCII;
instead you write such characters with numeric character entity
references or (better if possible, again for interoperability
across versions of HTML) named entities such as '&iuml;'.
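
A minimal sketch of such a writer policy in Python.  The function
ascii_only and its use_named_entities flag are hypothetical names
for this example; html.entities.codepoint2name and
xml.sax.saxutils.escape are standard library.  It assumes the text
is element content (not an attribute value), and that named
entities are only requested when the document's DTD (e.g. XHTML's)
actually declares them.

```python
from html.entities import codepoint2name
from xml.sax.saxutils import escape


def ascii_only(text, use_named_entities=False):
    """Escape markup characters, then write every non-ASCII character
    as a named entity (if requested and one exists) or as a numeric
    character reference."""
    out = []
    for ch in escape(text):          # handles &, < and > first
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif use_named_entities and code in codepoint2name:
            out.append("&%s;" % codepoint2name[code])
        else:
            out.append("&#%d;" % code)
    return "".join(out)


numeric = ascii_only("na\u00efve & co")
named = ascii_only("na\u00efve & co", use_named_entities=True)
```

For the purely numeric case the built-in codec error handler does
the same job on already-escaped text:
escape(text).encode("ascii", "xmlcharrefreplace").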

This seems to me a reasonable and practical approach, if you want
maximum interoperability.  I can't, on the other hand, think of a
scenario in which it would be reasonable and practical to treat
specially "all characters whose Unicode code is even".

  [...]
-- 
Steven Taschuk                               staschuk at telusplanet.net
"[T]rue greatness is when your name is like ampere, watt, and fourier
 -- when it's spelled with a lower case letter."      -- R.W. Hamming