writing Unicode objects to XML

Tue May 6 03:40:37 EDT 2003

Steven Taschuk wrote:

> Quoth Alex Martelli:
>   [...]
>> There is no way, in XML, to specify which characters will be encoded in
>> the native encoding (e.g. '\xc3\xa8' in utf-8 in this case) and which
>> ones will be encoded using character references instead.
> 
> A nit: whether this is true is a property of one's XML tools, not
> a property of XML itself.  It is easy to imagine XML writers with
> all sorts of policies about character encoding.  (See below.)

Nothing stops your "XML tools" from playing pleasant music as they
work, but that potential, while perhaps of some interest to you,
has *NOTHING* to do with their being *XML* tools.  I see Martin has
already pointed you to the definition of what information IS part
of XML (the "infoset") and what is totally accidental to you (any
that is NOT in the infoset), and you stubbornly refuse to accept
that the XML consortium can decide about that, and DID.  From this
I can only conclude that you're being deliberately stubborn and that
there is therefore little interest in trying to correct your errors.

But I hope that Alessio (the OP) and other readers can see the utter
absurdity of the stance you've chosen to take, and learn to use such
words as "XML" properly despite your attempts at sowing confusion.

>> Besides the issue of character references, think for example how ANY
>> piece of text MIGHT indifferently be represented as CDATA... or MIGHT
>> NOT, in a way that XML *defines* to be totally identical, indifferent,
>> interchangeable.
> 
> Same nit as before.  An XML parser could provide that information
> if desired.

An XML parser can provide all sorts of extra information as and when
it wishes: there is no prohibition against that in the XML standard
any more than, say, there is in the C++ standard regarding extra info
that a C++ compiler might supply.
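
To make the distinction concrete, here is a minimal sketch, assuming
Python's standard xml.etree.ElementTree and xml.parsers.expat modules
(the helper name lexical_events is made up for illustration).  The
tree built from a CDATA section is the same as the one built from
escaped text, while expat can, as an extra, report where a CDATA
section began and ended:

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

PLAIN = "<msg>a &lt; b</msg>"
CDATA = "<msg><![CDATA[a < b]]></msg>"

# The element content is the same whichever spelling was used.
plain_text = ET.fromstring(PLAIN).text
cdata_text = ET.fromstring(CDATA).text
same_text = plain_text == cdata_text


def lexical_events(doc):
    """Hypothetical helper: collect the CDATA boundaries expat reports."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartCdataSectionHandler = lambda: events.append("cdata-start")
    parser.EndCdataSectionHandler = lambda: events.append("cdata-end")
    parser.Parse(doc, True)
    return events


print(same_text)
print(lexical_events(PLAIN), lexical_events(CDATA))
```

That expat offers these handlers shows only that a tool may report
such details, not that the details are part of what the document says.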

The fact that an XML parser can, if it wishes, freely supply information 
on the current stock market evaluation of General Motors, on the weather
forecasts it culls from the net, and/or on OS-supplied information about
a file (such as owner, group, permission, last-accessed date, ...) cannot
be construed by any stretch of imagination to indicate that all of these
pieces of information are *XML* information.  They aren't -- the XML
consortium gets to define what is and isn't, and they've decided, right
or wrong as you may think that, that this information just isn't in XML.

Information that isn't XML may be precious to you: for example, on
learning that the weather is fine, that your stocks are way up, and
that the file you're supposed to be processing was written by a
well-known idjt, you MIGHT well decide to toss the whole thing and
go have a picnic with your best girl instead.  But this important
information is still not _XML_ information, and while "XML tools" MAY
supply (cannot be forbidden from supplying) weather forecasts, it's still
nevertheless absurd to claim this makes weather forecasts "XML
information", which is basically what you're doing here.

> There are many kinds of equivalence between XML documents.  Since

Only one, however, is *XML* equivalence.

> XML is a serialization syntax, it is reasonable to speak of
> byte-by-byte equivalence; one might wish to do so in the context
> of digital signatures, for example (and equally one might not).

But whether one did or not, one wouldn't be speaking of XML, but
of OTHER properties which may happen to hold for a file.  To
consider "byte by byte equivalence" part of XML is just as absurd
as so considering "owned according to the OS by the same user and
group", "last accessed according to the OS on the same date",
and so on.
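
A small sketch of the difference, assuming Python's standard
xml.etree.ElementTree (the sample documents are invented for
illustration): two files that differ byte by byte, one spelling the
character natively in utf-8 and one using a character reference,
carry the same content once parsed.

```python
import xml.etree.ElementTree as ET

DECL = b'<?xml version="1.0" encoding="utf-8"?>'

# Same character, two serializations: raw utf-8 bytes vs. a reference.
native = DECL + "<w>caff\u00e8</w>".encode("utf-8")
charref = DECL + b"<w>caff&#xE8;</w>"

bytes_differ = native != charref
same_content = ET.fromstring(native).text == ET.fromstring(charref).text

print(bytes_differ, same_content)
```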

> there's equivalence in the sense "generates the same sequence of
> SAX events", or "generates data structures which are
> indistinguishable via DOM".  Etc., etc., until your head explodes.

Speak for yourself: the equivalences it makes sense to consider
*in XML* are a tiny subset of all those you're proposing, and
thus fall well short of head-exploding properties.

> The XML recommendation itself does not give any special status to
> any particular equivalence; in particular, it does not ever
> require XML processors to discard information about the source
> bytes.  (I'm not up on the XML Infoset stuff, but ultimately
> that's just a specific kind of equivalence, which might or might
> not be suitable for a given application.)

XML does not forbid programs from supplying all kinds of information,
but that doesn't make that information part of XML.  And the infoset
does define exactly what information IS or ISN'T part of XML -- it's
"just a specific kind of equivalence" which happens to be THE official
definition of what information IS "XML" and what isn't.

>> Maybe you can get away with something much simpler, such as, e.g., "even
>> though the encoding chosen would be perfectly able to represent directly
>> all Unicode characters, nevertheless, in order to satisfy a PHB who gives
>> what he THINKS are XML-related specs but has never read one line of the
>> XML standards, still we have to represent all characters outside of the
>> ASCII range as character references" (or, "all characters whose Unicode
>> code is even" -- just about as meaningful).
> 
> Not *quite* as meaningful, imho.

Quite as meaningful *within XML*.
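
For what it's worth, satisfying such a spec is easy enough.  A
minimal sketch, assuming the text contains no markup characters such
as '&' or '<' (those would need escaping first) and using Python's
built-in 'xmlcharrefreplace' error handler; the wrap helper and the
sample string are invented for illustration:

```python
import xml.etree.ElementTree as ET

text = "perch\u00e8 no"

# Every character outside ASCII becomes a numeric character reference.
as_refs = text.encode("ascii", "xmlcharrefreplace")
# The same text written natively in utf-8.
as_utf8 = text.encode("utf-8")


def wrap(payload):
    """Hypothetical helper: put the payload inside a one-element document."""
    return b"<t>" + payload + b"</t>"


# Both serializations parse back to the very same text.
same = ET.fromstring(wrap(as_refs)).text == ET.fromstring(wrap(as_utf8)).text == text

print(as_refs, as_utf8, same)
```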

> Consider writing XHTML.  Software which processes XML must (by
> spec) support UTF-8, but need not support (for example)
> ISO-8859-1.  So, for interoperability, you decide to encode in
> UTF-8, and declare encoding='utf-8' in the XML declaration.  Now
> consider software which understands (older) HTML but not XML; it

It's impossible to believe that you don't notice that suddenly
you have extended the universe of discourse *OUTSIDE* of XML,
when you EXPLICITLY say you need to consider software which
does NOT handle XML as part of the specs.  Which is why I have
to reach the conclusion that you're just being deliberately stubborn
rather than sincerely believing you are making any sort of real
contribution to the discussion.

E.g., consider a desire to encode by steganography some occult
message in the textfile you're producing (while keeping the _XML_
meaning of that textfile fixed).  Then, one sensible choice might
be to say that information carriers are text characters whose
Unicode code is even (using those whose Unicode code is odd as
deadweight instead), having them represented as character refs
rather than native encodings in order to carry one bit each.  Of
course, we're WAY outside of dealing with "XML" here -- just as
we are when we consider OTHER programs that don't deal with XML
instead of our hypothetical steganography decoder.
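
A toy sketch of that scheme, under loudly stated assumptions: the
names hide, reveal and is_carrier are invented here, a carrier is any
character with an even Unicode code other than '&', '<' and '>', a
carrier written as a character reference means 1 and one written
natively means 0, and the decoder is told how many bits to expect.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

MARKUP = "&<>"  # always escaped, never used to carry a bit


def is_carrier(ch):
    """Even-coded characters carry one bit; odd ones are deadweight."""
    return ord(ch) % 2 == 0 and ch not in MARKUP


def hide(text, bits):
    """Serialize text as <t>...</t>, hiding bits in the choice of spelling."""
    pending = list(bits)
    out = []
    for ch in text:
        if is_carrier(ch) and pending:
            bit = pending.pop(0)
            out.append("&#%d;" % ord(ch) if bit else ch)
        else:
            out.append(escape(ch))
    if pending:
        raise ValueError("not enough carrier characters")
    return "<t>" + "".join(out) + "</t>"


# A numeric reference, a named entity, or any single character.
TOKEN = re.compile(r"&#(\d+);|&\w+;|.", re.DOTALL)


def reveal(serialized, n_bits):
    """Read the bits back from the raw serialization (not from the parse)."""
    body = serialized[len("<t>"):-len("</t>")]
    bits = []
    for match in TOKEN.finditer(body):
        if len(bits) == n_bits:
            break
        token = match.group(0)
        if match.group(1) is not None:
            bits.append(1)          # carrier written as a reference
        elif len(token) == 1 and is_carrier(token):
            bits.append(0)          # carrier written natively
    return bits


secret = [1, 0, 1, 1, 0, 0, 1, 0]
doc = hide("hello, world", secret)
print(doc)
print(ET.fromstring(doc).text, reveal(doc, len(secret)))
```

The parser hands back "hello, world" whatever the secret was; only a
program that looks at the raw characters can tell the difference.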

Alex