[Expat-discuss] Encoding lower 32 characters

Mon, 30 Apr 2001 16:07:38 -0400

I must be missing something about encoding the lower 32 non-whitespace
US-ASCII characters (the control characters below 0x20 other than tab, line
feed, and carriage return) in an XML file when using expat to read the file.

As I read the XML spec
(http://www.w3.org/TR/2000/REC-xml-20001006#charsets), an XML character is
any legal Unicode/UCS character, and the spec implies that the lower 32
non-whitespace characters are not legal Unicode characters. The spec gives
the following definition of a legal XML character:

      Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
                     | [#x10000-#x10FFFF]

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
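
To make the production concrete, here is a minimal Python sketch of the same
ranges. The function name is_xml_char is hypothetical and is not taken from
the spec or from expat; it only restates the Char production quoted above.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to the start of the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # after the surrogates, minus FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


# Code points below 0x20 that the production leaves out.
excluded = [cp for cp in range(0x20) if not is_xml_char(cp)]
print(len(excluded))  # 29: all 32 control codes except #x9, #xA, #xD
```

Under this reading, #x6 (the character in my example below) is among the
excluded code points.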

I don't have the Unicode spec handy, but I think Unicode (and by extension
UTF-8) is supposed to include all US-ASCII characters as a subset.  Is this
not so?
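
A quick way to check the subset claim, sketched in Python and relying only on
the standard codecs: decode all 128 US-ASCII byte values as UTF-8 and compare
the resulting code points with the byte values.

```python
# Every US-ASCII byte value, including the control codes below 0x20.
ascii_bytes = bytes(range(128))

# Decode the same bytes as UTF-8 and list the resulting code points.
code_points = [ord(ch) for ch in ascii_bytes.decode("utf-8")]

# Each byte decodes to the Unicode code point with the same number.
print(code_points == list(range(128)))  # True
```

So the control characters do exist in Unicode; the open question is only
whether XML admits them.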

Since I find it hard to believe that certain US-ASCII characters were
omitted from Unicode, my next guess is that the intent of the XML spec is to
say that those special characters are not valid in an XML file; that a valid
XML file should encode those characters using character references such as
"&#6;" so that they don't appear literally in the file.

I've tried this, but when I attempt to parse a file containing one of the
special character references using expat, it generates an error indicating
that the character code is illegal.  Is this error message correct, or is
this a bug/misfeature in expat? Is it a bug in the XML spec?  If it's
correct, how can I transmit application data that contains these characters?
Clearly I can create my own application-level escaping mechanism, but
doesn't this defeat the purpose of having an application-independent
standard like XML?
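
For reference, the behaviour can be reproduced with Python's standard
xml.parsers.expat binding, which wraps expat. The sketch below also shows one
possible application-level escaping, Base64, purely as an illustration; the
helper names (parses, wrap_payload, unwrap_payload) and the <payload> element
are hypothetical, and nothing here is prescribed by the spec or by expat.

```python
import base64
import xml.parsers.expat


def parses(xml_text: str) -> bool:
    """Return True if expat accepts xml_text as a complete document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


def wrap_payload(data: bytes) -> str:
    """Carry arbitrary bytes as Base64 text inside a hypothetical element."""
    encoded = base64.b64encode(data).decode("ascii")
    return f'<payload encoding="base64">{encoded}</payload>'


def unwrap_payload(xml_text: str) -> bytes:
    """Parse the element with expat and decode its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_text, True)
    return base64.b64decode("".join(chunks))


print(parses("<a>&#6;</a>"))   # False: the reference names a non-Char code point
print(parses("<a>&#9;</a>"))   # True: tab is allowed by the Char production
print(unwrap_payload(wrap_payload(b"\x06\x00data")) == b"\x06\x00data")  # True
```

This works only because the Base64 alphabet consists of ordinary printable
characters, which is exactly the kind of application-specific convention I
was hoping to avoid.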

Michael Wissner