[XML-SIG] Status of XML 1.1 processing in Python?

Wed Aug 24 13:27:04 CEST 2005

Many thanks to Fredrik Lundh, Fred Drake and Daniel Veillard
for information on the status of XML 1.1 processing in Python. 
I'll do my best to do some testing and report back.

    Why I need XML 1.1 characters

In case anyone is interested, my goal is to facilitate
the definition of new Unicode input methods for Mac OS X.
Apple already supplies a very human-UNfriendly XML
language for defining new input methods.  I have defined a
new human-friendly XML language and need to convert my
human-friendly XML files automatically to Apple's human-
UNfriendly XML.

The basic idea of input methods is that they
intercept incoming key events, or sequences of key events, and
map them into Unicode-character outputs that are sent to
the destination, e.g. to the buffer of a Unicode text editor.  Some of these
Unicode output characters are control characters that are
invalid in XML 1.0 but valid in XML 1.1.  (I.e. when you
press appropriate "control" keys on your keyboard, the output
to the application is naturally a "control character".)
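
For concreteness, here is a minimal Python sketch of the difference between
the two character sets.  The ranges are taken from the Char productions of
the XML 1.0 and XML 1.1 Recommendations; the function and constant names are
just illustrative choices.  Note that XML 1.1 allows most control characters
only when they are written as character references (e.g. &#x1B;), not as
literal characters, and U+0000 is excluded in both versions.

```python
# Code point ranges from the Char production of each specification.
XML10_CHAR_RANGES = [
    (0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
    (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]
XML11_CHAR_RANGES = [
    (0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]
# XML 1.1 RestrictedChar: legal only as a character reference.
XML11_RESTRICTED_RANGES = [
    (0x1, 0x8), (0xB, 0xC), (0xE, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
]


def _in_ranges(code_point, ranges):
    """Return True if code_point falls inside any (low, high) range."""
    return any(low <= code_point <= high for low, high in ranges)


def is_xml10_char(code_point):
    """True if the code point may appear in an XML 1.0 document."""
    return _in_ranges(code_point, XML10_CHAR_RANGES)


def is_xml11_char(code_point):
    """True if the code point may appear in an XML 1.1 document."""
    return _in_ranges(code_point, XML11_CHAR_RANGES)


def needs_reference_in_xml11(code_point):
    """True if XML 1.1 permits the code point only as a character reference."""
    return _in_ranges(code_point, XML11_RESTRICTED_RANGES)


if __name__ == "__main__":
    # ESC (U+001B), a typical output of a "control" key.
    print(is_xml10_char(0x1B))             # False
    print(is_xml11_char(0x1B))             # True
    print(needs_reference_in_xml11(0x1B))  # True
```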

If you define a new OS X input method in Apple's current
XML format, the XML file contains control characters that
are valid only in XML 1.1.  The underlying (mystery) Apple
parser that processes that XML file does _not_ choke on the
control characters,
so this processor is assuming the XML 1.1 character set,
even if the XML file is overtly marked version="1.0".   That's
a no-no, of course; if the file is marked version="1.0", then
any kosher XML processor should refuse to parse/process
the file if it contains control characters not valid in XML 1.0.

My human-friendly XML language is defined in Relax NG,
and when I specify version="1.1", the files validate as they
should using Jing.  (If I change the attribute to version="1.0", then
Jing properly refuses to validate the files because of the invalid
control characters.)  So far so good.

But then when I try to write a Python script to parse
the human-friendly XML language and convert it (very non-trivially)
to the human-unfriendly XML language defined by Apple,
the Python script (if limited to XML 1.0 processing) chokes
as soon as it sees the offending control characters.  Sigh.
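
The failure can be reproduced with a few lines of Python.  This is only a
sketch: it uses xml.etree.ElementTree from the standard library of current
Python versions, which is built on the Expat parser, and the <key> element
and its "output" attribute are made-up stand-ins rather than names from
either XML language.  As far as I know Expat implements XML 1.0 only, so
declaring version="1.1" should not help.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element documents; ESC (U+001B) is written as a
# character reference, which is the form XML 1.1 requires for it.
DOC_10 = '<?xml version="1.0"?><key output="&#x1B;"/>'
DOC_11 = '<?xml version="1.1"?><key output="&#x1B;"/>'
# TAB (U+0009) is legal in both XML versions, so this one should parse.
DOC_TAB = '<?xml version="1.0"?><key output="&#x9;"/>'


def try_parse(document):
    """Return None if the document parses, else the parser's error message."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    for label, doc in [("1.0", DOC_10), ("1.1", DOC_11), ("tab", DOC_TAB)]:
        error = try_parse(doc)
        print(label, "parsed OK" if error is None else "failed: " + error)
```

A parser with real XML 1.1 support would be expected to accept DOC_11 and
still reject DOC_10.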

Hence my need for a Python XML parsing/processing module
that handles XML 1.1 characters when the file is
appropriately marked version="1.1".

Thanks again for the pointers.

Ken