[XML-SIG] Unicode support in xmlproc

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 27 Mar 2001 15:52:31 +0200


I have committed a few changes to xmlproc which make it generate
Unicode strings, and deal with most aspects of character sets in XML
correctly (with respect to the recommendation). In particular, it
honors the encoding attribute of the xml declaration and performs the
optional autodetection of an encoding. Encoding information provided
from a higher level (e.g. a MIME content type) is not yet taken into
account (offering a set_input_encoding method on XMLCommonParser might
be appropriate).
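
To illustrate what autodetection plus the encoding attribute amounts
to, here is a minimal sketch in present-day Python 3. It is not the
code committed to xmlproc; the name guess_xml_encoding and the default
argument are made up for this example, and the UCS-4 and EBCDIC cases
from the recommendation's autodetection appendix are left out.

```python
import codecs
import re

# Hypothetical helper for illustration only; not xmlproc's implementation.
# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of an ASCII-compatible byte stream.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)

def guess_xml_encoding(data, default="utf-8"):
    """Guess the encoding of an XML entity from its first bytes."""
    # 1. A byte order mark settles the encoding family.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # 2. No BOM, but "<?" encoded as two-byte units means UTF-16.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # 3. ASCII-compatible start: trust the declaration if there is one.
    m = _DECL.match(data)
    if m:
        return m.group(1).decode("ascii").lower()
    # 4. No declaration: the recommendation's default is UTF-8.
    return default
```

Encoding information from a higher level could be passed in through
the default argument, or override the result entirely; which of the
two is right is exactly the open question.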

On Python 1.5, a fallback procedure is used which only supports a
subset of the character sets (namely, US-ASCII, UTF-8, and Latin-1);
the application then receives UTF-8 encoded byte strings from xmlproc.
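
As a rough idea of what such a fallback has to do for Latin-1 input
when no Unicode type is available, the sketch below recodes bytes to
UTF-8 by hand. It is written in present-day Python 3 and is not the
fallback code in xmlproc; latin1_to_utf8 is a name invented for this
example.

```python
def latin1_to_utf8(data):
    """Recode Latin-1 bytes to UTF-8 bytes without a Unicode string type."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            # US-ASCII range: identical in Latin-1 and UTF-8.
            out.append(byte)
        else:
            # U+0080..U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
            out.append(0xC0 | (byte >> 6))
            out.append(0x80 | (byte & 0x3F))
    return bytes(out)
```

US-ASCII and UTF-8 input need no recoding at all, which is presumably
why exactly these three character sets are practical to support.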

AFAIK, the only missing aspect is proper support for Unicode in tag
and attribute names; XML allows quite a long list of characters there,
and I'm not sure how best to implement that. If anybody has an sre
regular expression that correctly matches the Name production of XML,
please let me know.
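
For reference, one possible shape of such an expression is sketched
below in present-day Python 3. Note the assumption: it uses the
simplified NameStartChar and NameChar ranges from the later Fifth
Edition of XML 1.0, not the Letter, Digit, CombiningChar and Extender
tables of the recommendation current at the time of this message, so
it accepts a somewhat different set of names. XML_NAME and is_xml_name
are names invented for this example.

```python
import re

# Contents of a character class for NameStartChar (XML 1.0, Fifth Edition).
_NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, the middle dot and two more ranges.
_NAME_CHAR = _NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

# Name ::= NameStartChar (NameChar)*
XML_NAME = re.compile("[%s][%s]*" % (_NAME_START, _NAME_CHAR))

def is_xml_name(s):
    """Return True if the whole string s matches the Name production."""
    return XML_NAME.fullmatch(s) is not None
```

With these definitions is_xml_name("foo:bar-1") is true, while
is_xml_name("1abc") is false because a digit may not start a name.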

This code has seen little testing, so I'm pretty sure that there
are bugs in it. If you find any problems, please post them to the list
or on SF; ideally, the major problems should be resolved before 0.7 is
released. Unfortunately, running the test suite with xmlproc as the
default parser does not help: many test cases expect an
IncrementalParser, and drv_xmlproc is not incremental.

Regards,
Martin