[XML-SIG] Re: Issues with Unicode type

Martin v. Loewis martin@v.loewis.de
23 Sep 2002 19:12:10 +0200


Eric van der Vlist <vdv@dyomedea.com> writes:

> > By default Python is using UTF-16 as its Unicode encoding. The
> > code-point that you specify, U+10800, is outside the BMP and hence is
> > represented by two surrogate characters in UTF-16.
> 
> Arg! Does that mean that by default Python isn't strictly conform to XML
> 1.0?

No. Why do you think so? Strictly speaking, XML 1.0 defines a
"character" by reference to ISO/IEC 10646:1993 and ISO/IEC 10646-1:2000.
Under that definition, only characters in the Basic Multilingual Plane
are allowed in XML, and James Clark's document is ill-formed.

That aside, Python does process your document, and represents the
character U+10800 as defined in the Python language definition. So if
you extend XML 1.0 to Unicode 3.2 in a canonical way, Python supports
that character as specified. Any application that wants to count
Unicode code points may need to take surrogates into account, and
may not be able to rely on the len() builtin.
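
To illustrate the point about surrogates, here is a minimal sketch of
counting code points in a sequence of UTF-16 code units. The helper
names utf16_code_units and count_code_points are made up for this
example, and it assumes a recent Python 3, where a str already holds
whole code points; the code units are therefore obtained by encoding
explicitly, to mimic what a UTF-16 based build stores.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def count_code_points(units):
    """Count code points in a sequence of UTF-16 code units.

    A high surrogate (0xD800-0xDBFF) immediately followed by a low
    surrogate (0xDC00-0xDFFF) is counted as a single code point.
    """
    count = 0
    i = 0
    while i < len(units):
        is_pair = (0xD800 <= units[i] <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        i += 2 if is_pair else 1
        count += 1
    return count


units = utf16_code_units("a\U00010800b")
print([hex(u) for u in units])   # U+10800 becomes the pair 0xd802, 0xdc00
print(len(units))                # 4 code units
print(count_code_points(units))  # 3 code points
```

On a build that stores UTF-16 internally, len() reports the number of
code units (4 here), while the surrogate-aware count gives 3.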

Notice also that U+10800 is unassigned even in Unicode 3.2.
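
One way to check this, as a sketch: later Python versions ship a
frozen copy of the Unicode 3.2 database as unicodedata.ucd_3_2_0
(this assumes a Python recent enough to provide it). The general
category "Cn" means the code point is unassigned in that version.

```python
import unicodedata

ch = "\U00010800"

# Look the character up in the frozen Unicode 3.2 database rather than
# in the (newer) database the interpreter uses by default.
category_3_2 = unicodedata.ucd_3_2_0.category(ch)
print(unicodedata.ucd_3_2_0.unidata_version)  # 3.2.0
print(category_3_2)                           # Cn, i.e. unassigned
```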

Regards,
Martin