[XML-SIG] Re: Issues with Unicode type

Eric van der Vlist vdv@dyomedea.com
23 Sep 2002 18:48:28 +0200


On Mon, 2002-09-23 at 18:14, Tom Emerson wrote:

> By default Python is using UTF-16 as its Unicode encoding. The
> code-point that you specify, U+10800, is outside the BMP and hence is
> represented by two surrogate characters in UTF-16.

Argh! Does that mean that, by default, Python doesn't strictly conform to
XML 1.0?

> If you were to recompile your Python installation to use UTF-32 as the
> Unicode character type then I expect that you will get the length you
> expect.

But that would also mean that a library relying on this would work only
with Python installations compiled to use UTF-32 :-(
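One way around that, as a rough sketch only: the library could count code points itself and treat a high surrogate followed by a low surrogate as a single character, so the result would not depend on how Python was compiled. The helper name `code_point_length` is hypothetical, and the example assumes a current Python 3 interpreter, where a string may still hold surrogate code units explicitly.

```python
def code_point_length(text):
    """Count code points, treating a high+low surrogate pair as one."""
    count = 0
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        is_high = 0xD800 <= unit <= 0xDBFF
        # A pair is a high surrogate immediately followed by a low surrogate.
        if is_high and i + 1 < n and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            i += 2
        else:
            i += 1
        count += 1
    return count


# U+10800 written as two explicit UTF-16 surrogates (as a UTF-16 build stores it)
as_surrogates = "\ud802\udc00"
# The same character written as a single code point (as a UTF-32 build stores it)
as_code_point = "\U00010800"

narrow_count = code_point_length(as_surrogates)
wide_count = code_point_length(as_code_point)
```

Both spellings are counted as one character here, whereas plain len() on the surrogate form gives 2.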

> Consider:
>
> >>> c = u"\u4e00"
> >>> c
> u'\u4e00'
> >>> len(c)
> 1

Yes, my length being "2" was due to the fact that the character takes
more than 16 bits...
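For the record, the arithmetic behind that "2" can be sketched as follows. This is only an illustration: the function name `to_surrogate_pair` is made up for the example, and it assumes a current Python 3 interpreter (3.3 or later), where strings are always indexed by code point, so the UTF-16 unit count is obtained by encoding explicitly.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10800)     # U+10800 from this thread

# Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
utf16_units = len("\U00010800".encode("utf-16-le")) // 2
bmp_units = len("\u4e00".encode("utf-16-le")) // 2
```

U+10800 comes out as the pair 0xD802, 0xDC00, i.e. two 16-bit units, while U+4E00 fits in one.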

Thanks

Eric
-- 
Rendez-vous à Paris.
                          http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------