[XML-SIG] Re: Issues with Unicode type

Martin v. Loewis martin@v.loewis.de
23 Sep 2002 18:55:59 +0200


Eric van der Vlist <vdv@dyomedea.com> writes:

> > By default Python is using UTF-16 as its Unicode encoding. The
> > code-point that you specify, U+10800, is outside the BMP and hence is
> > represented by two surrogate characters in UTF-16.
>
> Argh! Does that mean that by default Python doesn't strictly conform to
> XML 1.0?

No. Why do you think this?
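
A minimal sketch of the surrogate representation described in the quoted
text, assuming only the standard library: the helper below
(utf16_code_units is a hypothetical name, not part of any library) encodes
a single character as UTF-16 and lists the resulting 16-bit code units.
U+10800 comes out as a surrogate pair, while a BMP character such as
U+4E00 is a single unit. In both cases the string holds one character in
the XML sense; only the number of storage units differs.

```python
import struct


def utf16_code_units(ch):
    """Return the 16-bit UTF-16 code units for a Unicode string.

    Hypothetical helper for illustration only.
    """
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


# U+10800 lies outside the BMP, so it needs a high and a low surrogate.
print([hex(u) for u in utf16_code_units(u"\U00010800")])  # ['0xd802', '0xdc00']

# U+4E00 lies inside the BMP, so it needs a single code unit.
print([hex(u) for u in utf16_code_units(u"\u4e00")])  # ['0x4e00']
```

Counting code units this way gives the same answer on any build, whereas
len() on the string itself depends on how the interpreter stores Unicode.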

>
> > If you were to recompile your Python installation to use UTF-32 as the
> > Unicode character type then I expect that you will get the length you
> > expect.
>
> But that would also mean that a library relying on this would work only
> with Python installations compiled to use UTF-32 :-(
>
> > Consider:
> >
> > >>> c = u"\u4e00"
> > >>> c
> > u'\u4e00'
> > >>> len(c)
> > 1
>
> Yes, my length being "2" was due to the fact that the character takes
> more than 16 bits...
>
> Thanks
>
> Eric