[XML-SIG] Re: Issues with Unicode type

Uche Ogbuji uche.ogbuji@fourthought.com
Mon, 23 Sep 2002 11:31:51 -0600


> > <?xml version="1.0" encoding="utf-8"?>
> >       <doc>&#67584;</doc>
> > 
> > and the length of the text node of the doc element is supposed to be 1
> > instead of 2 as expected by my (naive) implementation of the length
> > facet.
> > 
> > What makes me think that it could be a generic issue with python is the
> > following (kindly contributed by Uche):
> > 
> > <uche> >>> hex(67584)
> > <uche> '0x10800'
> > <uche> >>> c = u"\u10800"
> > <uche> >>> c
> > <uche> u'\u10800'
> > <uche> >>> len(c)
> > <uche> 2
> 
> By default Python is using UTF-16 as its Unicode encoding. The
> code-point that you specify, U+10800, is outside the BMP and hence is
> represented by two surrogate characters in UTF-16.
> 
> If you were to recompile your Python installation to use UTF-32 as the
> Unicode character type then I expect that you will get the length you
> expect.
> 
> Consider:
> 
> >>> c= u"\u4e00"
> >>> c
> u'\u4e00'
> >>> len(c)
> 1

Hmm.  I'm going to open my mouth and show off my ignorance now.  I should 
probably spend some time with my Tony Graham before ever posting on Unicode, 
but I don't have the time right now, and besides, there is no better way to 
get Eric an answer than to say something wrong that has to be corrected by one 
of the many Unicode gurus who I know hang around here  :-)

IIRC, UTF-16 supports the representation of characters outside the BMP by 
using surrogate pairs (SP).  If so, then the scary solution of requiring XML 
users to compile Python to use UCS-4 can be put aside.
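
For concreteness, here is a rough sketch of the arithmetic UTF-16 uses to
split a code point above U+FFFF into a surrogate pair, applied to Eric's
U+10800.  The helper name to_surrogate_pair is made up for illustration; it
is not a standard library function.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x10800)
print(hex(high), hex(low))  # 0xd802 0xdc00
```

So, if I have the arithmetic right, U+10800 would be stored as the two code
units D802 DC00 when the underlying encoding is UTF-16.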

The question would then be how to get a surrogate pair into a Python unicode 
object.  On a hunch, I tried:

>>> c = u"\uD800\uDC00"
>>> len(c)
2

So I guess the answer isn't just using the literal characters in the SP.
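
One possible way for a length facet to cope, whatever the build of Python,
would be to count a high surrogate followed by a low surrogate as a single
character.  A minimal sketch, assuming the text is available as a sequence of
integer UTF-16 code units; count_code_points is a hypothetical name, not an
existing API.

```python
def count_code_points(code_units):
    """Count characters in a sequence of UTF-16 code units (ints).

    A well-formed surrogate pair counts once; an unpaired surrogate is
    counted as one character rather than rejected.
    """
    count = 0
    i = 0
    n = len(code_units)
    while i < n:
        unit = code_units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
            i += 2                       # skip both halves of the pair
        else:
            i += 1
        count += 1
    return count


print(count_code_points([0xD802, 0xDC00]))  # 1
print(count_code_points([0x4E00]))          # 1
```

A side note on the earlier transcript: the \u escape takes exactly four hex
digits, so u"\u10800" is U+1080 followed by the digit "0", which has length 2
for reasons unrelated to surrogates.  The eight-digit form, u"\U00010800", is
the one that names the code point itself.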


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html