[XML-SIG] Re: Issues with Unicode type

Uche Ogbuji uche.ogbuji@fourthought.com
Mon, 23 Sep 2002 16:41:58 -0600


> However that is really two characters 0x1080 and 0x0030.  \u (lowercase)
> only takes 4 hex digits.  \U (uppercase) takes 8 digits.  So to create the
> character 0x10800, the sequence should be u'\U00010800'.
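
To make the digit counting concrete, here is a small sketch. It assumes the
literal under discussion looked like u"\u10800" (my guess, it is not quoted
above) and that it is run on Python 3.3 or later, where the u"" prefix is still
accepted and each string item is a whole code point. The variable names are
only illustrative.

```python
# Assumption: Python 3.3+, so each item of a str is one full code point.
short_escape = u"\u10800"      # \u consumes exactly 4 hex digits: 1080, then a literal "0"
long_escape = u"\U00010800"    # \U consumes exactly 8 hex digits

short_code_points = [hex(ord(ch)) for ch in short_escape]
long_code_points = [hex(ord(ch)) for ch in long_escape]

print(short_code_points)  # ['0x1080', '0x30']
print(long_code_points)   # ['0x10800']
```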

Right, Jeremy.  I wasn't squinting hard enough at Daniel's example.  In my own 
examples, I've been using

u"\U00010000"

or

u"\uD800\uDC00"

These are actually equivalent if Python is compiled to store Unicode strings
as UTF-16: in the first example, Python breaks the full code point into its
UTF-16 surrogate pair, and so ends up with the same internal object as the
second form.
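
The split into a surrogate pair is plain arithmetic, so it can be sketched
independently of how any particular Python build stores strings. The helper
name to_surrogate_pair is hypothetical, not a standard library function; the
last lines cross-check the result against the utf-16-be codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

pair = to_surrogate_pair(0x10000)
print([hex(unit) for unit in pair])        # ['0xd800', '0xdc00']

# Cross-check: the codec produces the same two 16-bit units, big-endian.
encoded = u"\U00010000".encode("utf-16-be")
print(encoded == b"\xd8\x00\xdc\x00")      # True
```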

I'm not sure whether they would be equivalent if Python is compiled for UCS-4
(BTW, there is no difference between UTF-32 and UCS-4, is there?).  I would
imagine that for the second form Python would blindly create 2 pseudo code
points, D800 and DC00.  I say "pseudo" because these values are in the
surrogate blocks and so are not valid characters in themselves.
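
One way to check this guess is on an interpreter that stores full code points.
The sketch below assumes Python 3.4 or later (needed for the "surrogatepass"
error handler with the UTF-16 codecs); it shows how such an interpreter
behaves and says nothing certain about any other build. The variable names
are illustrative only.

```python
# Assumption: Python 3.4+, where str items are whole code points.
astral = u"\U00010000"
pair = u"\uD800\uDC00"

print(len(astral), len(pair))  # 1 2
print(astral == pair)          # False

# The two surrogates are not valid characters, so strict UTF-8 refuses them.
try:
    pair.encode("utf-8")
    pair_is_encodable = True
except UnicodeEncodeError:
    pair_is_encodable = False
print(pair_is_encodable)       # False

# Passing them through UTF-16 explicitly joins them into the real code point.
rejoined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(rejoined == astral)      # True
```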

Which leads me to believe that, even though u"\uD800\uDC00" would be treated
equivalently to u"\U00010000" as long as Python is compiled for UTF-16, it is
a *very* bad idea to write unicode literals that way.

I'm learning a lot today  :-)


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html