[XML-SIG] Re: Issues with Unicode type

Uche Ogbuji uche.ogbuji@fourthought.com
Tue, 24 Sep 2002 17:52:21 -0600


> 
> Martin v. Loewis writes:
>  > 3. Implement it properly. Please understand that you will be trading
>  >    efficiency for correctness.
> 
> I'm sure a small C extension could provide the needed helpers quite
> efficiently.  Even with a UCS-4 version of Python, a Unicode literal
> containing a surrogate pair (explicitly, using two \u sequences) will
> exhibit the behavior that Eric wants to see suppressed.

Yes.  That is what I figured out in my recent rumination on such literals.  My 
conclusion was *never* to use "naked" surrogate pairs in Unicode literals, 
even with UTF-16 Python.  I get the sense this is a "best practice" that 
should be clearly articulated:

Do *not* write characters outside the BMP in Unicode literals as explicit 
UTF-16 surrogate pairs, e.g. u"\uD800\uDC00".  *Always* use the eight-digit 
escape form (big-U notation), e.g. u"\U00010000".
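
To make the difference concrete, here is a rough sketch.  It assumes a 
Python 3.3 or later interpreter, where strings always hold whole code points 
(comparable to the UCS-4 build discussed above), not the Python 2.2 builds 
this thread is about.  The name join_surrogates is a hypothetical helper 
made up for this illustration, not part of any library.

```python
# Assumption: Python 3.3+, where str always stores whole code points.
naked = u"\uD800\uDC00"     # two \u escapes: two separate surrogate code points
proper = u"\U00010000"      # big-U escape: the single code point U+10000


def join_surrogates(text):
    """Hypothetical helper: merge well-formed surrogate pairs.

    The text is encoded to UTF-16 code units, with "surrogatepass"
    letting the surrogates through, and then decoded again so that
    each pair becomes one code point.
    """
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


print(len(naked), len(proper))           # lengths differ: 2 versus 1
print(naked == proper)                   # the two literals are not equal
print(join_surrogates(naked) == proper)  # equal once the pair is merged
```

With the big-U form there is nothing to repair afterwards, which is the 
point of the recommendation.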

Unless someone weighs in with reasoning against this, I plan to add 
something to this effect to the Akara.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html