[XML-SIG] Re: Issues with Unicode type

Uche Ogbuji uche.ogbuji@fourthought.com
Mon, 23 Sep 2002 14:04:02 -0600


> Uche Ogbuji writes:
> > No.  A surrogate pair is one character.  It takes up 2 16-bit values,
> > but this is not the same as taking up 2 characters.  The whole point of
> > a variable-length encoding such as UTF-16 is that the number of storage
> > values is not always the same as the number of characters.
> 
> Yes, I'm aware of that. The problem is one of me being sloppy in the
> use of the word 'character'.

Ah.  I didn't mean to leap too hard on that.  I thought we had a genuine 
misunderstanding on this.


> > Yes.  Don't you see that this means that the behavior as compiled with
> > UTF-16 is wrong from a *character set* point of view?  The same code
> > point is *one* character whether encoded in UTF-7, UTF-8, UTF-16,
> > UTF-32, UCS-2, UCS-4, etc.  It is never more than one character.
> 
> Sure, but the *implementation* within the Python interpreter is
> treating characters in the astral planes as two 16-bit words, not
> one. The len() value that you get is the number of UTF-16-encoded
> words in the string. There was a very long, very drawn out discussion
> on the representation of Unicode characters in Python a while back on
> the python-i18n mailing list where this whole thing was beaten to
> death and which eventually led to the option to compile the
> interpreter to use a 32-bit character representation.

Yes.  I'm learning about all this, and learning a lot that I would probably 
have preferred to be blissfully ignorant of  :-(

Thanks.
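
For anyone following along, here is a rough sketch of the distinction 
discussed above: one code point versus the number of 16-bit values UTF-16 
needs to store it.  It assumes a recent Python 3, where len() on a string 
counts code points regardless of how the interpreter was compiled; the 
helper names utf16_code_units and to_surrogate_pair are made up for 
illustration and are not part of any library.

```python
# Sketch only: assumes Python 3.3+, where str is indexed by code point.
# Helper names below are hypothetical, chosen for this illustration.

ch = "\U00010400"  # one character outside the Basic Multilingual Plane


def utf16_code_units(s):
    """Return how many 16-bit values are needed to store s in UTF-16."""
    # utf-16-le emits no byte order mark, so every 2 bytes is one code unit.
    return len(s.encode("utf-16-le")) // 2


def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = cp - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(len(ch))                  # 1: one character
print(utf16_code_units(ch))     # 2: two 16-bit storage values
print([hex(u) for u in to_surrogate_pair(ord(ch))])  # ['0xd801', '0xdc00']
```

An interpreter that stores strings as UTF-16 and reports the second number 
from len() is the behavior described in the quoted text; the first number 
is the character count.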


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html