[XML-SIG] Re: Issues with Unicode type

Uche Ogbuji uche.ogbuji@fourthought.com
23 Sep 2002 15:54:35 -0600


On Mon, 2002-09-23 at 15:38, Mike Brown wrote:
> So I think the problem here is not that Python says len(u"\uD800\uDC00") is 2
> (unless somewhere it says that Python supports Unicode 3.2) but that someone
> assumed len() returns a count of Unicode characters...

I think the real problem is rather that nothing says that len()
operating on Unicode objects is *not* a count of characters.  Nothing
says that len() is strictly a count of storage values, either.  I
think it's perfectly natural to assume len() is a count of characters,
and Python's docs should be clarified in this regard.  Consider that
other built-ins such as repr(), and the literal parsing code, do deal in
characters and not storage values.  So why should anyone expect len() to
be different?
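
To make the two possible meanings of "length" concrete, here is a
minimal sketch.  It assumes a modern Python 3 interpreter (3.3 or
later), not the 2002-era builds under discussion, and the helper names
are made up for illustration.  It counts the same string both ways:
once as UTF-16 storage values, once as characters (code points).

```python
# U+10000 is the character written u"\uD800\uDC00" as a surrogate pair.
s = "\U00010000"

def utf16_code_units(text):
    """Count UTF-16 code units, i.e. 16-bit storage values."""
    # Each UTF-16 code unit occupies exactly two bytes.
    return len(text.encode("utf-16-le")) // 2

def code_points(text):
    """Count characters (code points); len() does this on Python 3.3+."""
    return len(text)

print("storage values:", utf16_code_units(s))
print("characters:", code_points(s))
```

The two counts differ for any string containing a character outside
the Basic Multilingual Plane, which is exactly the case where a reader
of the docs needs to know which one len() means.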

As I said, the main problem I see with all this in Python is
inconsistency and a lack of documentation.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API -
http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML -
http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib -
http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html