[XML-SIG] Re: Issues with Unicode type

M.-A. Lemburg mal@lemburg.com
Tue, 24 Sep 2002 09:49:33 +0200


Martin v. Loewis wrote:
> Uche Ogbuji <uche.ogbuji@fourthought.com> writes:
> 
>>I think the real problem is rather that nothing says that len()
>>operating on Unicode objects is *not* a count of characters.  There is
>>nothing that says that len is strictly a count of storage values.  I
>>think it's perfectly natural to assume len() is a count of characters,
>>and Python's docs should be clarified in this regard.  

len() counts the number of Unicode code units, not code points,
and certainly not graphemes, which are what users usually
think of as "characters".

It's a technical necessity. Special algorithms would be needed
to provide the length and index information in terms of
code points and graphemes (and words).
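
To make the three counts concrete, here is a rough sketch. It assumes
a current Python 3 (3.3 or later), where len() on a str counts code
points, so the UTF-16 code unit count has to be derived by encoding.
The helper names are made up for illustration, and the grapheme count
is only an approximation that skips combining marks; it is not the
full Unicode segmentation algorithm.

```python
import unicodedata


def utf16_code_units(text):
    # Hypothetical helper: every UTF-16 code unit is 2 bytes, and the
    # "utf-16-le" codec does not prepend a byte order mark.
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    # Assumption: Python 3.3+, where str is a sequence of code points.
    return len(text)


def approx_graphemes(text):
    # Crude approximation: count only code points whose canonical
    # combining class is 0, i.e. ignore combining marks. Real grapheme
    # clustering needs considerably more rules than this.
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# 'e' + COMBINING ACUTE ACCENT + MUSICAL SYMBOL G CLEF (outside the BMP)
sample = "e\u0301\U0001D11E"

print(utf16_code_units(sample))   # 4: the clef needs a surrogate pair
print(code_points(sample))        # 3
print(approx_graphemes(sample))   # 2: accented e, clef
```

On a build that stores Unicode as UTF-16, len() would report the
first of these numbers for the same text.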

See my Unicode talk for details on the different terms:

     http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/