[XML-SIG] Re: Issues with Unicode type

Martin v. Loewis martin@v.loewis.de
24 Sep 2002 00:33:56 +0200


Uche Ogbuji <uche.ogbuji@fourthought.com> writes:

> > Yes, that is what I (thought I) said in my previous response: since
> > internally Python is representing characters outside the BMP as a
> > surrogate pair in UTF-16, the length of a Unicode string using these
> > characters is 2 --- two UTF-16 characters.
> 
> No.  A surrogate pair is one character.  

Yes: the question is what the len() function returns. The number of
characters? Apparently not. The number of code units? Yes,
definitely.

> It takes up 2 16-bit values, but this is not the same as taking up 2
> characters.

Nobody said the len function would return the number of characters: it
returns the number of code units, which is sometimes different from the
number of code points.
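
To make the distinction concrete, here is a minimal sketch. It assumes
Python 3.3 or later, where len() on a str counts code points, so the
code-unit count of a UTF-16 build is imitated by encoding to UTF-16 and
halving the byte count. The helper name utf16_code_units is
hypothetical, not an existing API.

```python
def utf16_code_units(s):
    """Count the 16-bit code units needed to store s in UTF-16."""
    # 'utf-16-le' writes no byte-order mark, so bytes // 2 == code units.
    return len(s.encode("utf-16-le")) // 2


bmp = "\u00e9"         # U+00E9, inside the BMP
astral = "\U00010000"  # U+10000, first code point outside the BMP

print(utf16_code_units(bmp))     # 1
print(utf16_code_units(astral))  # 2 (a surrogate pair)
print(len(astral))               # 1 on Python 3.3+ (code points)
```

On a build that stores strings as UTF-16, the second figure is what
len() reports for the non-BMP character.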

> No.  My whole point is that it didn't work.  len(c) would be 1, not 2 if
> the characters were properly treated as a surrogate pair. 

No. It depends on what you expect len to return. If len returned
the number of code points, it would not be additive, i.e. you could
create strings A and B such that

len(A) + len(B) <> len(A+B)

That would be confusing to implementations; it would also mean that
len(X) could not be computed in O(1), which would also be confusing.
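
A rough sketch of the additivity problem, under the assumption that a
string is modelled as a plain list of 16-bit code units (integers).
The function count_code_points is hypothetical; it stands in for a
len() that would count code points. A is a lone high surrogate and B a
lone low surrogate, so concatenating them forms one surrogate pair.

```python
def count_code_points(units):
    """Count code points in a list of UTF-16 code units.

    A high surrogate followed by a low surrogate counts as one code
    point; any other unit, including an unpaired surrogate, counts as
    one on its own.
    """
    count = 0
    i = 0
    while i < len(units):  # a full scan: O(n), not O(1)
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


A = [0xD800]  # lone high surrogate
B = [0xDC00]  # lone low surrogate

print(count_code_points(A) + count_code_points(B))  # 2
print(count_code_points(A + B))                     # 1
print(len(A) + len(B) == len(A + B))                # True
```

Counting code units stays additive and needs no scan; counting code
points, at least in this model, gives up both properties.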

> Yes.  Don't you see that this means that the behavior as compiled with
> UTF-16 is wrong from a *character set* point of view?  The same code
> point is *one* character whether encoded in UTF-7, UTF-8, UTF-16,
> UTF-32, UCS-2, UCS-4, etc.  It is never more than one character.

Sure. That makes it clear that len() does not count characters.

Regards,
Martin