[Python-Dev] UCS2/UCS4 default

Guido van Rossum guido at python.org
Thu Jul 3 18:48:38 CEST 2008


On Thu, Jul 3, 2008 at 9:35 AM, Steve Holden <steve at holdenweb.com> wrote:
> Paul Moore wrote:
>>
>> On 03/07/2008, Guido van Rossum <guido at python.org> wrote:
>>>
>>> I don't see an answer there to the question of whether the length()
>>> method of a Java String object containing a single surrogate pair
>>> returns 1 or 2; I suspect it returns 2.
>>
>> It appears you're right:
>>
>> > type testucs.java
>>
>> class testucs {
>>    public static void main(String[] args) {
>>        StringBuilder s = new StringBuilder("Hello, ");
>>        s.appendCodePoint(0x2F81A);
>>        System.out.println(s); // Display the string.
>>        System.out.println(s.length());
>>    }
>> }
>>
>> > java testucs
>>
>> Hello, ?
>> 9
>>
>> > java -version
>>
>> java version "1.6.0_05"
>> Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
>> Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)
>>
>>> Python 3 supports things like
>>> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
>>> unichr and unicode literals.)
>>
>> And Java doesn't appear to - that appendCodePoint() method was
>> wonderfully hard to find :-)
>>
> There's also the issue of indexing the Unicode strings. If we are going to
> insist that len(u) counts surrogate pairs as one character then random
> access to the characters of a string is going to be an extremely inefficient
> operation.

But my whole point is that len(u) should count surrogate pairs as TWO!
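
To make the two counts concrete, here is a minimal sketch. It assumes a
modern Python 3 (3.3 or later), where str is no longer stored as 16-bit
units, so the count a UCS2 build would report is emulated by encoding to
UTF-16. The helper name utf16_len is made up for illustration and is not
part of any library.

```python
def utf16_len(s):
    """Count 16-bit code units, roughly what len() reports on a UCS2 build.

    Hypothetical helper for illustration only.
    """
    # "utf-16-le" emits no byte order mark, and every code unit is 2 bytes.
    # "surrogatepass" lets lone surrogate halves through instead of raising.
    return len(s.encode("utf-16-le", "surrogatepass")) // 2


# Same string as the Java test quoted above: 7 BMP characters plus U+2F81A.
text = "Hello, " + chr(0x2F81A)

print(utf16_len(text))  # 9: the surrogate pair counts as TWO
print(len(text))        # 8 on Python 3.3+, which counts code points
```

The first number matches the 9 printed by the Java program; the second is
the per-character count that a separate helper would have to provide on a
build whose strings are sequences of 16-bit values.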

> Surely it's desirable under all circumstances that
>
>  len(u) == sum(1 for c in u)
>
> and that
>
>  [c for c in u] == [u[i] for i in range(len(u))]
>
> How would that play under Jeroen's proposed change?

I am not considering such a change. At best there will be some helper
function in unicodedata, or perhaps a helper method on the 3.0 str
type, to iterate over characters instead of 16-bit values. Whether that
iterator should yield 21-bit integer values or strings containing one
character (i.e. perhaps a surrogate pair), and what it should do with
lone surrogate halves, is up to whichever committee designs this API.
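
One possible shape for such a helper, as a rough sketch only: the names
to_utf16_units and iter_code_points are hypothetical, the iterator yields
integers rather than one-character strings, and lone surrogate halves are
passed through unchanged. Those last two points are exactly the design
choices left open above, so treat them as assumptions, not as a proposal.

```python
def to_utf16_units(s):
    """Return the 16-bit code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield code points from a sequence of 16-bit values (hypothetical helper).

    A high surrogate followed by a low surrogate is combined into one
    code point. Anything else, including a lone surrogate half, is
    yielded as-is (an assumption made for this sketch).
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Standard UTF-16 decoding of a surrogate pair.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


units = to_utf16_units("Hello, " + chr(0x2F81A))
code_points = list(iter_code_points(units))
print(len(units), len(code_points))  # 9 8
print(hex(code_points[-1]))          # 0x2f81a
```

With this split, len() and indexing stay O(1) over 16-bit values, and only
code that explicitly asks for characters pays for the linear scan.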

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

