[Python-Dev] PEP 393 Summer of Code Project

Sat Sep 10 03:11:18 CEST 2011

On 9/9/2011 5:21 PM, Guido van Rossum wrote:
> I, for one, am very interested. It sounds like the 'unicode' datatype
> in Jython does not in fact have O(1) indexing characteristics if the
> string contains any characters in the astral plane. Interesting. I
> wonder if you have heard from anyone about this affecting their app's
> performance?
>
> --Guido

The question is whether or how often any Jython users are yet 
indexing/slicing long strings with astral chars. If a utf-8 xml file is 
directly parsed into a DOM, then the longest decoded strings will be 
'paragraphs' that are seldom more than 1000 chars.

> On Fri, Sep 9, 2011 at 12:58 PM, fwierzbicki at gmail.com
> <fwierzbicki at gmail.com>  wrote:
>> On Fri, Sep 9, 2011 at 10:16 AM, Terry Reedy<tjreedy at udel.edu>  wrote:
>>
>>> I am curious how you index by code point rather than code unit with 16-bit
>>> code units and how it compares with the method I posted. Is there anything I
>>> can read? Reply off list if you want.
>> I'll post on-list until someone complains, just in case there are
>> interested onlookers :)
>>
>> There aren't docs, but the code is here:
>> https://bitbucket.org/jython/jython/src/8a8642e45433/src/org/python/core/PyUnicode.java
>>
>> Here are (I think) the most relevant bits for random access -- note
>> that getString() returns the internal representation of the PyUnicode
>> which is a java.lang.String
>>
>>     @Override
>>     protected PyObject pyget(int i) {
>>         if (isBasicPlane()) {
>>             return Py.makeCharacter(getString().charAt(i), true);
>>         }

This is O(1)

>>         int k = 0;
>>         while (i>  0) {
>>             int W1 = getString().charAt(k);
>>             if (W1>= 0xD800&&  W1<  0xDC00) {
>>                 k += 2;
>>             } else {
>>                 k += 1;
>>             }
>>             i--;

This is an O(n) linear scan.

>>         }
>>         int codepoint = getString().codePointAt(k);
>>         return Py.makeCharacter(codepoint, true);
>>     }

Near the beginning of this thread, I described and gave a link to my 
O(logk) algorithm, where k is the number of supplementary ('astral') 
chars. It uses bisect.bisect_left on an int array of length k 
constructed with a linear scan much like the one above, with one added 
line. The basic idea is to do the linear scan just once and save the 
locations (code point indexes) of the astral chars instead of repeating 
the scan on every access. That could be done as the string is 
constructed. The same array search works for slicing too. Jython is 
welcome to use it if you ever decide you need it.

I have in mind to someday do some timing tests with the Python version. 
I just do not know how closely results would be to those for compiled C 
or Java.

-- 
Terry Jan Reedy