[Python-Dev] PEP 393 Summer of Code Project

Terry Reedy tjreedy at udel.edu
Sat Aug 27 00:57:37 CEST 2011



On 8/26/2011 5:29 AM, "Martin v. Löwis" wrote:
>> IronPython and Jython can retain UTF-16 as their native form if that
>> makes interop cleaner, but in doing so they need to ensure that basic
>> operations like indexing and len work in terms of code points, not
>> code units, if they are to conform.

My impression is that a UTF-16 implementation, to be properly called 
such, must do len and [] in terms of code points, which is why Python's 
narrow builds are called UCS-2 and not UTF-16.
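
To make the distinction concrete, here is a small sketch (my own 
illustration, not from the thread) showing how code-unit and code-point 
counts diverge for a non-BMP character:

```python
# A string containing one supplementary (non-BMP) character, U+10400.
s = "a\U00010400b"

# Code points: what a conforming UTF-16 (or UCS-4) implementation reports.
code_points = len(s)                          # 3 on a wide build

# UTF-16 code units: the supplementary character takes a surrogate pair.
code_units = len(s.encode("utf-16-le")) // 2  # 4

# On a narrow (UCS-2) build, len(s) reports 4 -- code units, not code
# points -- which is why such builds are not properly called UTF-16.
```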

> That means that they won't conform, period. There is no efficient
> maintainable implementation strategy to achieve that property,

Given that both 'efficient' and 'maintainable' are relative terms, that 
is your pessimistic opinion, not really a fact.

> it may take well years until somebody provides an efficient
> unmaintainable implementation.
>
>> Does this make sense, or have I completely misunderstood things?
>
> You seem to assume it is ok for Jython/IronPython to provide indexing in
> O(n). It is not.

Why do you keep saying that O(n) is the alternative? I have already 
given a simple solution that is O(log k), where k is the number of 
non-BMP characters/codepoints/surrogate_pairs if there are any, and O(1) 
otherwise (for all-BMP strings). It uses O(k) space. I think that is 
pretty efficient. I suspect that is the most time-efficient possible 
without using at least as much space as a UCS-4 solution. The fact that 
you and others do not want this for CPython should not preclude other 
implementations that are more tied to UTF-16 from exploring the idea.
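
A minimal sketch of that idea (my own illustration of the approach, not 
an actual Jython/IronPython implementation): store the string as UTF-16 
code units plus a sorted auxiliary array recording the code-point index 
of each surrogate pair, then binary-search that array to translate a 
code-point index into a code-unit index.

```python
import bisect


class U16String:
    """UTF-16 storage with O(log k) code-point indexing,
    where k is the number of surrogate pairs (O(1) when k == 0)."""

    def __init__(self, s):
        # Store as a list of 16-bit code units.
        data = s.encode("utf-16-le")
        self.units = [int.from_bytes(data[i:i + 2], "little")
                      for i in range(0, len(data), 2)]
        # Auxiliary array: code-point index of each surrogate pair.
        self.pairs = []
        cp = i = 0
        while i < len(self.units):
            if 0xD800 <= self.units[i] <= 0xDBFF:  # high surrogate
                self.pairs.append(cp)
                i += 2
            else:
                i += 1
            cp += 1
        self.length = cp

    def __len__(self):
        # Length in code points, not code units.
        return self.length

    def __getitem__(self, idx):
        # Each surrogate pair before idx shifts the unit index by one;
        # count them with a binary search over the auxiliary array.
        k = bisect.bisect_left(self.pairs, idx)
        unit = self.units[idx + k]
        if 0xD800 <= unit <= 0xDBFF:
            lo = self.units[idx + k + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (lo - 0xDC00))
        return chr(unit)
```

For an all-BMP string the auxiliary array is empty and indexing is a 
plain array access; the O(k) extra space is paid only by strings that 
actually contain supplementary characters.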

Maintainability partly depends on whether all-codepoint support is built 
in or bolted on to a BMP-only implementation burdened with back 
compatibility for a code unit API. Maintainability is probably harder 
with a separate UTF-32 type, which CPython has but which I gather Jython 
and IronPython do not. It might or might not be easier if there were a 
separate internal character type containing a 32-bit code point value, 
so that iteration and indexing (and single-char slicing) always 
returned the same type of object regardless of whether the character was 
in the BMP or not. This certainly would help all the unicode database 
functions.

Tom Christiansen appears to have said that Perl uses, or will use, 
UTF-8 plus auxiliary arrays. If so, we will find out if they can 
maintain it.

---
Terry Jan Reedy



More information about the Python-Dev mailing list