[Python-Dev] PEP 393 Summer of Code Project
Terry Reedy
tjreedy at udel.edu
Sat Aug 27 00:57:37 CEST 2011
On 8/26/2011 5:29 AM, "Martin v. Löwis" wrote:
>> IronPython and Jython can retain UTF-16 as their native form if that
>> makes interop cleaner, but in doing so they need to ensure that basic
>> operations like indexing and len work in terms of code points, not
>> code units, if they are to conform.
My impression is that a UTF-16 implementation, to be properly called
such, must do len and [] in terms of code points, which is why Python's
narrow builds are called UCS-2 and not UTF-16.
> That means that they won't conform, period. There is no efficient
> maintainable implementation strategy to achieve that property,
Given that both 'efficient' and 'maintainable' are relative terms, that
is your pessimistic opinion, not really a fact.
> it may well take years until somebody provides an efficient
> unmaintainable implementation.
>
>> Does this make sense, or have I completely misunderstood things?
>
> You seem to assume it is ok for Jython/IronPython to provide indexing in
> O(n). It is not.
Why do you keep saying that O(n) is the alternative? I have already
given a simple solution that is O(log k), where k is the number of
non-BMP characters/codepoints/surrogate_pairs if there are any, and O(1)
otherwise (for all BMP chars). It uses O(k) space. I think that is
pretty efficient. I suspect that is the most time-efficient possible
without using at least as much space as a UCS-4 solution. The fact that
you and others do not want this for CPython should not preclude other
implementations that are more tied to UTF-16 from exploring the idea.
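To make the idea concrete, here is a minimal sketch (my own illustration, not code from any implementation) of O(log k) code-point indexing over UTF-16 code units: alongside the units, keep a sorted auxiliary array holding the code-point index of each surrogate pair, and translate a code-point index to a code-unit index with a binary search over that array. The class name and details are invented for illustration.

```python
from bisect import bisect_left

class U16String:
    """Sketch: code-point len/indexing over UTF-16 code units.

    Indexing costs O(log k), where k is the number of non-BMP
    characters (surrogate pairs); O(1) when the string is all-BMP.
    The auxiliary array uses O(k) extra space.
    """

    def __init__(self, s):
        # Simulate UTF-16 storage: encode, then split into 16-bit units.
        data = s.encode('utf-16-le')
        self.units = [int.from_bytes(data[i:i + 2], 'little')
                      for i in range(0, len(data), 2)]
        # Auxiliary array: the *code-point* index of each surrogate pair.
        self.astral = []
        cp = i = 0
        while i < len(self.units):
            if 0xD800 <= self.units[i] <= 0xDBFF:  # lead surrogate
                self.astral.append(cp)
                i += 2
            else:
                i += 1
            cp += 1
        self.cp_len = cp

    def __len__(self):
        # Length in code points, not code units.
        return self.cp_len

    def __getitem__(self, i):
        if not 0 <= i < self.cp_len:
            raise IndexError(i)
        # Code-unit index = code-point index plus the number of
        # surrogate pairs that occur before it (one binary search).
        u = i + bisect_left(self.astral, i)
        lead = self.units[u]
        if 0xD800 <= lead <= 0xDBFF:
            trail = self.units[u + 1]
            return chr(0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00))
        return chr(lead)
```

For an all-BMP string the auxiliary array is empty and indexing degenerates to plain array access, so BMP-only code pays essentially nothing.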
Maintainability partly depends on whether all-codepoint support is built
in or bolted on to a BMP-only implementation burdened with back
compatibility for a code unit API. Maintainability is probably harder
with a separate UTF-32 type, which CPython has but which I gather Jython
and IronPython do not. It might or might not be easier if there were a
separate internal character type containing a 32-bit code point value,
so that iteration and indexing (and single char slicing) always
returned the same type of object regardless of whether the character was
in the BMP or not. This certainly would help all the unicode database
functions.
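A minimal sketch of what such a character type might look like (the name and details are hypothetical, not part of any existing implementation): a small object wrapping a 32-bit code point, so BMP and non-BMP characters come back as the same type.

```python
class Char:
    """Hypothetical character object holding a full 32-bit code point,
    so indexing/iteration return one uniform type for BMP and non-BMP
    characters alike."""

    __slots__ = ('cp',)

    def __init__(self, cp):
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("code point out of range")
        self.cp = cp

    def __eq__(self, other):
        return isinstance(other, Char) and self.cp == other.cp

    def __repr__(self):
        return f"Char(U+{self.cp:04X})"

    def __str__(self):
        # Materialize as an ordinary one-character string when needed.
        return chr(self.cp)
```

Unicode database lookups could then take a `Char` directly, without the caller ever worrying about surrogate halves.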
Tom Christiansen appears to have said that Perl uses, or will use, UTF-8
plus auxiliary arrays. If so, we will find out if they can maintain it.
---
Terry Jan Reedy