[Python-Dev] PEP 393 Summer of Code Project

Mon Aug 29 04:20:12 CEST 2011

Paul Moore writes:

 > IronPython and Jython can retain UTF-16 as their native form if that
 > makes interop cleaner, but in doing so they need to ensure that basic
 > operations like indexing and len work in terms of code points, not
 > code units, if they are to conform.

[...]

 > They lose the O(1) guarantee, but that's easily defensible as a
 > tradeoff to conform to underlying runtime semantics.

Unfortunately, I don't think it's all that easy to defend.  Absent PEP
393 or a restriction to the characters in the BMP, this is a very
expensive change, easily visible to interactive users, let alone
performance-hungry applications.

I personally do advocate the "array of code points" definition, but I
don't use IronPython or Jython so PEP 393 is as close to heaven as I
expect to get.  OTOH, I also use Emacsen with Mule, and I have to
admit that there is a perceptible performance hit in any large (>1 MB)
buffer containing non-ASCII characters vs. pure ASCII (the code unit
in Mule is 1 byte).  I expect that if IronPython and Jython really
want to retain native, code-unit-based representations, it's going to
be painful to conform to an "array of code points" specification.
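
To make the cost concrete, here is a minimal sketch of what "indexing
and len in terms of code points" involves when the storage is UTF-16
code units.  It is not taken from IronPython or Jython; the helper
names (utf16_units, codepoint_at, codepoint_len) are hypothetical, and
it assumes well-formed input with no lone surrogates.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (illustrative helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def _width(units, i):
    """Number of code units (1 or 2) used by the code point starting at units[i]."""
    if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF):
        return 2  # high surrogate followed by low surrogate: one non-BMP code point
    return 1

def codepoint_len(units):
    """Count code points by walking every code unit: O(n), not O(1)."""
    count = 0
    i = 0
    while i < len(units):
        i += _width(units, i)
        count += 1
    return count

def codepoint_at(units, index):
    """Return the code point at code-point position index: a linear scan from the start."""
    if index < 0:
        raise IndexError("negative indices are not handled in this sketch")
    i = 0
    while i < len(units):
        w = _width(units, i)
        if index == 0:
            if w == 2:
                # Combine the surrogate pair into a single code point.
                return 0x10000 + ((units[i] - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            return units[i]
        index -= 1
        i += w
    raise IndexError("code point index out of range")

units = utf16_units("a\U0001F600b")
print(len(units), codepoint_len(units))   # code units vs. code points
print(hex(codepoint_at(units, 1)))
```

Every lookup has to walk the units before the requested position,
because any earlier pair shifts the offset.  Restricting strings to the
BMP makes _width always return 1, which is what restores O(1) indexing.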

There may need to be a compromise of the form "Implementations SHOULD
provide an implementation of str that is both O(1) in indexing and an
array of code points.  Code that is Unicode-ly correct in a Python
implementing PEP 393 will need to be ported with some effort to
implementations that do not satisfy this requirement, perhaps using
different algorithms or extra libraries."