[Python-Dev] PEP 393 Summer of Code Project

Stephen J. Turnbull stephen at xemacs.org
Mon Aug 29 04:20:12 CEST 2011


Paul Moore writes:

 > IronPython and Jython can retain UTF-16 as their native form if that
 > makes interop cleaner, but in doing so they need to ensure that basic
 > operations like indexing and len work in terms of code points, not
 > code units, if they are to conform.

[...]

 > They lose the O(1) guarantee, but that's easily defensible as a
 > tradeoff to conform to underlying runtime semantics.

Unfortunately, I don't think it's all that easy to defend.  Absent PEP
393 or a restriction to the characters in the BMP, giving up O(1)
indexing is a very expensive change, easily visible to interactive
users, let alone performance-hungry applications.
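
To make the cost concrete, here is a minimal sketch (mine, not code from
IronPython, Jython, or PEP 393) of what indexing by code point looks
like when the underlying storage is UTF-16 code units.  The helper names
utf16_units, is_surrogate_pair, and code_point_at are hypothetical.
Because any unit may be half of a surrogate pair, the lookup has to scan
from the start of the string, so it is O(n) in the index rather than
O(1).

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (no BOM)."""
    data = s.encode("utf-16-le")
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]


def is_surrogate_pair(units, pos):
    """True if units[pos] and units[pos + 1] form a high/low surrogate pair."""
    return (
        pos + 1 < len(units)
        and 0xD800 <= units[pos] <= 0xDBFF
        and 0xDC00 <= units[pos + 1] <= 0xDFFF
    )


def code_point_at(units, index):
    """Return the code point at code-point position index.

    This walks the code units from the beginning, so the cost grows
    linearly with index.
    """
    if index < 0:
        raise IndexError("negative indices are not handled in this sketch")
    pos = 0  # position measured in code units
    for _ in range(index):
        if pos >= len(units):
            raise IndexError("code point index out of range")
        pos += 2 if is_surrogate_pair(units, pos) else 1
    if pos >= len(units):
        raise IndexError("code point index out of range")
    if is_surrogate_pair(units, pos):
        high, low = units[pos], units[pos + 1]
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return units[pos]


units = utf16_units("a\U0001F600b")
# Four code units, but only three code points.
print(len(units), hex(code_point_at(units, 1)), chr(code_point_at(units, 2)))
```

An implementation could presumably cache offsets or keep a "no
surrogates present" flag to recover O(1) in the common case, but that is
extra machinery layered on top of the native representation.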

I personally do advocate the "array of code points" definition, but I
don't use IronPython or Jython so PEP 393 is as close to heaven as I
expect to get.  OTOH, I also use Emacsen with Mule, and I have to
admit that there is a perceptible performance hit in any large (>1 MB)
buffer containing non-ASCII characters vs. pure ASCII (the code unit
in Mule is 1 byte).  I expect that if IronPython and Jython really
want to retain native, code-unit-based representations, it's going to
be painful to conform to an "array of code points" specification.

There may need to be a compromise of the form "Implementations SHOULD
provide an implementation of str that is both O(1) in indexing and an
array of code points.  Code that is Unicode-ly correct in a Python
implementing PEP 393 will need to be ported with some effort to
implementations that do not satisfy this requirement, perhaps using
different algorithms or extra libraries."
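
As one guess at what "different algorithms" might mean in practice: code
that indexes by position inside a loop could be rewritten to make a
single sequential pass over the code points, which stays O(n) overall
even on a code-unit-based representation.  The sketch below is only an
illustration under that assumption; iter_code_points is a hypothetical
helper, not an API of any of the implementations discussed.

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units in one pass."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        # A high surrogate followed by a low surrogate encodes one
        # code point outside the BMP.
        if (
            0xD800 <= u <= 0xDBFF
            and i + 1 < n
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        ):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character (or an unpaired surrogate, passed through as-is).
            yield u
            i += 1


# "a", U+1F600, "b" as UTF-16 code units.
sample = [0x61, 0xD83D, 0xDE00, 0x62]
print([hex(cp) for cp in iter_code_points(sample)])
```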

