[Python-Dev] len(chr(i)) = 2?

Tue Nov 23 20:11:06 CET 2010

On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger
<raymond.hettinger at gmail.com> wrote:
..
> Any explanation we give users needs to let them know two things:
> * that we cover the entire range of unicode not just BMP
> * that sometimes len(chr(i)) is one and sometimes two

This discussion motivated me to start looking into how well the Python
library itself is prepared to deal with len(chr(i)) = 2.  I was not
surprised to find that textwrap does not handle the issue well:

>>> from textwrap import wrap
>>> len(wrap(' \U00010140' * 80, 20))
12
>>> len(wrap(' \U00000140' * 80, 20))
8
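
The difference can be reproduced on any build by counting UTF-16 code
units explicitly.  This is only a sketch: as_code_units is a
hypothetical helper of mine, and it assumes it is run on a build where
iterating over a str yields whole code points.  It pads each non-BMP
character out to two columns, which is how textwrap sees it when
len(chr(i)) is 2.  If I have the arithmetic right, a 20-column line
then fits ten one-unit words but only six or seven two-unit words,
which accounts for 8 lines versus 12.

```python
from textwrap import wrap

def utf16_units(s):
    """Number of UTF-16 code units needed to encode s."""
    return len(s.encode('utf-16-le')) // 2

def as_code_units(s):
    """Hypothetical helper: keep whitespace as is, and replace every other
    character with one 'x' per UTF-16 code unit it would occupy."""
    return ''.join(c if c.isspace() else 'x' * utf16_units(c) for c in s)

bmp = ' \U00000140' * 80       # one code unit per character
non_bmp = ' \U00010140' * 80   # two code units (a surrogate pair) per character

# Code units per character, then wrapped line counts for each string.
print(utf16_units('\U00000140'), utf16_units('\U00010140'))
print(len(wrap(as_code_units(bmp), 20)), len(wrap(as_code_units(non_bmp), 20)))
```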

That module should probably be rewritten to properly implement the
Unicode line breaking algorithm
<http://unicode.org/reports/tr14/tr14-22.html>.

Yet finding a bug in a str object method after a five-minute review
was a bit discouraging:

>>> 'xyz'.center(20, '\U00010140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long
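
Until that is fixed, the padding can be built by hand, since string
repetition does not care how many code units the fill occupies.  The
following is a rough sketch, not a proposed patch: center_any is a
hypothetical name, the width is measured in whatever units len()
reports on the running build, and the rounding rule is my assumption
about what str.center does for odd padding.

```python
def center_any(text, width, fill=' '):
    """Hypothetical helper: center text in a field of the given width,
    padding with copies of fill, without checking len(fill) == 1."""
    pad = width - len(text)
    if pad <= 0:
        return text
    # Assumed to match str.center's choice of where the odd pad cell goes.
    left = pad // 2 + (pad & width & 1)
    return fill * left + text + fill * (pad - left)

# The call that raises TypeError above when len(chr(0x10140)) == 2.
print(center_any('xyz', 20, '\U00010140'))
```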

Given the apparent difficulty of writing even basic text processing
algorithms in the presence of surrogate pairs, I wonder how wise it is
to expose Python users to them.  As Wikipedia explains [1]:

"""
Because the most commonly used characters are all in the Basic
Multilingual Plane, converting between surrogate pairs and the
original values is often not tested thoroughly. This leads to
persistent bugs, and potential security holes, even in popular and
well-reviewed application software.
"""

Since UCS-2 (the Character Encoding Form (CEF)) is now defined [2] to
cover only the BMP, maybe rather than changing the terms used in the
reference manual, we should tighten the code to conform to the updated
standards?
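
For reference, the conversion between a non-BMP code point and its
surrogate pair, which the Wikipedia passage says is often poorly
tested, is only a few lines of arithmetic.  A minimal sketch follows;
the function names are mine, not an existing API.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    v = cp - 0x10000                      # 20-bit offset past the BMP
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x10140)
print(hex(high), hex(low), hex(from_surrogates(high, low)))
```

Any code that indexes or slices such a string by code unit has to make
sure it never separates the two halves.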

Again, given that the str object itself has at least one non-BMP
character bug as we are closing in on the third major release of py3k,
how likely are third-party developers to get their libraries right as
they port to 3.x?

[1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
[2] http://unicode.org/reports/tr17/#CharacterEncodingForm