[Python-Dev] len(chr(i)) = 2?

Wed Nov 24 18:37:43 CET 2010

On Tue, Nov 23, 2010 at 2:18 PM, Amaury Forgeot d'Arc
<amauryfa at gmail.com> wrote:
..
>> Given the apparent difficulty of writing even basic text processing
>> algorithms in presence of surrogate pairs, I wonder how wise it is to
>> expose Python users to them.
>
> This was already discussed two years ago:
>
> http://mail.python.org/pipermail/python-dev/2008-July/080900.html
>

Thanks for the link.   Let me summarize that discussion as I read it.

The discussion starts with a reference to Guido's 2001 post which concluded with

"""
... if we had wanted to use a
variable-lenth internal representation, we should have picked UTF-8
way back, like Perl did.  Moving to a UTF-16-based internal
representation now will give us all the problems of the Perl choice
without any of the benefits.
""" [1]

and proposes to move to UCS-4 completely for Python 3.0.  Note that
this is not the option that I would like to discuss here.   I don't
propose to discuss abandoning narrow builds.  Instead, I would like to
discuss the costs and benefits associated with using variable width
CES as an internal representation.  This is where the 2008 discussion
moved.  The OP did not realize that narrow builds supported UTF-16 and,
like myself, was surprised that application developers need to be aware
of surrogates if they want to use narrow builds.  It was also suggested
that Python itself is likely to have many bugs that can be triggered
by non-BMP characters on narrow builds.  Guido's response was:

"""
I'd also prefer to receive bug reports about breakages actually
encountered in the wild than purely theoretical issues
"""

I don't think this is a good position to take.  Programs that expect
one code unit where Python may produce two are likely to have security
holes.  Even when programmers carefully sanitize their input, they are
likely to do it at the code point level, based on Unicode category, and
the 0xFFFF boundary does not mean anything special to their
applications.  I think anyone who wants to write a robust application
has two choices in practice:  (a) use a wide Unicode build; (b) restrict
all text to the BMP.  Supporting surrogates at the application level is
likely to be prohibitively expensive.

It was later suggested that the main benefit of "UTF-16" builds is
that they can easily interface with system libraries that are "UTF-16"
based.  However, how likely are these libraries to be bug-free when it
comes to non-BMP characters?  History teaches us: not very likely.

Daniel Arbuckle presented arguments against imposing the burden of
dealing with surrogates on application writers. [2]

The recurrent theme on the thread was that non-BMP characters are rare
and those who need them can afford the extra development cost
associated with the surrogates.  This point was very eloquently
articulated by Guido:

"""
Who are the many here? Who are the few? I'd venture that (at least for
the foreseeable future, say, until China will finally have taken over
the role of the US as the de-facto dominant super power :-) the many
are people whose app will never see a Unicode character outside the
BMP, or who do such minimal string processing that their code doesn't
care whether it's handling UTF-16-encoded data.
""" [3]

This argument can also be used to support the position that narrow
builds should not support non-BMP characters.

Later the discussion started resembling this thread when it went into
a scholastic dispute over fine points in Unicode Standard terminology.
:-)

Then the BDFL vetoed len(u"\U00012345") returning 1 on narrow builds. [4]
I would be against that as well.  I don't see len("\U00012345") == 2
as a big problem because application developers can simply avoid using
\U literals if they don't want to support non-BMP characters.  On the
other hand, an option to warn users about non-BMP literals on a narrow
build may be useful but it is easy to implement in lint-like tools.
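
As a rough illustration of how little a lint-like check needs, here is
a minimal sketch.  The function name find_non_bmp_literals is
hypothetical, and the sketch assumes an interpreter whose ast module
exposes string literals as ast.Constant nodes, which is newer than the
interpreters discussed in this thread.  It flags both code points above
0xFFFF and surrogate code units, since the latter is how a non-BMP
literal shows up on a narrow build.

```python
import ast

BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def _outside_bmp(ch):
    """True for a non-BMP code point or for a surrogate code unit."""
    cp = ord(ch)
    return cp > BMP_MAX or 0xD800 <= cp <= 0xDFFF


def find_non_bmp_literals(source):
    """Return sorted (line, column) pairs of offending string literals.

    Hypothetical helper: parses the source text and inspects every
    string constant, including docstrings.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if any(_outside_bmp(ch) for ch in node.value):
                hits.append((node.lineno, node.col_offset))
    return sorted(hits)
```

A real tool would presumably report the file name as well and let the
user silence individual warnings.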

There were multiple suggestions for standard library additions to help
application writers to deal with surrogate pairs, but as far as I can
tell, nothing has been done in this area in the following two years.
I don't think there is a recipe for fixing a legacy
character-by-character processing loop such as

   for c in string:
      ...

to make it iterate over code points consistently in wide and narrow
builds.  (Note that I am not asking for a grapheme iterator here.
This is clearly an application level feature.)
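
To make the missing recipe concrete, here is one possible sketch, not
an existing library function.  The name iter_code_points is
hypothetical.  It yields integers rather than one-character strings
because, on a narrow build, a non-BMP code point cannot be represented
as a string of length 1.  A high surrogate followed by a low surrogate
is combined into one value; a lone surrogate is passed through
unchanged, which is an arbitrary choice made for this sketch.

```python
def iter_code_points(text):
    """Yield integer code points, joining each surrogate pair into one value."""
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # A high surrogate directly followed by a low surrogate encodes
        # a single code point in the range 0x10000..0x10FFFF.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 2
                continue
        # BMP character, lone surrogate, or (on a wide build) a non-BMP
        # character already stored as a single item.
        yield unit
        i += 1
```

With this, "for cp in iter_code_points(string):" sees the same sequence
of values whether the string stores U+12345 as one item or as the pair
U+D808 U+DF45.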

> So yes, wrap() and center() should be fixed.

I opened issue 10521 for that. [5]  I am fully prepared to see it
dismissed as "theoretical" and closed as "won't fix", or left to linger
indefinitely.  Fixing it would most likely involve writing a second
version of the pad() utility function specifically for the narrow build.
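
At the Python level, the idea behind such a fix can be sketched as
follows.  This is only an illustration under stated assumptions, not
the C code in question: the names count_code_points and
center_code_points are hypothetical, the fill character is assumed to
be a single BMP character, and the placement of an odd leftover fill
character (on the right) is a simplification that need not match
str.center().

```python
def count_code_points(text):
    """Count code points, treating a high+low surrogate pair as one."""
    pairs = 0
    for i in range(len(text) - 1):
        if (0xD800 <= ord(text[i]) <= 0xDBFF
                and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF):
            pairs += 1
    return len(text) - pairs


def center_code_points(text, width, fill=" "):
    """Center text in a field that is `width` code points wide."""
    pad = width - count_code_points(text)
    if pad <= 0:
        return text
    left = pad // 2
    # Any odd leftover fill character goes on the right in this sketch.
    return fill * left + text + fill * (pad - left)
```

The only difference from naive padding is that the width is measured
with count_code_points() instead of len().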

All examples I've seen in Python C code of dealing with surrogates
came with hand-coded #ifndef Py_UNICODE_WIDE fragments and no
user-friendly macros or APIs that would abstract it away.

A quick grep for maxunicode in the standard library revealed only one
case of "narrow-build aware" code:

        if sys.maxunicode != 65535:
            # XXX: negation does not work with big charsets
            return charset

See  Lib/sre_compile.py.  Not exactly a model to follow.

To conclude, I feel that rather than trying to fully support non-BMP
characters as surrogate pairs in narrow builds, we should make it
easier for application developers to avoid them.  If abandoning
internal use of UTF-16 is not an option, I think we should at least
add an option for decoders that currently produce surrogate pairs to
treat non-BMP characters as errors and handle them according to the
user's choice.
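
To show what I have in mind, here is a sketch of the behavior written
as a wrapper around bytes.decode().  Nothing like this exists in the
codecs today: the function name decode_bmp_only and its non_bmp
parameter are hypothetical, and a real implementation would live inside
the decoders rather than post-process their output.

```python
def decode_bmp_only(data, encoding="utf-8", non_bmp="strict"):
    """Decode bytes, treating every non-BMP character as an error.

    non_bmp selects the handling: "strict" raises ValueError,
    "replace" substitutes U+FFFD, "ignore" drops the character.
    """
    if non_bmp not in ("strict", "replace", "ignore"):
        raise ValueError("unknown non_bmp handler: %r" % (non_bmp,))
    text = data.decode(encoding)
    out = []
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # On a narrow build a non-BMP character arrives as a surrogate pair.
        is_pair = (0xD800 <= unit <= 0xDBFF and i + 1 < n
                   and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF)
        if unit > 0xFFFF or is_pair:
            if non_bmp == "strict":
                raise ValueError("non-BMP character at index %d" % i)
            if non_bmp == "replace":
                out.append("\ufffd")
            i += 2 if is_pair else 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

For example, decoding the UTF-8 bytes of "a\U00012345b" with
non_bmp="replace" gives "a\ufffdb", a string that is safe for code that
assumes one item per character.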

[1] http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html
[2] http://mail.python.org/pipermail/python-dev/2008-July/080912.html
[3] http://mail.python.org/pipermail/python-dev/2008-July/080940.html
[4] http://mail.python.org/pipermail/python-dev/2008-July/080916.html
[5] http://bugs.python.org/issue10521