[Python-Dev] len(chr(i)) = 2?

Alexander Belopolsky alexander.belopolsky at gmail.com
Wed Nov 24 21:06:25 CET 2010


On Wed, Nov 24, 2010 at 1:50 PM, M.-A. Lemburg <mal at egenix.com> wrote:
..
>> add an option for decoders that currently produce surrogate pairs to
>> treat non-BMP characters as errors and handle them according to user's
>> choice.
>
> But what do you gain by doing this ? You'd lose the round-trip
> safety of those codecs and that's not a good thing.
>

Any non-trivial text processing is likely to be broken in the presence
of surrogates.  Producing them on input is just trading a known issue
for an unknown one.  Processing surrogate pairs in Python code is hard.
Software that has to support non-BMP characters will most likely be
written for a wide build and contain subtle bugs when run under a
narrow build.  Note that my latest proposal does not abolish
surrogates outright.  Users who want them can still use something like
a "surrogateescape" error handler for non-BMP characters.

> Since we're not going change the semantics of those APIs,
> it is OK to not support padding with non-BMP code points on
> UCS-2 builds.
>

Well, I think more users are willing to accept slightly misaligned
text in their web-app logs than are willing to cope with

Traceback (most recent call last):
  ...
TypeError: The fill character must be exactly one character long

there.
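
The trigger can be as innocent as passing an arbitrary non-BMP character
(U+10400 below) as the fill character; on a narrow build it occupies two
code units, hence the TypeError quoted above, while a wide build accepts it:

'abc'.center(10, '\U00010400')   # len of the fill char is 2 on a narrow
                                 # build -> TypeError; 1 on a wide build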

Yes, letting untrusted users specify the fill character is an unlikely
scenario, but it is quite likely that naive slicing or iteration over
string code units will result in

Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
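
A minimal reproduction along these lines is all it takes (the exact slice
depends on the data, of course):

s = '\U00010000'
half = s[:1]               # '\ud800' on a narrow build; the whole
                           # character on a wide build
'\ud800'.encode('utf-8')   # a lone surrogate raises the UnicodeEncodeError
                           # shown above on any build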

> Supporting such cases would only cause problems:
>
> * if the methods would pad with surrogates, the resulting
>  string would no longer have length n; breaking the
>  assumption that len(str.center(n)) == n
>

I agree, but how is this different from breaking the assumption that
len(chr(i)) == 1?
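
That is, after all, what the subject line is about; with an arbitrary
non-BMP code point:

i = 0x10000
print(len(chr(i)))   # 2 on a narrow build, 1 on a wide build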

> * if the methods would pad with half the number of surroagtes
>  to make sure that len(str.center(n)) == n, the resulting
>  output to e.g. a terminal would be further off, than what
>  you already have with surrogates and combining code points
>  in the original string.
>

I agree again.  What I suggested on the tracker is that supporting
non-BMP characters in narrow builds should mean that library functions,
given inputs that represent the same UCS-4 text on both builds, produce
outputs that also represent the same UCS-4 text.
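
As a rough illustration of that invariant (using str.center() and an
arbitrary non-BMP character), today's narrow-build behavior violates it:

s = '\U00010400'
out = s.center(5)            # pads by code units, not code points
print(out.encode('utf-8'))   # narrow build: 3 spaces of padding around the
                             # character (4 code points in total); wide
                             # build: 4 spaces (5 code points in total)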


> Perhaps it's time to reconsider a project I once started
> but that never got off the ground:
>
>  http://mail.python.org/pipermail/python-dev/2008-July/080911.html
>
> Here's the pre-PEP:
>
>  http://mail.python.org/pipermail/python-dev/2001-July/015938.html

I agree again, but I feel that exposing code units rather than code
points at the Python string level takes us back to the 2.x days of
mixing bytes and strings.

Let me quote Guido circa 2001 again:

"""
... if we had wanted to use a
variable-lenth internal representation, we should have picked UTF-8
way back, like Perl did.  Moving to a UTF-16-based internal
representation now will give us all the problems of the Perl choice
without any of the benefits.
"""

I don't understand what has changed since 2001 to make this argument
invalid.  I note that an opinion has been raised in this thread that
if we want a compressed internal representation for strings, we should
use UTF-8.  I tend to agree, but UTF-8 has been repeatedly rejected as
too hard to implement.  What makes UTF-16 easier than UTF-8?  Only the
fact that you can ignore its bugs for longer, in my view.

