[Python-Dev] len(chr(i)) = 2?

Stephen J. Turnbull stephen at xemacs.org
Wed Nov 24 03:29:47 CET 2010


Alexander Belopolsky writes:

 > Yet finding a bug in a str object method after a 5 min review was a
 > bit discouraging:
 > 
 > >>> 'xyz'.center(20, '\U00010140')
 > Traceback (most recent call last):
 >   File "<stdin>", line 1, in <module>
 > TypeError: The fill character must be exactly one character long
 > 
 > Given the apparent difficulty of writing even basic text processing
 > algorithms in presence of surrogate pairs, I wonder how wise it is to
 > expose Python users to them.

"Consenting adults" applies here.

What to do?  Write tests, fix the stdlib.  Raise the probability of
surrogate pair tests in the fuzzer.
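A minimal sketch of the kind of regression test meant here, reusing the fill character from the quoted traceback. The helper name check_center_non_bmp is hypothetical, and the checks are written to avoid calling len() on the padded result, since that is exactly the value that differs between narrow and wide builds:

```python
FILL = "\U00010140"  # the non-BMP fill character from the quoted traceback


def check_center_non_bmp(text="xyz", width=20):
    """Center `text` with a non-BMP fill character and sanity-check the result.

    Hypothetical helper, not part of any existing test suite. On a build
    with the bug quoted above, the call to center() raises TypeError.
    """
    padded = text.center(width, FILL)
    # The original text must survive, surrounded only by fill characters.
    assert padded.replace(FILL, "") == text
    # One fill character per missing column.
    assert padded.count(FILL) == width - len(text)
    return padded


padded = check_center_non_bmp()
```

If center() rejects the fill character, the TypeError propagates and the check fails loudly, which is the desired behaviour for a regression test.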

But "expose the users to surrogate pairs in an efficient (i.e., UCS-2)
implementation" is a fundamental design principle of Python.
Tightening up the internal implementation is unacceptable (-10) IMO.
YMMV.
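For readers who have not met surrogate pairs, here is a rough sketch of the arithmetic that turns one non-BMP code point into two 16-bit code units, which is why a UCS-2 build reports a length of 2 for such a character. The function name utf16_surrogates is made up for illustration; the cross-check uses the standard "utf-16-be" codec:

```python
def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = utf16_surrogates(0x10140)

# Cross-check against the codec: two 16-bit units, big-endian.
encoded = "\U00010140".encode("utf-16-be")
code_units = len(encoded) // 2
```

For U+10140 this gives the pair 0xD800, 0xDD40, matching the four bytes the codec produces.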

 > Again, given that the str object itself has at least one non-BMP
 > character bug as we are closing on the third major release of py3k,
 > how likely are 3rd party developers to get their libraries right as
 > they port to 3.x?

Not our problem, really.  We need to fix the stdlib, but the authors of
3rd party libraries know what they're doing.

I guess we could provide a fuzztest module that generates known nasty
data (zero, very big numbers, "\x00", "\U00010140", etc.) that people
would be able to plug in as a data source for their own code.

Of course that doesn't replace conventional unittests based on
analysis of edge cases and tests designed to tickle them, but it would
be a start for many projects.
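To make the idea concrete, here is a very rough sketch of what such a data source might look like. No fuzztest module exists; the name nasty_values, its parameters, and the particular values in the pools are all assumptions for illustration. The non_bmp_weight parameter is one simple way to raise the probability of surrogate pair cases:

```python
import random
import sys

NON_BMP = "\U00010140"

# Illustrative pools only; a real module would need a far longer list.
NASTY_INTS = [0, -1, sys.maxsize, -sys.maxsize - 1, 2**64, 10**100]
NASTY_STRS = ["", "\x00", NON_BMP, "a" + NON_BMP + "b"]


def nasty_values(count, seed=0, non_bmp_weight=3):
    """Return `count` values drawn from pools of known-awkward inputs.

    `non_bmp_weight` adds extra copies of the non-BMP string to the pool,
    so that surrogate pair cases come up more often. A fixed `seed` keeps
    a failing run reproducible.
    """
    rng = random.Random(seed)
    pool = NASTY_INTS + NASTY_STRS + [NON_BMP] * non_bmp_weight
    return [rng.choice(pool) for _ in range(count)]


values = nasty_values(50, seed=1)
```

A project could then loop over nasty_values(...) and feed each value to the function under test, recording the seed whenever something breaks.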



