[Python-Dev] len(chr(i)) = 2?

M.-A. Lemburg mal at egenix.com
Thu Nov 25 10:51:09 CET 2010


Terry Reedy wrote:
> On 11/24/2010 3:06 PM, Alexander Belopolsky wrote:
> 
>> Any non-trivial text processing is likely to be broken in the presence of
>> surrogates.  Producing them on input is just trading a known issue for
>> an unknown one.  Processing surrogate pairs in Python code is hard.
>> Software that has to support non-BMP characters will most likely be
>> written for a wide build and contain subtle bugs when run under a
>> narrow build.  Note that my latest proposal does not abolish
>> surrogates outright.  Users who want them can still use something like
>> the "surrogateescape" error handler for non-BMP characters.
> 
> It seems to me that what you are asking for is an alternate, optional,
> utf-8-bmp codec that would raise an error, in either direction, for
> non-bmp chars. Then, as you suggest, if one is not prepared for
> surrogates, they are not allowed.

That would be a possibility as well... but I doubt that many users
are going to bother. Slicing a surrogate pair is just as bad as
slicing a combining sequence, and combining code points are much more
common in real life; most of them live in the BMP.
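
As a rough illustration of the comparison above, the sketch below slices
a combining sequence and splits a non-BMP character into its surrogate
pair. It assumes Python 3.3 or later, where narrow builds no longer
exist, so the narrow-build view is only simulated through UTF-16
encoding. The helper names utf16_units and split_surrogate_pair are
hypothetical and exist only for this example.

```python
import unicodedata


def utf16_units(s):
    """Count UTF-16 code units, which is what len(s) reported on a narrow build."""
    return len(s.encode("utf-16-le")) // 2


def split_surrogate_pair(ch):
    """Return the (high, low) surrogate values of a single non-BMP character."""
    data = ch.encode("utf-16-le")
    high = int.from_bytes(data[0:2], "little")
    low = int.from_bytes(data[2:4], "little")
    return high, low


# A combining sequence made only of BMP code points:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
accented = "e\u0301"
clipped = accented[:1]  # slicing between the two code points drops the accent

# The second code point is a combining mark (non-zero combining class).
is_combining = unicodedata.combining(accented[1]) != 0

# A non-BMP character: MUSICAL SYMBOL G CLEF (U+1D11E).
clef = "\U0001D11E"
clef_units = utf16_units(clef)            # two code units in UTF-16
clef_pair = split_surrogate_pair(clef)    # the two halves a narrow build exposed
```

In both cases a slice taken at the wrong index leaves half of what the
reader sees as one character, but the first case needs no non-BMP input
at all.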

-- 
Marc-Andre Lemburg
eGenix.com
