[Python-ideas] Processing surrogates in

Serhiy Storchaka storchaka at gmail.com
Mon May 4 23:57:56 CEST 2015


On 05.05.15 00:21, Stephen J. Turnbull wrote:
> Serhiy Storchaka writes:
>   > Issue18814 proposes several functions for working with surrogate and
>   > astral characters. All of these functions take a string and return a
>   > string.
>
> What's the use case?  As far as I can see, in recent Python 3 PEP 393
> is implemented, so non-BMP characters are represented as themselves,
> not as surrogate pairs.  In a PEP 393-enabled Python, the only
> surrogates should be those due to surrogateescape error handling on
> input, and chr().  If you don't like the former, be careful about your
> use of surrogateescape, and the latter is clearly a "consenting
> adults" issue.

Use cases include programs that use tkinter (common builds of Tcl/Tk
don't accept non-BMP characters), email, or wsgiref.
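
As a rough illustration of the tkinter case, here is a minimal sketch that
replaces every non-BMP character before the string is handed to a BMP-only
consumer. The helper name replace_astral and the choice of U+FFFD as the
replacement are assumptions made for this example; they are not the
functions proposed in the issue.

```python
import re

# Matches any code point above U+FFFF, i.e. outside the BMP.
_ASTRAL = re.compile('[\U00010000-\U0010FFFF]')

def replace_astral(s, replacement='\ufffd'):
    """Return s with each non-BMP character replaced (hypothetical helper)."""
    return _ASTRAL.sub(replacement, s)

# U+1F600 is an astral character; BMP characters are left untouched.
label_text = replace_astral('ok \U0001F600')
```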

> Also, you mention that such surrogate characters can be received as
> input, which is true, but the standard codecs should already be
> treating those as errors.

Surrogate characters usually come from decoding with the "surrogatepass" or
"surrogateescape" error handlers. That is why Nick proposed the names
rehandle_surrogatepass and rehandle_surrogateescape.
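
To show where such surrogates come from, and what "rehandling" could mean,
here is a small sketch. The function name is borrowed from the proposal, but
its signature and body are only an assumption about what such a function
might do, not the implementation from the issue.

```python
def rehandle_surrogateescape(s, errors='replace'):
    """Hypothetical sketch: re-decode escaped bytes with another handler."""
    # Encoding with surrogateescape turns each U+DC80..U+DCFF back into the
    # byte it stood for; decoding again lets `errors` deal with those bytes.
    return s.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

raw = b'caf\xe9'                                  # Latin-1 bytes, invalid as UTF-8
escaped = raw.decode('utf-8', 'surrogateescape')  # byte 0xE9 becomes U+DCE9
cleaned = rehandle_surrogateescape(escaped)       # U+DCE9 becomes U+FFFD
```

This round trip has limits: it raises UnicodeEncodeError for surrogates
outside U+DC80..U+DCFF, and adjacent escaped bytes that happen to form valid
UTF-8 are decoded as a real character instead of being passed to the handler.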

> So as far as I can see, the existing codecs and error handlers already
> can deal with any case I might run into in practice.

See issue18814. It is not so easy to get the desired result. Perhaps the
simplest and most efficient way is to use regular expressions, which is
what the Python implementations do, but a C implementation can be much
more efficient.
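
For example, a regular-expression approach to joining surrogate pairs into
astral characters could look like the sketch below. The name
decode_surrogate_pairs is made up for this example, and the code is not
copied from the patches in the issue.

```python
import re

# A high surrogate immediately followed by a low surrogate.
_PAIR = re.compile('[\ud800-\udbff][\udc00-\udfff]')

def _join(match):
    hi, lo = match.group()
    # Standard UTF-16 arithmetic for combining a surrogate pair.
    return chr(0x10000 + ((ord(hi) - 0xD800) << 10) + (ord(lo) - 0xDC00))

def decode_surrogate_pairs(s):
    """Replace each surrogate pair with the astral character it encodes."""
    return _PAIR.sub(_join, s)

joined = decode_surrogate_pairs('x\ud83d\ude00')  # pair for U+1F600
```

Lone surrogates are left unchanged, because the pattern only matches
complete pairs.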



