[issue18814] Add utilities to "clean" surrogate code points from strings

Sun Sep 27 14:32:15 CEST 2015

Nick Coghlan added the comment:

As far as the rationale for adding the functions at all goes, my main interest is still in having somewhere in the codecs module documentation to *define the problem*, and to my mind that entails also offering a simple way to do the relevant pre-/post-processing.

The nice aspect of building any related capabilities atop the standard error handlers is that it also means that third party modules can provide custom error handlers to support further escaping techniques, and those will also be available for use in decoding and encoding operations, rather than being specific to pre-/post-processing of the data.
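As a rough illustration of that point, here is a minimal sketch of a third-party-style error handler registered through `codecs.register_error`. The handler name `angle_escape_demo` and its `<U+XXXX>` output format are hypothetical, chosen only for this example; they are not part of any proposal in this issue. Once registered, the same handler can be used in an ordinary encoding operation or in an encode/decode pass that cleans a string.

```python
import codecs


def angle_escape(exc):
    """Hypothetical handler: render unencodable characters as <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only supports encoding errors.
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+{:04X}>".format(ord(ch)) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


# "angle_escape_demo" is an illustrative name, not a standard handler.
codecs.register_error("angle_escape_demo", angle_escape)

text = "abc\udc80def"  # contains a lone surrogate

# Used directly in an encoding operation...
encoded = text.encode("utf-8", errors="angle_escape_demo")

# ...or as a pre-/post-processing step that yields a clean str.
cleaned = encoded.decode("utf-8")
```

Because the handler is looked up by name, the codec machinery treats it exactly like the built-in handlers.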

However, we're generally going to be talking about the combination of three things: an encoding misconfiguration, *and* processing data that may be corrupted by that misconfiguration, *and* doing something with that data that isn't already handled by a surrogateescape round-trip. That's why I suspect that, in practice, most applications are going to be able to get away with ignoring the problem entirely, especially with C.UTF-8 support coming to Fedora 24, which means the Fedora/RHEL/CentOS ecosystem will be joining the Debian/Ubuntu ecosystem in offering that by default.
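To make the distinction concrete, the following sketch shows a surrogateescape round-trip that works without any cleaning, followed by the case that does not: passing the same string to a strict encoder. The sample bytes are an assumption for illustration (Latin-1 data wrongly decoded as UTF-8), and the two clean-up calls use only the standard `replace` and `backslashreplace` handlers rather than any new API.

```python
raw = b"caf\xe9"  # Latin-1 bytes; not valid UTF-8

# Decoding with surrogateescape smuggles the bad byte through as a
# lone surrogate (U+DCE9) instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# The round-trip case: encoding with the same handler restores the
# original bytes exactly, so nothing needs cleaning.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# The problem case: a strict encoder rejects the lone surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Cleaning via the standard error handlers: encode with a handler
# that removes or escapes the surrogate, then decode again.
cleaned = text.encode("utf-8", errors="replace").decode("utf-8")
escaped = text.encode("utf-8", errors="backslashreplace").decode("utf-8")
```

Only code that leaves the surrogateescape path, such as the strict `encode` call above, ever sees the problem.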

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________