[issue18814] Add tools for "cleaning" surrogate escaped strings

Nick Coghlan report at bugs.python.org
Mon Aug 25 08:59:28 CEST 2014


Nick Coghlan added the comment:

Ideally we'd have string modification support for all the translations we offer as codec error handlers:

* Unicode replacement character ('replace' on input)
* ASCII question mark ('replace' on output)
* Dropping them entirely ('ignore')
* XML character reference ('xmlcharrefreplace')
* Python escape sequence ('backslashreplace')

The reason it's beneficial to be able to do these as string transformations rather than only in the codecs is that you may just be contributing part of the output, with the actual encoding operation handled elsewhere (e.g. you may be storing it in a data structure that will later be encoded as JSON or XML, or my earlier example of generating a list of files to be included in an email). Surrogates are great when you're just passing data straight back to the operating system. They're not so great when you're passing them on to other parts of the application as text. I'd prefer to be able to deal with them closer to the point of origin, at least in some cases.

Now, some of these things *can* be done today using Serhiy's trick of encoding to UTF-8 and then decoding again:

    data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')

However, these two don't work properly:

    data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
    data.encode('utf-8', 'backslashreplace').decode('utf-8')

The reason those don't work is because they'll encode the *surrogate escaped bytes*, rather than the originals.

Mapping the escaped bytes to percent encoding has the same problem - you likely want to do a two step transformation (escaped surrogate -> original byte -> percent encoded value), rather than directly percent encoding the already escaped bytes.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________


More information about the Python-bugs-list mailing list