[issue18814] Add tools for "cleaning" surrogate escaped strings

Sun Aug 24 16:16:53 CEST 2014

Nick Coghlan added the comment:

My main use case is for passing data to other applications that *don't* have their Unicode handling in order - I want to be able to use Python to do the data scrubbing, but at the moment it requires intimate knowledge of the codec error handling system to do it. (I had never even heard of surrogatepass until this evening)

Situation:

What I have: data decoded with surrogateescape
What I want: that same data with all the surrogates gone, replaced with either the Unicode replacement character or an ASCII question mark (which I want will depend on the exact situation)

Assume I am largely clueless about the codec system. I know nothing beyond the fact that Python 3 strings may have smuggled bytes in them and I want to get rid of them because they confuse the application I'm passing them to.

The concrete example that got me thinking about this again was the task of writing filenames into a UTF-8 encoded email, and wanting to scrub the output from os.listdir before writing the list into the email (s/email/web page/ also works).

For issue #22016 I actually suggested doing this as *another* codec error handler ("surrogatereplace"), but Stephen Turnbull convinced me this original idea was better: it should just be a pure data transformation pass on the string, clearing the surrogates out, and leaving me with data that is identical to that I would have had if "surrogatereplace" had been used instead of "surrogateescape" in the first place.

As "errors='replace'" already covers the "ASCII ?" replacement case, that means your proposed "redecode" based solution would cover the rest of my use case.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________