[issue18814] Add codecs.convert_surrogateescape to "clean" surrogate escaped strings

Nick Coghlan report at bugs.python.org
Tue Sep 23 12:51:12 CEST 2014


Nick Coghlan added the comment:

The error handler is called "surrogateescape". That means "convert_surrogateescape" is only ever a single step away from the thought "I want to remove the smuggled bytes from a surrogateescape'd string". It assumes no knowledge on the part of the user beyond the name of the error handler and the fact that it is used to smuggle arbitrary bytes through the Python 3 str type.

Getting from "this string was decoded with the surrogateescape handler and may contain smuggled bytes" to "filter_non_utf8_data" as the relevant cleanup function is a much bigger leap, and one that requires more assumed knowledge on the part of the user. That name also confuses the conceptual purpose of the function (cleaning up the output of the surrogateescape error handler to ensure it is a pure Unicode string) with the internal details of the proposed implementation of that cleanup (encoding to UTF-8 with surrogateescape, then decoding again with a different error handler).
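
For illustration, here is a rough sketch of the cleanup operation described above: encode to UTF-8 with surrogateescape, then decode again with a different error handler. The function below is a hypothetical standalone helper, not an existing codecs API; its signature and the default of "replace" for the second handler are assumptions made for the example.

```python
def convert_surrogateescape(text, errors="replace"):
    """Return text with any surrogateescape-smuggled bytes cleaned up.

    Hypothetical helper for illustration only; the name follows the
    proposal, but the signature and default handler are assumptions.
    """
    # Step 1: turn escaped surrogates (U+DC80..U+DCFF) back into the raw
    # bytes they stand for. Other lone surrogates raise UnicodeEncodeError.
    raw = text.encode("utf-8", "surrogateescape")
    # Step 2: decode again, letting a different error handler deal with
    # whatever bytes are not valid UTF-8.
    return raw.decode("utf-8", errors)


# A Latin-1 encoded byte smuggled through a UTF-8 decode.
smuggled = b"caf\xe9".decode("utf-8", "surrogateescape")
cleaned = convert_surrogateescape(smuggled)

print(ascii(smuggled))  # 'caf\udce9'
print(ascii(cleaned))   # 'caf\ufffd'
```

With errors="ignore" the smuggled byte would be dropped instead of replaced, and with errors="strict" the call would raise UnicodeDecodeError, which may be useful for detecting smuggled data rather than cleaning it.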

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________

