[issue18814] Add tools for "cleaning" surrogate escaped strings
Nick Coghlan
report at bugs.python.org
Sun Aug 24 05:00:11 CEST 2014
Nick Coghlan added the comment:
Based on the latest round of bytes handling discussions on python-dev, I came up with this updated proposal:
import re

# Constant in the string module (akin to string.ascii_letters et al)
escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')

# Helper to ensure a string contains no escaped surrogates
# This allows it to be safely encoded without surrogateescape
_match_surrogates = re.compile('[{}]'.format(escaped_surrogates))
def clean(s, repl='\ufffd'):
    return _match_surrogates.sub(repl, s)

# Helper to redecode a string that was decoded incorrectly
# For example, WSGI strings are passed from the server to the
# framework as latin-1 by default and may need to be redecoded
def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
    return s.encode(old_encoding, old_errors).decode(encoding, errors)

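As a rough usage sketch (not part of the proposal itself), the two helpers might behave as follows. The definitions are repeated so the snippet runs on its own; the file name and the WSGI-style value are made-up inputs chosen only for illustration.

```python
import re

# Definitions repeated from the proposal so this sketch is self-contained.
escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')
_match_surrogates = re.compile('[{}]'.format(escaped_surrogates))

def clean(s, repl='\ufffd'):
    return _match_surrogates.sub(repl, s)

def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
    return s.encode(old_encoding, old_errors).decode(encoding, errors)

# Hypothetical input: a byte string with one undecodable byte (0xff),
# decoded the way os.listdir() and friends would on a POSIX system.
raw_name = b'report\xff.txt'
escaped = raw_name.decode('ascii', errors='surrogateescape')

# The escaped string cannot be encoded strictly...
try:
    escaped.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# ...but the cleaned one can, at the cost of losing the original byte.
cleaned = clean(escaped)
cleaned_bytes = cleaned.encode('utf-8')

# Hypothetical WSGI-style value: UTF-8 bytes that a server decoded as latin-1.
wsgi_value = 'caf\xe9'.encode('utf-8').decode('latin-1')
fixed = redecode(wsgi_value, 'utf-8')
```

Note that clean() is lossy by design: once the surrogate is replaced with U+FFFD, the original byte can no longer be recovered, whereas redecode() is a lossless reinterpretation of the same underlying bytes.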
In addition to covering the concrete use cases David describes, I think these will also serve a useful documentation purpose by highlighting the two main mechanisms for "smuggling" raw binary data through text APIs (i.e. surrogate escapes and latin-1 decoding).
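For readers unfamiliar with the two smuggling mechanisms, here is a minimal sketch using only existing str/bytes methods. The sample bytes are arbitrary and chosen just to include values that are not valid UTF-8.

```python
# Arbitrary sample data: ASCII text followed by two bytes that are invalid UTF-8.
data = b'foo\x80\xff'

# Mechanism 1: surrogate escapes. Undecodable bytes become lone surrogates
# in the range U+DC80..U+DCFF and are restored when encoding with the same
# error handler.
via_surrogates = data.decode('utf-8', errors='surrogateescape')
roundtrip_surrogates = via_surrogates.encode('utf-8', errors='surrogateescape')

# Mechanism 2: latin-1 decoding. Every byte maps to the code point with the
# same value, so encoding back to latin-1 restores the original bytes.
via_latin1 = data.decode('latin-1')
roundtrip_latin1 = via_latin1.encode('latin-1')

# Smuggled surrogates are detectable by inspecting the string;
# latin-1 smuggled data looks like ordinary text.
has_surrogates = any('\udc80' <= ch <= '\udcff' for ch in via_surrogates)
latin1_has_surrogates = any('\udc80' <= ch <= '\udcff' for ch in via_latin1)
```

Both round trips are lossless, but only the surrogate-escaped form can be recognised after the fact; with latin-1 the caller has to know out of band that the string needs redecoding.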
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________