[Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)

Stephen J. Turnbull stephen at xemacs.org
Fri Aug 29 02:32:58 CEST 2014


Nick Coghlan writes:

 > The current proposal on the issue tracker is to instead take advantage of
 > the existing error handlers:
 > 
 >     def convert_surrogateescape(data, errors='replace'):
 >         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
 > 
 > That code is short, but semantically dense

And it doesn't implement your original suggestion of replacement with
'?' (another possibility, for history buffs, is 0x1A, ASCII SUB).  At
least, AFAICT from the docs there's no way to specify the replacement
character; decoding with errors='replace' always inserts U+FFFD.  (If
I knew how to specify it, I would have suggested this code myself.)
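
To make the U+FFFD behaviour concrete, here is a minimal sketch.  The
first function is the one quoted above; everything else is an
assumption added for illustration.  In particular, the handler name
'qmark_replace' and the function question_mark_replace are
hypothetical, and going through codecs.register_error is only one
possible way to get '?', not something proposed in this thread.

```python
import codecs


def convert_surrogateescape(data, errors='replace'):
    # Definition quoted from the issue tracker proposal.
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)


def question_mark_replace(exc):
    # Hypothetical handler: emit one '?' per decoding error, mirroring
    # how the built-in 'replace' handler emits one U+FFFD per error.
    if isinstance(exc, UnicodeDecodeError):
        return ('?', exc.end)
    raise exc


# Registration is process-wide; the name is an illustrative choice.
codecs.register_error('qmark_replace', question_mark_replace)

# A byte string containing one byte that is not valid UTF-8.
raw = b'caf\xe9.txt'

# Decoding with surrogateescape smuggles the bad byte through as U+DCE9.
escaped = raw.decode('utf-8', 'surrogateescape')

# The built-in 'replace' handler always substitutes U+FFFD.
cleaned = convert_surrogateescape(escaped)

# The custom handler substitutes '?' instead.
cleaned_qmark = convert_surrogateescape(escaped, 'qmark_replace')
```

Here cleaned is 'caf\ufffd.txt' and cleaned_qmark is 'caf?.txt'.  Note
that a registered handler is visible to every codec call in the
process, which may or may not be acceptable.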

 > (Added bonus: once you're alerted to the possibility, it's trivial
 > to write your own version for existing Python 3 versions.

I'm not sure that's true.  At least, to me that code was obvious -- I
got the exact definition (except for the function name) on the first
try -- but I ruled it out because it didn't implement your suggestion
of replacement with '?', even as an option.

OTOH, I think a lot of the resistance to codec-based solutions is the
misconception that en/decoding streams is expensive, or the
misconception that Python's internal representation of text as an
array of code points (rather than an array of "characters" or
"grapheme clusters") is somehow insufficient for text processing.
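
As a small illustration of the "array of code points" model (my own
example, not from the thread): a str indexes code points, so a letter
followed by a combining accent counts as two items until it is
normalized.  The variable names are made up for the sketch.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible
# character, but two code points in the str.
decomposed = 'e\u0301'

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize('NFC', decomposed)

# Iterating over a str yields code points, not grapheme clusters.
code_points = [hex(ord(c)) for c in decomposed]

decomposed_length = len(decomposed)
composed_length = len(composed)
```

The lengths come out as 2 and 1 respectively, and code_points is
['0x65', '0x301'].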

Steve

