[issue28180] sys.getfilesystemencoding() should default to utf-8

Sat Jan 7 17:20:07 EST 2017

Sworddragon added the comment:

> $ cat badfilename.py 
> badfn = "こんにちは".encode('euc-jp').decode('utf-8', 'surrogateescape')
> print("bad filename:", badfn)
>
> $ PYTHONIOENCODING=utf-8:backslashreplace python3 badfilename.py 
> bad filename: \udca4\udcb3\udca4\udcf3\udca4ˤ\udcc1\udca4\udccf
>
> $ PYTHONIOENCODING=utf-8:surrogateescape python3 badfilename.py 
> bad filename: �����ˤ���

The first example is still readable (but effectively for an user not so much) while the second example appears to be not readable anymore at all. But the second example is actually technically still readable and there is no data loss, isn't it? As in this case it would probably not speak against surrogateescape for sys.stderr in UTF-8 non-strict mode. Otherwise backslashescape might be indeed the better choice.

I have thought about this a bit more and in case we go PEP 538 with keeping strict errors more or less the old way there might be another solution that could improve the overall issue: print() could get an option to allow changing the error handler on demand (with 'strict' still being the default).

Most things that I do output with print() are deterministic or optional and not important application data. Being able to print this information without caring for de-/encoding errors would mitigate this issue. In case application data is being printed where data loss is not desired exceptions can still be thrown.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue28180>
_______________________________________