[issue28180] sys.getfilesystemencoding() should default to utf-8

Thu Jan 5 06:11:53 EST 2017

STINNER Victor added the comment:

Sorry, I still haven't had enough time to read PEP 538 carefully. But since the discussion has already started on this issue, I will add my comments:

* I'm sure that many Linux, UNIX and BSD systems don't have the "C.UTF-8" locale. For example, HP-UX has "C.utf8" which is not exactly "C.UTF-8".

* Setting the locale has an impact on all libraries running in the Python process. At this point, I'm not sure that it is what we want.

* I'm not sure that it's ok in 2017 to always force the UTF-8 encoding if the user locale uses a different encoding. I had the same concern with PEP 528 (Change Windows console encoding to UTF-8) and PEP 529 (Change Windows filesystem encoding to UTF-8) on Windows, but these PEPs were approved and merged into Python 3.6. My fear is obviously mojibake when exchanging data with other applications that still use the other encoding, the locale encoding. Other applications are not impacted by setlocale() in the Python process.

* I proposed an opt-in option to force UTF-8: the -X utf8 command line option and the PYTHONUTF8=1 environment variable. Opt-in will obviously reduce the risk of backward compatibility issues. With an opt-in option, users are better prepared for mojibake issues.

* I dislike "Backporting to earlier Python 3 releases". In my experience, changes to how Python handles text (encodings, codecs, etc.) always have subtle issues, and users dislike getting backward incompatible changes in minor releases. *Maybe* if the option is opt-in, the risk is lower and acceptable?

* I dislike that Fedora carries such a downstream change. I would prefer to decide upstream how to slowly make UTF-8 a first-class citizen in Python. Otherwise, Fedora would behave differently from other Linux distributions, and it can be painful to write applications that behave the same on all Linux distributions. But I also understand that Fedora sometimes has to move faster than the slow CPython project :-) Fedora can also be seen as a toy to experiment with changes quickly, which helps to provide wide feedback upstream so that better decisions can be made.

* Choosing between the strict and surrogateescape error handlers is a very important decision with a wide impact. If we use utf8 by default (PEP 538), people will probably complain less if Python magically passes undecodable bytes through thanks to surrogateescape. If the option is opt-in, strict may make sense. But surrogateescape is maybe still more "convenient". I don't know at this point.
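
To make the strict versus surrogateescape trade-off and the mojibake risk concrete, here is a minimal sketch. The sample string and the variable names are arbitrary illustrations, not taken from PEP 538; it only uses the standard bytes.decode() and str.encode() error handlers:

```python
# Arbitrary example: the text "cafe" with an acute accent on the "e",
# encoded as Latin-1. The resulting bytes are not valid UTF-8.
raw = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

# strict: decoding as UTF-8 fails immediately with an exception.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape: the undecodable byte 0xE9 is mapped to the lone
# surrogate U+DCE9 instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler gives back the original bytes unchanged.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# Mojibake: UTF-8 bytes read by an application that assumes Latin-1.
# The single accented character shows up as two unrelated characters.
mojibake = "caf\u00e9".encode("utf-8").decode("latin-1")
```

With strict the problem is reported at decoding time; with surrogateescape the bytes pass through Python unchanged, but the string contains a lone surrogate that cannot be encoded with the strict handler later.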

Nick: it seems like you have a well-defined plan. But I dislike it on multiple points. I don't know if it's better to try to convince you to change your PEP, or to write a different PEP.

I have been planning to write such a "UTF-8" PEP since 2015, but I never started because the scope is so large that I fear all the tiny but annoying corner cases...

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue28180>
_______________________________________