|Title:||Add PYTHONTEXTENCODING environment variable|
|Author:||Inada Naoki <songofacandy at gmail.com>|
- Reference Implementation
- Rejected Ideas
- Open Issues
Currently, TextIOWrapper uses locale.getpreferredencoding(False) (hereinafter called "locale encoding") when encoding is not specified.
This PEP proposes adding PYTHONTEXTENCODING environment variable to override the default text encoding since Python 3.9.
The goal of this PEP is providing "UTF-8 by default" experience to Windows users, because macOS, Linux, Android, iOS users use UTF-8 by default already.
String in Python 3 is unicode. Encoding valid unicode strings with UTF-8 should not fail.
On the other hand, most locale encoding used in Windows can not save all valid unicode string. It will cause UnicodeEncodeError or it may not round-trip. User may lost their data in such case.
UTF-8 is the best encoding for saving text when user don't specify any encoding.
Python is one of the most popular first programming languages.
New programmers may not know about encoding. When they download text data written in UTF-8 from the Internet, they are forced to learn about encoding.
Popular text editors like VS Code or Atom use UTF-8 by default. Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019 Update. (Note that Python 3.9 will be released in 2021.)
Additionally, the default encoding of Python source files is UTF-8. We can assume new Python programmers who don't know about encoding use editors which use UTF-8 by default.
It would be nice if new programmers are not forced to learn about encoding until they need to handle text files encoded in encoding other than UTF-8.
Package authors using macOS or Linux may forget that the default encoding is not always UTF-8.
For example, long_description = open("README.md").read() in setup.py is a common mistake. If there is at least one emoji or any other non-ASCII character in the README.md file, many Windows users cannot install the package due to a UnicodeDecodeError.
Python has sys.defaultencoding() which is always "UTF-8". str.encode() uses "UTF-8" when encoding is omitted.
Using "UTF-8" for text files are consistent with it. It makes Python more easy to learn language.
PYTHONTEXTENCODING environment variable can be used to specify the default text encoding.
Unlike PYTHONIOENCODING, it doesn't accept error handler. PYTHONIOENCODING support it because changing error handler of stdio was difficult. But it is not true for regular files.
When PYTHONTEXTENCODING is specified, this function return it.
When it is not specified, this function returns locale.getpreferredencoding(False).
TextIOWrapper now accepts encoding="locale" option.
"locale" is not real encoding or alias. This is just a shortcut of encoding=locale.getpreferredencoding(False).
TextIOWrapper uses sys.gettextencoding() where locale.getpreferredencoding(False) is used.
But stdin, stdout, and stderr continue to respect locale encoding as well. PYTHONIOENCODING can be used to override their encoding.
Pipes and TTY should use the "locale" encoding. UTF-8 mode  can be used to override these encoding:
- subprocess and os.popen use the "locale" encoding because the subprocess will use the locale encoding.
- getpass.getpass uses the "locale" encoding when using TTY.
All other code using the default encoding are not modified. They can be overridden by PYTHONTEXTENCODING. This is an incomplete list:
- lzma.open, gzip.open, bz2.open, ZipFile.read_text
- tempfile.TemporaryFile, tempfile.NamedTemporaryFile
This PEP is not mutually exclusive to UTF-8 mode.
If we enable UTF-8 mode by default, even people using Windows will forget the default encoding is not always UTF-8. More scripts will be written assuming the default encoding is UTF-8.
So changing the default encoding of text files to UTF-8 would be better even if UTF-8 mode is enabled by default at some point.
For backward compatibility, io.TextIOWrapper calls locale.getpreferredencoding(False) every time when encoding="locale" is specified.
It respects changing locale after Python startup.
Even when the locale encoding is not UTF-8, there can be many UTF-8 text files. These files could be downloaded from the Internet or written by modern text editors.
On the other hand, terminal encoding is assumed to be the same as locale encoding. And other tools are assumed to read and write the locale encoding as well.
std(in|out|err) are likely to be connected to a terminal or other tools. So the locale encoding should be respected.
Anyway, PYTHONIOENCODING can be used to change these encodings.
To be written.
Previous version of this PEP tried to change the default encoding to UTF-8.
But we should have deprecation period long enough. Between the deprecation period, users can not change the default text encoding.
And there are many difficulty there:
- Omitting encoding option is very common.
- If we raise DeprecationWarning always, it will be too noisy.
- We can not assume how user use it. Complicated heuristics may be needed to raise DeprecationWarning only when it is really needed.
- Users of legacy systems may dismiss warning.
- They may not check the warning.
- They may upgrade Python from 2.7 after 2020.
Additionally, Microsoft is improving UTF-8 support of Windows 10 recently.
There are no public plan for future UTF-8 support yet. But Python may be able to change the default encoding without painful deprecation period in the future.
UTF-8 is the best encoding for new users. But setting environment variables is not easy enough to new users.
It would be helpfule if Python on Windows can provide easy way to set PYTHONTEXTENCODING=UTF-8 even after Python is installed.
If there is reasonable use case for changing default text encoding per process, command line option should be considered.
The default text encoding should be able to configured from C. This will be considered when writing reference Implementation.
Additionally, C-API like PySys_GetTextEncoding() should be considered too.
This document has been placed in the public domain.