[Python-Dev] PEP 540: Add a new UTF-8 mode

Tue Dec 5 16:18:47 EST 2017

I've been discussing this PEP offline with Victor, but he suggested we
should discuss it in public instead.

I am very worried about this long and rambling PEP, and I propose that it
not be accepted without a major rewrite to focus on clarity of the
specification. The "Unicode just works" summary is more a wish than a
proper summary of the PEP.

For others interested in reviewing this, the implementation link is hidden
in the long list of links; it is http://bugs.python.org/issue29240.
<http://bugs.python.org/issue29240>

FWIW the relationship with PEP 538 is also pretty unclear. (Or maybe that's
another case of the forest and the trees.) And that PEP (while already
accepted) also comes across as rambling and vague, and I have no idea what
it actually does. And it seems to mention PEP 540 quite a few times.

So I guess PEP acceptance week is over. :-(

On Tue, Dec 5, 2017 at 7:52 AM, Victor Stinner <victor.stinner at gmail.com>
wrote:

> Hi,
>
> Since it's the PEP Acceptance Week, I try my luck! Here is my very
> long PEP to propose a tiny change. The PEP is very long to explain the
> rationale and limitations.
>
> Inaccurate tl; dr with the UTF-8 mode, Unicode "just works" as expected.
>
> Reminder: INADA Naoki was nominated as the BDFL-Delegate.
>
> https://www.python.org/dev/peps/pep-0540/
>
> Full-text below.
>
> Victor
>
>
> PEP: 540
> Title: Add a new UTF-8 mode
> Version: $Revision$
> Last-Modified: $Date$
> Author: Victor Stinner <victor.stinner at gmail.com>,
>         Nick Coghlan <ncoghlan at gmail.com>
> BDFL-Delegate: INADA Naoki
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 5-January-2016
> Python-Version: 3.7
>
>
> Abstract
> ========
>
> Add a new UTF-8 mode, enabled by default in the POSIX locale, to ignore
> the locale and force the usage of the UTF-8 encoding for external
> operating system interfaces, including the standard IO streams.
>
> Essentially, the UTF-8 mode behaves as Python 2 and other C based
> applications on \*nix systems: it aims to process text as best it can,
> but it errs on the side of producing or propagating mojibake to
> subsequent components in a processing pipeline rather than requiring
> strictly valid encodings at every step in the process.
>
> The UTF-8 mode can be configured as strict to reduce the risk of
> producing or propagating mojibake.
>
> A new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
> variable are added to explicitly control the UTF-8 mode (including
> turning it off entirely, even in the POSIX locale).
>
>
> Rationale
> =========
>
> "It's not a bug, you must fix your locale" is not an acceptable answer
> ----------------------------------------------------------------------
>
> Since Python 3.0 was released in 2008, the usual answer to users getting
> Unicode errors is to ask developers to fix their code to handle Unicode
> properly. Most applications and Python modules were fixed, but users
> kept reporting Unicode errors regularly: see the long list of issues in
> the `Links`_ section below.
>
> In fact, a second class of bugs comes from a locale which is not properly
> configured. The usual answer to such a bug report is: "it is not a bug,
> you must fix your locale".
>
> Technically, the answer is correct, but from a practical point of view,
> the answer is not acceptable. In many cases, "fixing the issue" is a
> hard task. Moreover, sometimes, the usage of the POSIX locale is
> deliberate.
>
> A good example of a concrete issue are build systems which create a
> fresh environment for each build using a chroot, a container, a virtual
> machine or something else to get reproducible builds. Such a setup
> usually uses the POSIX locale.  To get 100% reproducible builds, the
> POSIX locale is a good choice: see the `Locales section of
> reproducible-builds.org
> <https://reproducible-builds.org/docs/locales/>`_.
>
> PEP 538 lists additional problems related to the use of Linux containers to
> run network services and command line applications.
>
> UNIX users don't expect Unicode errors, since the common command lines
> tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors -
> they produce mostly-readable text instead.
>
> These users similarly expect that tools written in Python 3 (including
> those updated from Python 2), continue to tolerate locale
> misconfigurations and avoid bothering them with text encoding details.
> From their point of the view, the bug is not their locale but is
> obviously Python 3 ("Everything else works, including Python 2, so
> what's wrong with Python 3?").
>
> Since Python 2 handles data as bytes, similar to system utilities
> written in C and C++, it's rarer in Python 2 compared to Python 3 to get
> explicit Unicode errors. It also contributes significantly to why many
> affected users perceive Python 3 as the root cause of their Unicode
> errors.
>
> At the same time, the stricter text handling model was deliberately
> introduced into Python 3 to reduce the frequency of data corruption bugs
> arising in production services due to mismatched assumptions regarding
> text encodings.  It's one thing to emit mojibake to a user's terminal
> while listing a directory, but something else entirely to store that in
> a system manifest in a database, or to send it to a remote client
> attempting to retrieve files from the system.
>
> Since different group of users have different expectations, there is no
> silver bullet which solves all issues at once. Last but not least,
> backward compatibility should be preserved whenever possible.
>
> Locale and operating system data
> --------------------------------
>
> .. _operating system data:
>
> Python uses an encoding called the "filesystem encoding" to decide how
> to encode and decode data from/to the operating system:
>
> * file content
> * command line arguments: ``sys.argv``
> * standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
> * environment variables: ``os.environ``
> * filenames: ``os.listdir(str)`` for example
> * pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
> * error messages: ``os.strerror(code)`` for example
> * user and terminal names: ``os``, ``grp`` and ``pwd`` modules
> * host name, UNIX socket path: see the ``socket`` module
> * etc.
>
> At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
> ``LC_CTYPE`` locale and then store the locale encoding as the
> "filesystem error". It's possible to get this encoding using
> ``sys.getfilesystemencoding()``. In the whole lifetime of a Python
> process, the same encoding and error handler are used to encode and
> decode data from/to the operating system.
>
> The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to
> decode and encode operating system data. These functions use the
> filesystem error handler: ``sys.getfilesystemencodeerrors()``.
>
> .. note::
>    In some corner cases, the *current* ``LC_CTYPE`` locale must be used
>    instead of ``sys.getfilesystemencoding()``. For example, the ``time``
>    module uses the *current* ``LC_CTYPE`` locale to decode timezone
>    names.
>
>
> The POSIX locale and its encoding
> ---------------------------------
>
> The following environment variables are used to configure the locale, in
> this preference order:
>
> * ``LC_ALL``, most important variable
> * ``LC_CTYPE``
> * ``LANG``
>
> The POSIX locale, also known as "the C locale", is used:
>
> * if the first set variable is set to ``"C"``
> * if all these variables are unset, for example when a program is
>   started in an empty environment.
>
> The encoding of the POSIX locale must be ASCII or a superset of ASCII.
>
> On Linux, the POSIX locale uses the ASCII encoding.
>
> On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
> the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions
> use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
> ``os.fsencode()`` and ``os.fsdecode()`` use
> ``locale.getpreferredencoding()`` codec. For example, if command line
> arguments are decoded by ``mbstowcs()`` and encoded back by
> ``os.fsencode()``, an ``UnicodeEncodeError`` exception is raised instead
> of retrieving the original byte string.
>
> To fix this issue, Python checks since Python 3.4 if ``mbstowcs()``
> really uses the ASCII encoding if the the ``LC_CTYPE`` uses the the
> POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
> alias to ASCII). If not (the effective encoding is not ASCII), Python
> uses its own ASCII codec instead of using ``mbstowcs()`` and
> ``wcstombs()`` functions for `operating system data`_.
>
> See the `POSIX locale (2016 Edition)
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html
> >`_.
>
>
> POSIX locale used by mistake
> ----------------------------
>
> In many cases, the POSIX locale is not really expected by users who get
> it by mistake. Examples:
>
> * program started in an empty environment
> * User forcing LANG=C to get messages in English
> * LANG=C used for bad reasons, without being aware of the ASCII encoding
> * SSH shell
> * Linux installed with no configured locale
> * chroot environment, Docker image, container, ... with no locale is
>   configured
> * User locale set to a non-existing locale, typo in the locale name for
>   example
>
>
> C.UTF-8 and C.utf8 locales
> --------------------------
>
> Some UNIX operating systems provide a variant of the POSIX locale using
> the UTF-8 encoding:
>
> * Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
> * Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
> * HP-UX: ``"C.utf8"``
>
> It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
> proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.
>
> It is not planned to add such locale to BSD systems.
>
>
> Popularity of the UTF-8 encoding
> --------------------------------
>
> Python 3 uses UTF-8 by default for Python source files.
>
> On Mac OS X, Windows and Android, Python always use UTF-8 for operating
> system data. For Windows, see the `PEP 529`_: "Change Windows filesystem
> encoding to UTF-8".
>
> On Linux, UTF-8 became the de facto standard encoding,
> replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
> using different encodings for filenames and standard streams is likely
> to create mojibake, so UTF-8 is now used *everywhere* (at least for
> modern
> distributions using their default settings).
>
> The UTF-8 encoding is the default encoding of XML and JSON file format.
> In January 2017, UTF-8 was used in `more than 88% of web pages
> <https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
> Javascript, CSS, etc.).
>
> See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
> information on the UTF-8 codec.
>
> .. note::
>    Some applications and operating systems (especially Windows) use Byte
>    Order Markers (BOM) to indicate the used Unicode encoding: UTF-7,
>    UTF-8, UTF-16-LE, etc. BOM are not well supported and rarely used in
>    Python.
>
>
> Old data stored in different encodings and surrogateescape
> ----------------------------------------------------------
>
> Even if UTF-8 became the de facto standard, there are still systems in
> the wild which don't use UTF-8. And there are a lot of data stored in
> different encodings. For example, an old USB key using the ext3
> filesystem with filenames encoded to ISO 8859-1.
>
> The Linux kernel and libc don't decode filenames: a filename is used
> as a raw array of bytes. The common solution to support any filename is
> to store filenames as bytes and don't try to decode them. When displayed
> to stdout, mojibake is displayed if the filename and the terminal don't
> use the same encoding.
>
> Python 3 promotes Unicode everywhere including filenames. A solution to
> support filenames not decodable from the locale encoding was found: the
> ``surrogateescape`` error handler (`PEP 383`_), store undecodable bytes
> as surrogate characters. This error handler is used by default for
> `operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
> example (except on Windows which uses the ``strict`` error handler).
>
>
> Standard streams
> ----------------
>
> Python uses the locale encoding for standard streams: stdin, stdout and
> stderr. The ``strict`` error handler is used by stdin and stdout to
> prevent mojibake.
>
> The ``backslashreplace`` error handler is used by stderr to avoid
> Unicode encode errors when displaying non-ASCII text. It is especially
> useful when the POSIX locale is used, because this locale usually uses
> the ASCII encoding.
>
> The problem is that `operating system data`_ like filenames are decoded
> using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
> filename to stdout raises a Unicode encode error if the filename
> contains an undecoded byte stored as a surrogate character.
>
> Python 3.5+ now uses ``surrogateescape`` for stdin and stdout if the
> POSIX locale is used: `issue #19977
> <http://bugs.python.org/issue19977>`_. The idea is to pass through
> `operating system data`_ even if it means mojibake, because most UNIX
> applications work like that. Such UNIX applications often store
> filenames as bytes, in many cases because their basic design principles
> (or those of the language they're implemented in) were laid down half a
> century ago when it was still a feat for computers to handle English
> text correctly, rather than
> humans having to work with raw numeric indexes.
>
> .. note::
>    The encoding and/or the error handler of standard streams can be
>    overriden with the ``PYTHONIOENCODING`` environment variable.
>
>
> Proposal
> ========
>
> Changes
> -------
>
> Add a new UTF-8 mode, enabled by default in the POSIX locale, but
> otherwise disabled by default, to ignore the locale and force the usage
> of the UTF-8 encoding with the ``surrogateescape`` error handler,
> instead using the locale encoding (with ``strict`` or
> ``surrogateescape`` error handler depending on the case).
>
> The "normal" UTF-8 mode uses ``surrogateescape`` on the standard input
> and output streams and opened files, as well as on all operating
> system interfaces. This is the mode implicitly activated by the POSIX
> locale.
>
> The "strict" UTF-8 mode reduces the risk of producing or propogating
> mojibake: the UTF-8 encoding is used with the ``strict`` error handler
> for inputs and outputs, but the ``surrogateescape`` error handler is
> still used for `operating system data`_. This mode is never activated
> implicitly, but can be requested explicitly.
>
> The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
> variable are added to control the UTF-8 mode.
>
> The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1``.
>
> The UTF-8 Strict mode is configured by ``-X utf8=strict`` or
> ``PYTHONUTF8=strict``.
>
> The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
> can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
>
> Other option values fail with an error.
>
> Options priority for the UTF-8 mode:
>
> * ``PYTHONLEGACYWINDOWSFSENCODING``
> * ``-X utf8``
> * ``PYTHONUTF8``
> * POSIX locale
>
> For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode,
> whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and so
> use the encoding of the POSIX locale.
>
> Encodings used by ``open()``, highest priority first:
>
> * *encoding* and *errors* parameters (if set)
> * UTF-8 mode
> * ``os.device_encoding(fd)``
> * ``os.getpreferredencoding(False)``
>
>
> Encoding and error handler
> --------------------------
>
> The UTF-8 mode changes the default encoding and error handler used by
> ``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
> ``sys.stdout`` and ``sys.stderr``:
>
> ============================  =======================
> ==========================  ==========================
> Function                      Default                  UTF-8 mode or
> POSIX locale  UTF-8 Strict mode
> ============================  =======================
> ==========================  ==========================
> open()                        locale/strict
> **UTF-8/surrogateescape**   **UTF-8**/strict
> os.fsdecode(), os.fsencode()  locale/surrogateescape
> **UTF-8**/surrogateescape   **UTF-8**/surrogateescape
> sys.stdin, sys.stdout         locale/strict
> **UTF-8/surrogateescape**   **UTF-8**/strict
> sys.stderr                    locale/backslashreplace
> **UTF-8**/backslashreplace  **UTF-8**/backslashreplace
> ============================  =======================
> ==========================  ==========================
>
> By comparison, Python 3.6 uses:
>
> ============================  =======================
> ==========================
> Function                      Default                  POSIX locale
> ============================  =======================
> ==========================
> open()                        locale/strict            locale/strict
> os.fsdecode(), os.fsencode()  locale/surrogateescape
>  locale/surrogateescape
> sys.stdin, sys.stdout         locale/strict
> locale/**surrogateescape**
> sys.stderr                    locale/backslashreplace
> locale/backslashreplace
> ============================  =======================
> ==========================
>
> The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
> strict mode for consistency with other standard \*nix operating system
> components: the idea is that data not encoded to UTF-8 are passed through
> "Python" without being modified, as raw bytes.
>
> The ``PYTHONIOENCODING`` environment variable has priority over the
> UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
> python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.
>
> Encoding and error handler on Windows
> -------------------------------------
>
> On Windows, the encodings and error handlers are different:
>
> ============================  =======================
> ==========================  ==========================
> ==========================
> Function                      Default                  Legacy Windows
> FS encoding  UTF-8 mode                  UTF-8 Strict mode
> ============================  =======================
> ==========================  ==========================
> ==========================
> open()                        mbcs/strict              mbcs/strict
>             **UTF-8/surrogateescape**   **UTF-8**/strict
> os.fsdecode(), os.fsencode()  UTF-8/surrogatepass
> **mbcs/replace**            UTF-8/surrogatepass
> UTF-8/surrogatepass
> sys.stdin, sys.stdout         UTF-8/surrogateescape
> UTF-8/surrogateescape       UTF-8/surrogateescape
> **UTF-8/strict**
> sys.stderr                    UTF-8/backslashreplace
> UTF-8/backslashreplace      UTF-8/backslashreplace
> UTF-8/backslashreplace
> ============================  =======================
> ==========================  ==========================
> ==========================
>
> By comparison, Python 3.6 uses:
>
> ============================  =======================
> ==========================
> Function                      Default                  Legacy Windows
> FS encoding
> ============================  =======================
> ==========================
> open()                        mbcs/strict              mbcs/strict
> os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**
> sys.stdin, sys.stdout         UTF-8/surrogateescape
> UTF-8/surrogateescape
> sys.stderr                    UTF-8/backslashreplace
>  UTF-8/backslashreplace
> ============================  =======================
> ==========================
>
> The "Legacy Windows FS encoding" is enabled by setting the
> ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable to ``1`` as
> specified in `PEP 529` .
>
> Enabling the legacy Windows filesystem encoding disables the UTF-8 mode
> (as ``-X utf8=0``).
>
> If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
> ``sys.output`` use ``mbcs`` encoding by default rather than UTF-8. But
> with the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the
> UTF-8 encoding.
>
> There is no POSIX locale on Windows. The ANSI code page is used to the
> locale encoding, and this code page never uses the ASCII encoding.
>
>
> Rationale
> ---------
>
> The UTF-8 mode is disabled by default to keep hard Unicode errors when
> encoding or decoding `operating system data`_ failed, and to keep the
> backward compatibility. The user is responsible to enable explicitly the
> UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
> mode would be enabled *by default*.
>
> The UTF-8 mode should be used on systems known to be configured with
> UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
> the user overrides a locale *by mistake* or if a Python program is
> started with no locale configured (and so with the POSIX locale).
>
> Most UNIX applications handle `operating system data`_ as bytes, so
> ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
> limited impact on how these data are handled by the application.
>
> The Python UTF-8 mode should help to make Python more interoperable with
> the  other UNIX applications in the system assuming that *UTF-8* is used
> everywhere and that users *expect* UTF-8.
>
> Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
> Python is more convenient, since they are more commonly misconfigured
> *by mistake* (configured to use an encoding different than UTF-8,
> whereas the system uses UTF-8), rather than being misconfigured by
> intent.
>
> Expected mojibake and surrogate character issues
> ------------------------------------------------
>
> The UTF-8 mode only affects code running directly in Python, especially
> code written in pure Python. The other code, called "external code"
> here, is not aware of this mode. Examples:
>
> * C libraries called by Python modules like OpenSSL
> * The application code when Python is embedded in an application
>
> In the UTF-8 mode, Python uses the ``surrogateescape`` error handler
> which stores bytes not decodable from UTF-8 as surrogate characters.
>
> If the external code uses the locale and the locale encoding is UTF-8,
> it should work fine.
>
> External code using bytes
> ^^^^^^^^^^^^^^^^^^^^^^^^^
>
> If the external code processes data as bytes, surrogate characters are
> not an issue since they are only used inside Python. Python encodes back
> surrogate characters to bytes at the edges, before calling external
> code.
>
> The UTF-8 mode can produce mojibake since Python and external code don't
> both of invalid bytes, but it's a deliberate choice. The UTF-8 mode can
> be configured as strict to prevent mojibake and fail early when data
> is not decodable from UTF-8 or not encodable to UTF-8.
>
> External code using text
> ^^^^^^^^^^^^^^^^^^^^^^^^
>
> If the external code uses text API, for example using the ``wchar_t*`` C
> type, mojibake should not occur, but the external code can fail on
> surrogate characters.
>
>
> Use Cases
> =========
>
> The following use cases were written to help to understand the impact of
> chosen encodings and error handlers on concrete examples.
>
> The "Exception?" column shows the potential benefit of having a UTF-8
> mode which is closer to the traditional Python 2 behaviour of passing
> along raw binary data even if it isn't valid UTF-8.
>
> The "Mojibake" column shows that ignoring the locale causes a practical
> issue: the UTF-8 mode produces mojibake if the terminal doesn't use the
> UTF-8 encoding.
>
> The ideal configuration is "No exception, no risk of mojibake", but that
> isn't always possible in the presence of non-UTF-8 encoded binary data.
>
> List a directory into stdout
> ----------------------------
>
> Script listing the content of the current directory into stdout::
>
>     import os
>     for name in os.listdir(os.curdir):
>         print(name)
>
> Result:
>
> ========================  ==========  =========
> Python                    Exception?  Mojibake?
> ========================  ==========  =========
> Python 2                  No          **Yes**
> Python 3                  **Yes**     No
> Python 3.5, POSIX locale  No          **Yes**
> UTF-8 mode                No          **Yes**
> UTF-8 Strict mode         **Yes**     No
> ========================  ==========  =========
>
> "Exception?" means that the script can fail on decoding or encoding a
> filename depending on the locale or the filename.
>
> To be able to never fail that way, the program must be able to produce
> mojibake.  For automated and interactive process, mojibake is often more
> user friendly than an error with a truncated or empty output, since it
> confines the problem to the affected entry, rather than aborting the
> whole task.
>
> Example with a directory which contains the file called ``b'xxx\xff'``
> (the byte ``0xFF`` is invalid in UTF-8).
>
> Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
>
>     $ python3.7 ../ls.py
>     Traceback (most recent call last):
>       File "../ls.py", line 5, in <module>
>         print(name)
>     UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
>
>     $ python3.7 -X utf8=strict ../ls.py
>     Traceback (most recent call last):
>       File "../ls.py", line 5, in <module>
>         print(name)
>     UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
>
> The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
> but display mojibake::
>
>     $ python3.7 -X utf8 ../ls.py
>     xxx�
>
>     $ LC_ALL=C /python3.6 ../ls.py
>     xxx�
>
>     $ python2 ../ls.py
>     xxx�
>
>     $ ls
>     'xxx'$'\377'
>
>
> List a directory into a text file
> ---------------------------------
>
> Similar to the previous example, except that the listing is written into
> a text file::
>
>     import os
>     names = os.listdir(os.curdir)
>     with open("/tmp/content.txt", "w") as fp:
>         for name in names:
>             fp.write("%s\n" % name)
>
> Result:
>
> ========================  ==========  =========
> Python                    Exception?  Mojibake?
> ========================  ==========  =========
> Python 2                  No          **Yes**
> Python 3                  **Yes**     No
> Python 3.5, POSIX locale  **Yes**     No
> UTF-8 mode                No          **Yes**
> UTF-8 Strict mode         **Yes**     No
> ========================  ==========  =========
>
> Again, never throwing an exception requires that mojibake can be
> produced, while preventing mojibake means that the script can fail on
> decoding or encoding a filename depending on the locale or the filename.
> Typical error::
>
>     $ LC_ALL=C python3 test.py
>     Traceback (most recent call last):
>       File "test.py", line 5, in <module>
>         fp.write("%s\n" % name)
>     UnicodeEncodeError: 'ascii' codec can't encode characters in
> position 0-1: ordinal not in range(128)
>
> Compared with native system tools::
>
>     $ ls > /tmp/content.txt
>     $ cat /tmp/content.txt
>     xxx�
>
>
> Display Unicode characters into stdout
> --------------------------------------
>
> Very basic example used to illustrate a common issue, display the euro
> sign (U+20AC: €)::
>
>     print("euro: \u20ac")
>
> Result:
>
> ========================  ==========  =========
> Python                    Exception?  Mojibake?
> ========================  ==========  =========
> Python 2                  **Yes**     No
> Python 3                  **Yes**     No
> Python 3.5, POSIX locale  **Yes**     No
> UTF-8 mode                No          **Yes**
> UTF-8 Strict mode         No          **Yes**
> ========================  ==========  =========
>
> The UTF-8 and UTF-8 Strict modes will always encode the euro sign as
> UTF-8. If the terminal uses a different encoding, we get mojibake.
>
> For example, using ``iconv`` to emulate a GB-18030 terminal inside a
> UTF-8 one::
>
>     $ python3 -c 'print("euro: \u20ac")' | iconv -f gb18030 -t utf8
>     euro: 鈧iconv: illegal input sequence at position 8
>
> The misencoding also corrupts the trailing newline such that the output
> stream isn't actually a valid GB-18030 sequence, hence the error message
> after the euro symbol is misinterpreted as a hanzi character.
>
>
> Replace a word in a text
> ------------------------
>
> The following script replaces the word "apple" with "orange". It
> reads input from stdin and writes the output into stdout::
>
>     import sys
>     text = sys.stdin.read()
>     sys.stdout.write(text.replace("apple", "orange"))
>
> Result:
>
> ========================  ==========  =========
> Python                    Exception?  Mojibake?
> ========================  ==========  =========
> Python 2                  No          **Yes**
> Python 3                  **Yes**     No
> Python 3.5, POSIX locale  No          **Yes**
> UTF-8 mode                No          **Yes**
> UTF-8 Strict mode         **Yes**     No
> ========================  ==========  =========
>
> This is a case where passing along the raw bytes (by way of the
> ``surrogateescape`` error handler) will bring Python 3's behaviour back
> into line with standard operating system tools like ``sed`` and ``awk``.
>
>
> Producer-consumer model using pipes
> -----------------------------------
>
> Let's say that we have a "producer" program which writes data into its
> stdout and a "consumer" program which reads data from its stdin.
>
> On a shell, such programs are run with the command::
>
>     producer | consumer
>
> The question if these programs will work with any data and any locale.
> UNIX users don't expect Unicode errors, and so expect that such programs
> "just works", in the sense that Unicode errors may cause problems in the
> data stream, but won't cause the entire stream processing *itself* to
> abort.
>
> If the producer only produces ASCII output, no error should occur. Let's
> say that the producer writes at least one non-ASCII character (at least
> one byte in the range ``0x80..0xff``).
>
> To simplify the problem, let's say that the consumer has no output
> (doesn't write results into a file or stdout).
>
> A "Bytes producer" is an application which cannot fail with a Unicode
> error and produces bytes into stdout.
>
> Let's say that a "Bytes consumer" does not decode stdin but stores data
> as bytes: such consumer always work. Common UNIX command line tools like
> ``cat``, ``grep`` or ``sed`` are in this category. Many Python 2
> applications are also in this category, as are applications that work
> with the lower level binary input and output stream in Python 3 rather
> than the default text mode streams.
>
> "Python producer" and "Python consumer" are producer and consumer
> implemented in Python using the default text mode input and output
> streams.
>
> Bytes producer, Bytes consumer
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> This won't through exceptions, but it is out of the scope of this PEP
> since it doesn't involve Python's default text mode input and output
> streams.
>
> Python producer, Bytes consumer
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Python producer::
>
>     print("euro: \u20ac")
>
> Result:
>
> ========================  ==========  =========
> Python                    Exception?  Mojibake?
> ========================  ==========  =========
> Python 2                  **Yes**     No
> Python 3                  **Yes**     No
> Python 3.5, POSIX locale  **Yes**     No
> UTF-8 mode                No          **Yes**
> UTF-8 Strict mode         No          **Yes**
> ========================  ==========  =========
>
> The question here is not if the consumer is able to decode the input,
> but if Python is able to produce its output. So it's similar to the
> `Display Unicode characters into stdout`_ case.
>
> UTF-8 modes work with any locale since the consumer doesn't try to
> decode its stdin.
>
> Bytes producer, Python consumer
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Python consumer::
>
>     import sys
>     text = sys.stdin.read()
>     result = text.replace("apple", "orange")
>     # ignore the result
>
> Result:
>
> ========================  ==========  =========
> Python                    Exception?  Mojibake?
> ========================  ==========  =========
> Python 2                  No          **Yes**
> Python 3                  **Yes**     No
> Python 3.5, POSIX locale  No          **Yes**
> UTF-8 mode                No          **Yes**
> UTF-8 Strict mode         **Yes**     No
> ========================  ==========  =========
>
> Python 3 may throw an exception on decoding stdin depending on the input
> and the locale.
>
>
> Python producer, Python consumer
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Python producer::
>
>     print("euro: \u20ac")
>
> Python consumer::
>
>     import sys
>     text = sys.stdin.read()
>     result = text.replace("apple", "orange")
>     # ignore the result
>
> Result, same Python version used for the producer and the consumer:
>
> ========================  ==========  =========
> Python                    Exception?  Mojibake?
> ========================  ==========  =========
> Python 2                  **Yes**     No
> Python 3                  **Yes**     No
> Python 3.5, POSIX locale  **Yes**     No
> UTF-8 mode                No          No(!)
> UTF-8 Strict mode         No          No(!)
> ========================  ==========  =========
>
> This case combines a Python producer with a Python consumer, and the
> result is mainly the same as that for `Python producer, Bytes
> consumer`_, since the consumer can't read what the producer can't emit.
>
> However, the behaviour of the "UTF-8" and "UTF-8 Strict" modes in this
> configuration is notable: they don't produce an exception, *and* they
> shouldn't produce mojibake, as both the producer and the consumer are
> making *consistent* assumptions regarding the text encoding used on the
> pipe between them (i.e. UTF-8).
>
> Any mojibake generated would only be in the interfaces bween the
> consuming component and the outside world (e.g. the terminal, or when
> writing to a file).
>
> Backward Compatibility
> ======================
>
> The main backward incompatible change is that the UTF-8 encoding is now
> used by default if the locale is POSIX. Since the UTF-8 encoding is used
> with the ``surrogateescape`` error handler, encoding errors should not
> occur and so the change should not break applications.
>
> The UTF-8 encoding is also quite restrictive regarding where it allows
> plain ASCII code points to appear in the byte stream, so even for
> ASCII-incompatible encodings, such byte values will often be escaped
> rather than being processed as ASCII characters.
>
> The more likely source of trouble comes from external libraries. Python
> can decode successfully data from UTF-8, but a library using the locale
> encoding can fail to encode the decoded text back to bytes. For example,
> GNU readline currently has problems on Android due to the mismatch
> between CPython's encoding assumptions there (always UTF-8) and GNU
> readline's encoding assumptions (which are based on the nominal locale).
>
> The PEP only changes the default behaviour if the locale is POSIX. For
> other locales, the *default* behaviour is unchanged.
>
> PEP 538 is a follow-up to this PEP that extends CPython's assumptions to
> other locale-aware components in the same process by explicitly coercing
> the POSIX locale to something more suitable for modern text processing.
> See that PEP for further details.
>
>
> Alternatives
> ============
>
> Don't modify the encoding of the POSIX locale
> ---------------------------------------------
>
> A first version of the PEP did not change the encoding and error handler
> used of the POSIX locale.
>
> The problem is that adding the ``-X utf8`` command line option or
> setting the ``PYTHONUTF8`` environment variable is not possible in some
> cases, or at least not convenient.
>
> Moreover, many users simply expect that Python 3 behaves as Python 2:
> don't bother them with encodings and "just works" in all cases. These
> users don't worry about mojibake, or even expect mojibake because of
> complex documents using multiple incompatibles encodings.
>
>
> Always use UTF-8
> ----------------
>
> Python already always uses the UTF-8 encoding on Mac OS X, Android and
> Windows.  Since UTF-8 became the de facto encoding, it makes sense to
> always use it on all platforms with any locale.
>
> The problem with this approach is that Python is also used extensively
> in desktop environments, and it is often a practical or even legal
> requirement to support locale encoding other than UTF-8 (for example,
> GB-18030 in China, and Shift-JIS or ISO-2022-JP in Japan)
>
> Force UTF-8 for the POSIX locale
> --------------------------------
>
> An alternative to always using UTF-8 in any case is to only use UTF-8
> when the ``LC_CTYPE`` locale is the POSIX locale.
>
> The `PEP 538`_ "Coercing the legacy C locale to C.UTF-8" of  Nick
> Coghlan proposes to implement that using the ``C.UTF-8`` locale.
>
>
> Use the strict error handler for operating system data
> ------------------------------------------------------
>
> Using the ``surrogateescape`` error handler for `operating system data`_
> creates surprising surrogate characters. No Python codec (except of
> ``utf-7``) accept surrogates, and so encoding text coming from the
> operating system is likely to raise an error error. The problem is that
> the error comes late, very far from where the data was read.
>
> The ``strict`` error handler can be used instead to decode
> (``os.fsdecode()``) and encode (``os.fsencode()``) operating system
> data, to raise encoding errors as soon as possible. It helps to find
> bugs more quickly.
>
> The main drawback of this strategy is that it doesn't work in practice.
> Python 3 is designed on top on Unicode strings. Most functions expect
> Unicode and produce Unicode. Even if many operating system functions
> have two flavors, bytes and Unicode, the Unicode flavor is used in most
> cases. There are good reasons for that: Unicode is more convenient in
> Python 3 and using Unicode helps to support the full Unicode Character
> Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6,
> see the `PEP 528`_ and the `PEP 529`_).
>
> For example, if ``os.fsdecode()`` uses ``utf8/strict``,
> ``os.listdir(str)`` fails to list filenames of a directory if a single
> filename is not decodable from UTF-8. As a consequence,
> ``shutil.rmtree(str)`` fails to remove a directory. Undecodable
> filenames, environment variables, etc. are simply too common to make
> this alternative viable.
>
>
> Links
> =====
>
> PEPs:
>
> * `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
>   "Coercing the legacy C locale to C.UTF-8"
> * `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
>   "Change Windows filesystem encoding to UTF-8"
> * `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
>   "Change Windows console encoding to UTF-8"
> * `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
>   "Non-decodable Bytes in System Character Interfaces"
>
> Main Python issues:
>
> * `Issue #29240: Implementation of the PEP 540: Add a new UTF-8 mode
>   <http://bugs.python.org/issue29240>`_
> * `Issue #28180: sys.getfilesystemencoding() should default to utf-8
>   <http://bugs.python.org/issue28180>`_
> * `Issue #19977: Use "surrogateescape" error handler for sys.stdin and
>   sys.stdout on UNIX for the C locale
>   <http://bugs.python.org/issue19977>`_
> * `Issue #19847: Setting the default filesystem-encoding
>   <http://bugs.python.org/issue19847>`_
> * `Issue #8622: Add PYTHONFSENCODING environment variable
>   <https://bugs.python.org/issue8622>`_: added but reverted because of
>   many issues, read the `Inconsistencies if locale and filesystem
>   encodings are different
>   <https://mail.python.org/pipermail/python-dev/2010-October/104509.html
> >`_
>   thread on the python-dev mailing list
>
> Incomplete list of Python issues related to Unicode errors, especially
> with the POSIX locale:
>
> * 2016-12-22: `LANG=C python3 -c "import os; os.path.exists('\xff')"
>   <http://bugs.python.org/issue29042#msg283821>`_
> * 2014-07-20: `issue #22016: Add a new 'surrogatereplace' output only
>   error handler <http://bugs.python.org/issue22016>`_
> * 2014-04-27: `Issue #21368: Check for systemd locale on startup if
>   current locale is set to POSIX <http://bugs.python.org/issue21368>`_
>   -- read manually /etc/locale.conf when the locale is POSIX
> * 2014-01-21: `Issue #20329: zipfile.extractall fails in Posix shell
>   with utf-8 filename <http://bugs.python.org/issue20329>`_
> * 2013-11-30: `Issue #19846: Python 3 raises Unicode errors with the C
> locale
>   <http://bugs.python.org/issue19846>`_
> * 2010-05-04: `Issue #8610: Python3/POSIX:  errors if file system
>   encoding is None <http://bugs.python.org/issue8610>`_
> * 2013-08-12: `Issue #18713: Clearly document the use of
>   PYTHONIOENCODING to set surrogateescape
>   <http://bugs.python.org/issue18713>`_
> * 2013-09-27: `Issue #19100: Use backslashreplace in pprint
>   <http://bugs.python.org/issue19100>`_
> * 2012-01-05: `Issue #13717: os.walk() + print fails with
> UnicodeEncodeError
>   <http://bugs.python.org/issue13717>`_
> * 2011-12-20: `Issue #13643: 'ascii' is a bad filesystem default encoding
>   <http://bugs.python.org/issue13643>`_
> * 2011-03-16: `issue #11574: TextIOWrapper should use UTF-8 by default
>   for the POSIX locale <http://bugs.python.org/issue11574>`_, thread on
>   python-dev: `Low-Level Encoding Behavior on Python 3
>   <https://mail.python.org/pipermail/python-dev/2011-March/109361.html>`_
> * 2010-04-26: `Issue #8533: regrtest: use backslashreplace error handler
>   for stdout <http://bugs.python.org/issue8533>`_, regrtest fails with
>   Unicode encode error if the locale is POSIX
>
> Some issues are real bugs in applications which must explicitly set the
> encoding. Well, it just works in the common case (locale configured
> correctly), so what? The program "suddenly" fails when the POSIX
> locale is used (probably for bad reasons). Such bugs are not well
> understood by users. Example of such issues:
>
> * 2013-11-21: `pip: open() uses the locale encoding to parse Python
>   script, instead of the encoding cookie
>   <http://bugs.python.org/issue19685>`_ -- pip must use the encoding
>   cookie to read a Python source code file
> * 2011-01-21: `IDLE 3.x can crash decoding recent file list
>   <http://bugs.python.org/issue10974>`_
>
>
> Prior Art
> =========
>
> Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
> variable to force UTF-8: see `perlrun
> <http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
> UTF-8 per standard stream, on input and output streams, etc.
>
>
> Post History
> ============
>
> * 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
>   540 (assuming UTF-8 for *nix system boundaries)
>   <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
> * 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
>   <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html
> >`_
> * 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
>   C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
> * 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows
>   to UTF-8 (msg272916) <https://bugs.python.org/issue27781#msg272916>`_
>   -- Victor proposed ``-X utf8`` for the :pep:`529` (Change Windows
>   filesystem encoding to UTF-8)
>
>
> Copyright
> =========
>
> This document has been placed in the public domain.
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/
> guido%40python.org
>

-- 
--Guido van Rossum (python.org/~guido)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20171205/60b34643/attachment-0001.html>