[Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

Fri Dec 8 00:02:23 EST 2017

Looks nice.

But I want to clarify the difference/relationship between PEP
538 and PEP 540.

If I understand correctly:

Both PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) share the
same logic to detect the POSIX locale.

When the POSIX locale is detected, locale coercion is tried first. If
locale coercion succeeds, UTF-8 mode is not used because the locale is
not POSIX anymore.

If locale coercion is disabled or fails, UTF-8 mode is used automatically,
unless it is disabled explicitly.

UTF-8 mode is similar to C.UTF-8 or the other locale coercion target locales.
But UTF-8 mode differs from the C.UTF-8 locale in these ways, because the
actual locale is not changed:

* Libraries using the locale (e.g. readline) work as in the POSIX locale.  So
  UTF-8 cannot be used in such libraries.
* locale.getpreferredencoding() returns 'ASCII' instead of 'UTF-8'.  So
  libraries depending on locale.getpreferredencoding() may raise
  UnicodeErrors.

Am I correct?
Or does locale.getpreferredencoding() return UTF-8 in UTF-8 mode too?
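
To check this on a given build, a small script like the one below could
print the values in question. It is only a sketch: describe_encodings() is
a name made up for illustration, and it assumes the interpreter exposes the
mode as sys.flags.utf8_mode (as CPython 3.7 and later do), falling back to
None when that attribute is missing.

```python
import locale
import sys


def describe_encodings():
    """Collect the values that may differ between locale coercion and UTF-8 mode."""
    return {
        # 1 when the UTF-8 mode is active; None if the attribute is missing
        # (assumption: interpreters without this PEP do not define it).
        "utf8_mode": getattr(sys.flags, "utf8_mode", None),
        # The encoding that locale-dependent libraries are likely to pick up.
        "preferred_encoding": locale.getpreferredencoding(False),
        # The encoding used by os.fsdecode() and os.fsencode().
        "filesystem_encoding": sys.getfilesystemencoding(),
        # sys.stdout may be replaced or None in embedded setups.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

Running it with LC_ALL=C, and then again with PYTHONCOERCECLOCALE=0 and/or
PYTHONUTF8=0 set, should show which of the two mechanisms took effect.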

INADA Naoki  <songofacandy at gmail.com>

On Fri, Dec 8, 2017 at 9:50 AM, Victor Stinner <victor.stinner at gmail.com> wrote:
> Hi,
>
> I made the following two changes to the PEP 540:
>
> * open() error handler remains "strict"
> * remove the "Strict UTF-8 mode" which doesn't make much sense anymore
>
> I wrote the Strict UTF-8 mode when open() used the surrogateescape error
> handler in the UTF-8 mode. I don't think that a Strict UTF-8 mode is
> required just to change the error handler of stdin and stdout. Well,
> read the "Passthrough undecodable bytes: surrogateescape" section of
> the PEP rationale :-)
>
>
> https://www.python.org/dev/peps/pep-0540/
>
> Victor
>
>
> PEP: 540
> Title: Add a new UTF-8 mode
> Version: $Revision$
> Last-Modified: $Date$
> Author: Victor Stinner <victor.stinner at gmail.com>
> BDFL-Delegate: INADA Naoki
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 5-January-2016
> Python-Version: 3.7
>
>
> Abstract
> ========
>
> Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
> change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
> This mode is enabled by default in the POSIX locale, but otherwise
> disabled by default.
>
> The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
> variable are added to control the UTF-8 mode.
>
>
> Rationale
> =========
>
> Locale encoding and UTF-8
> -------------------------
>
> Python 3.6 uses the locale encoding for filenames, environment
> variables, standard streams, etc. The locale encoding is inherited from
> the locale; the encoding and the locale are tightly coupled.
>
> Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
> locale, but are unable to change the locale for various reasons. This
> encoding is very limited in terms of Unicode support: any non-ASCII
> character is likely to cause trouble.
>
> It is not easy to get the expected locale. Locales don't have exactly
> the same names on all Linux distributions, FreeBSD, macOS, etc. Some
> locales, like the recent ``C.UTF-8`` locale, are only supported by a few
> platforms. For example, an SSH connection can use a different encoding
> than the filesystem or terminal encoding of the local host.
>
> On the other hand, Python 3.6 is already using UTF-8 by default on
> macOS, Android and Windows (PEP 529) for most functions, except for
> ``open()``. UTF-8 is also the default encoding of Python scripts, XML
> and JSON file formats. The Go programming language uses UTF-8 for
> strings.
>
> When all data are stored as UTF-8 but the locale is often misconfigured,
> an obvious solution is to ignore the locale and use UTF-8.
>
> PEP 538 attempts to mitigate this problem by coercing the C locale
> to a UTF-8 based locale when one is available, but that isn't a
> universal solution. For example, CentOS 7's container images default
> to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's
> locale coercion is ineffective.
>
>
> Passthrough undecodable bytes: surrogateescape
> ----------------------------------------------
>
> When decoding bytes from UTF-8 using the ``strict`` error handler, which
> is the default, Python 3 raises a ``UnicodeDecodeError`` on the first
> undecodable byte.
>
> Unix command line tools like ``cat`` or ``grep`` and most Python 2
> applications simply do not have this class of bugs: they don't decode
> data, but process data as a raw bytes sequence.
>
> Python 3 already has a solution to behave like Unix tools and Python 2:
> the ``surrogateescape`` error handler (:pep:`383`). It allows processing
> data "as bytes" while using Unicode in practice (undecodable bytes are
> stored as surrogate characters).
>
> The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
> and ``stdout`` since these streams are commonly associated with Unix
> command line tools.
>
> However, users have different expectations for files. Files are expected
> to be properly encoded. Python is expected to fail early when ``open()``
> is called with the wrong options, like opening a JPEG picture in text
> mode. The ``open()`` default error handler remains ``strict`` for these
> reasons.
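
The difference between the two error handlers can be sketched with plain
bytes.decode() calls. This is only an illustration: the sample bytes are
made up (a Latin-1 encoded name, which is not valid UTF-8), and no stream or
file is involved.

```python
# Hypothetical sample: "café.txt" encoded as Latin-1, so 0xE9 is not valid UTF-8.
raw = b"caf\xe9.txt\n"

# strict (what open() keeps using): decoding fails on the first bad byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape (used for stdin/stdout in the UTF-8 mode): the undecodable
# byte 0xE9 is kept as the lone surrogate U+DCE9 ...
text = raw.decode("utf-8", errors="surrogateescape")

# ... and encoding with the same handler restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(strict_failed, ascii(text), round_tripped == raw)
```

So a tool that reads stdin and writes stdout passes such bytes through
unchanged, like cat or grep, while a text-mode open() on the same data
still fails early.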
>
>
> No change by default for best backward compatibility
> ----------------------------------------------------
>
> While UTF-8 is perfect in most cases, sometimes the locale encoding is
> actually the best encoding.
>
> This PEP changes the behaviour for the POSIX locale since this locale
> usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
> It does not change the behaviour for other locales to prevent any risk
> of regression.
>
> As users must explicitly enable the new UTF-8 mode in other locales, they
> are responsible for any potential mojibake issues caused by this mode.
>
>
> Proposal
> ========
>
> Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
> change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
> This mode is enabled by default in the POSIX locale, but otherwise
> disabled by default.
>
> The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
> variable are added. The UTF-8 mode is enabled by ``-X utf8`` or
> ``PYTHONUTF8=1``.
>
> The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
> can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
>
> For standard streams, the ``PYTHONIOENCODING`` environment variable has
> priority over the UTF-8 mode.
>
> On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable
> (:pep:`529`) has priority over the UTF-8 mode.
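
The effect of the command line option could be observed from a child
interpreter, roughly as below. This is a sketch under assumptions:
utf8_mode_in_child() is a hypothetical helper, the child must be an
interpreter that implements this PEP, and it reads the result from
sys.flags.utf8_mode (which CPython 3.7 and later provide).

```python
import subprocess
import sys

# Code run in the child: report whether the UTF-8 mode is active there.
CHILD_CODE = "import sys; print(getattr(sys.flags, 'utf8_mode', 'unknown'))"


def utf8_mode_in_child(*interpreter_args):
    """Start the current interpreter with extra options and return what it reports."""
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("-X utf8   ->", utf8_mode_in_child("-X", "utf8"))
    print("-X utf8=0 ->", utf8_mode_in_child("-X", "utf8=0"))
```

The same helper can be called without options while PYTHONUTF8 or LC_ALL is
varied in the environment, to see what the default is for a given locale.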
>
>
> Backward Compatibility
> ======================
>
> The only backward incompatible change is that the UTF-8 encoding is now
> used for the POSIX locale.
>
>
> Annex: Encodings And Error Handlers
> ===================================
>
> The UTF-8 mode changes the default encoding and error handler used by
> ``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
> ``sys.stdout`` and ``sys.stderr``.
>
> Encoding and error handler
> --------------------------
>
> ============================  =======================  ==========================
> Function                      Default                  UTF-8 mode or POSIX locale
> ============================  =======================  ==========================
> open()                        locale/strict            **UTF-8**/strict
> os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape
> sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**
> sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace
> ============================  =======================  ==========================
>
> By comparison, Python 3.6 uses:
>
> ============================  =======================  ==========================
> Function                      Default                  POSIX locale
> ============================  =======================  ==========================
> open()                        locale/strict            locale/strict
> os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape
> sys.stdin, sys.stdout         locale/strict            locale/**surrogateescape**
> sys.stderr                    locale/backslashreplace  locale/backslashreplace
> ============================  =======================  ==========================
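
The rows for the standard streams can be compared against a running
interpreter with a few lines of Python. This is a minimal sketch:
stream_settings() is a name invented here, and it only reports what the
current process uses, without saying why.

```python
import sys


def stream_settings():
    """Return {stream name: (encoding, error handler)} for the standard streams."""
    settings = {}
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name, None)
        # Streams may be None or replaced by objects without these attributes
        # (for example under some test runners), so fall back to None.
        settings[name] = (
            getattr(stream, "encoding", None),
            getattr(stream, "errors", None),
        )
    return settings


if __name__ == "__main__":
    for name, (encoding, errors) in stream_settings().items():
        print(f"sys.{name}: {encoding}/{errors}")
```

Running it under LC_ALL=C with and without PYTHONUTF8=0 should show which
column of the tables applies.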
>
> Encoding and error handler on Windows
> -------------------------------------
>
> On Windows, the encodings and error handlers are different:
>
> ============================  =======================  ==========================  ==========================
> Function                      Default                  Legacy Windows FS encoding  UTF-8 mode
> ============================  =======================  ==========================  ==========================
> open()                        mbcs/strict              mbcs/strict                 **UTF-8**/strict
> os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**            UTF-8/surrogatepass
> sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape       UTF-8/surrogateescape
> sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace      UTF-8/backslashreplace
> ============================  =======================  ==========================  ==========================
>
> By comparison, Python 3.6 uses:
>
> ============================  =======================  ==========================
> Function                      Default                  Legacy Windows FS encoding
> ============================  =======================  ==========================
> open()                        mbcs/strict              mbcs/strict
> os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**
> sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape
> sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace
> ============================  =======================  ==========================
>
> The "Legacy Windows FS encoding" is enabled by the
> ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable.
>
> If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
> ``sys.stdout`` use the ``mbcs`` encoding by default rather than UTF-8. But
> in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
> encoding.
>
> .. note::
>    There is no POSIX locale on Windows. The ANSI code page is used as the
>    locale encoding, and this code page never uses the ASCII encoding.
>
>
> Annex: Differences between PEP 538 and PEP 540
> ==============================================
>
> PEP 538's locale coercion is only effective if a suitable UTF-8
> based locale is available as a coercion target. PEP 540's
> UTF-8 mode can be enabled even for operating systems that don't
> provide a suitable platform locale (such as CentOS 7).
>
> PEP 538 only changes the interpreter's behaviour for the C locale. While the
> new UTF-8 mode of this PEP is only enabled by default in the C locale, it can
> also be enabled manually for any other locale.
>
> PEP 538 is implemented with ``setlocale(LC_CTYPE, "<coercion target>")`` and
> ``setenv("LC_CTYPE", "<coercion target>")``, so any non-Python code running
> in the process and any subprocesses that inherit the environment are impacted
> by the change. PEP 540 is implemented in Python internals and ignores the
> locale: non-Python code running in the same process is not aware of the
> "Python UTF-8 mode". The benefit of the PEP 538 approach is that it helps
> ensure that encoding handling in binary extension modules and subprocesses
> is consistent with CPython's encoding handling. The upside of the PEP 540
> approach is that it allows an embedding application to change the
> interpreter's behaviour without having to change the process global
> locale settings.
>
>
> Links
> =====
>
> * `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
>   <http://bugs.python.org/issue29240>`_
> * `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
>   "Coercing the legacy C locale to C.UTF-8"
> * `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
>   "Change Windows filesystem encoding to UTF-8"
> * `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
>   "Change Windows console encoding to UTF-8"
> * `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
>   "Non-decodable Bytes in System Character Interfaces"
>
>
> Post History
> ============
>
> * 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
>   <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
> * 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
>   540 (assuming UTF-8 for *nix system boundaries)
>   <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
> * 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
>   <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
> * 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
>   C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
> * 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows
>   to UTF-8 (msg272916) <https://bugs.python.org/issue27781#msg272916>`_
>   -- Victor proposed ``-X utf8`` for the :pep:`529` (Change Windows
>   filesystem encoding to UTF-8)
>
>
> Copyright
> =========
>
> This document has been placed in the public domain.