[Python-Dev] PEP 538: Coercing the legacy C locale to a UTF-8 based locale

Mon May 8 01:34:01 EDT 2017

On 7 May 2017 at 15:22, INADA Naoki <songofacandy at gmail.com> wrote:
> Hi, Nick.
>
> After thinking about relationship between PEP 538 and 540 in two days,
> I came up with idea which removes locale coercion by default from PEP 538,
> it does just enables UTF-8 mode and show warning about C locale.
>
> Of course, this idea is based on PEP 540.  There are no "If PEP 540 is
> rejected".
>
> How do you think?

The main problems I see with this approach are:

1. There's no way to configure earlier Python versions to emulate PEP
540. It's a completely new mode of operation.
2. PEP 540 isn't actually defined yet (Victor is still working on it).
3. Due to 1 and 2, PEP 540 isn't something 3.6 redistributors can
experiment with backporting to a narrower target audience.

By contrast, you can emulate PEP 538 all the way back to Python 3.1 by
setting the following environment variables:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8
    PYTHONIOENCODING=utf-8:surrogateescape

(assuming your platform provides a C.UTF-8 locale and you don't need
to run any Python 2.x components in that same environment)
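
As a rough sketch of applying the same settings to a child process
from Python rather than from the shell (the helper names
`pep538_style_env` and `report_child_stdout_config` are made up for
illustration, and this assumes the platform provides a C.UTF-8
locale):

```python
import os
import subprocess
import sys


def pep538_style_env(base_env=None):
    """Return a copy of the environment with the three settings above.

    Hypothetical helper for illustration only; not part of PEP 538.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LC_ALL"] = "C.UTF-8"
    env["LANG"] = "C.UTF-8"
    env["PYTHONIOENCODING"] = "utf-8:surrogateescape"
    return env


def report_child_stdout_config(python=sys.executable):
    """Run a child interpreter and report its stdout encoding and errors."""
    code = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"
    result = subprocess.run(
        [python, "-c", code],
        env=pep538_style_env(),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()
```

Passing the path of an older 3.x interpreter as `python` is one way
to check how that release reacts to the settings.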

I think the specific concerns you raise below are valid though, and
I'd be happy to amend PEP 538 to address them all.

> If it make sense, I want to postpone PEP 538 until PEP 540 is
> accepted or rejected, or merge PEP 538 into PEP 540.
>
>
> ## Background
>
> Locale coercion in current PEP 538 has some downsides:
>
> * If user set `LANG=C LC_DATE=ja_JP.UTF-8`, locale coercion may
>   overrides LC_DATE.

The fact it sets "LC_ALL" has previously been raised as a concern with
PEP 538, so it probably makes sense to drop that aspect and just
override "LANG". The scenarios where it makes a difference are
incredibly obscure (involving non-default SSH locale forwarding
settings for folks using SSH on Mac OS X to connect to remote Linux
systems), while just setting "LANG" will be sufficient to address the
"LANG=C" case that is the main driver for the PEP.

That means in the case above, the specific LC_DATE setting would still
take precedence.
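
To spell out the precedence involved: POSIX gives LC_ALL priority over
the per-category variables, which in turn take priority over LANG. A
simplified model of that lookup (the function name is hypothetical,
the example uses LC_TIME, the standard POSIX category for date and
time formatting, and the fallback to "C" when nothing is set is an
assumption about typical platforms):

```python
def effective_locale(category, env):
    """Model the POSIX lookup order: LC_ALL, then the category, then LANG."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty both mean "keep looking"
            return value
    return "C"  # assumed default when nothing is configured


user_env = {"LANG": "C", "LC_TIME": "ja_JP.UTF-8"}

# Overriding only LANG leaves the user's category-specific setting alone.
lang_only = dict(user_env, LANG="C.UTF-8")

# Setting LC_ALL overrides every category, including the user's choice.
with_lc_all = dict(user_env, LC_ALL="C.UTF-8")
```

With `lang_only`, LC_CTYPE resolves to C.UTF-8 while LC_TIME stays at
the user's value; with `with_lc_all`, both resolve to C.UTF-8.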

> * It makes behavior divergence between standalone and embedded
>   Python.

Such divergence already exists, only in the other direction: embedding
applications may override the runtime's default settings, either by
setting a particular locale, or by using Py_SetStandardStreamEncoding
(which was added specifically to make it easy for Blender to force the
use of UTF-8 on the embedded Python's standard streams, regardless of
the current locale).

That said, this is also the rationale for my suggestion that we expose
locale coercion as a public API:

    if (Py_LegacyLocaleDetected()) {
        Py_CoerceLegacyLocale();
    }

That would make it straightforward for any embedding application that
wanted to do so to replicate the behaviour of the standard CLI.

The level of divergence is also mitigated by the point in the next section.

> * Parent Python process may use utf-8:surrogateescape, but child process
>   Python may use utf-8:strict.  (Python 3.6 uses ascii:surrogateescape in
>   both of parent and children).

This discrepancy is gone now thanks to your suggestion of making
"surrogateescape" the default standard stream handler when one of the
coercion target locales is explicitly configured - both parent
processes and child processes end up with "utf-8:surrogateescape"
configured on the standard streams.
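
For anyone unfamiliar with the difference between the two handlers,
here's a small illustration using plain bytes rather than the real
standard streams (the sample data is arbitrary):

```python
raw = b"caf\xe9"  # Latin-1 encoded bytes, not valid UTF-8

# "strict" refuses data that isn't valid UTF-8.
try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    strict_error = exc
else:
    strict_error = None

# "surrogateescape" smuggles the undecodable byte through as a lone
# surrogate code point, so it can be written back out unchanged.
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

The practical upshot is that a parent and child that both use
"utf-8:surrogateescape" can pass such bytes along without either side
raising an exception.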

> On the other hand, benefits from locale coercion is restricted:
>
> * When locale coercion succeeds, warning is always shown.
>   To hide the warning, user must disable coercion in some way.
>   (e.g. use UTF-8 locale explicitly, or set PYTHONCOERCECLOCALE=0).

The current warning is based on what we think is appropriate for
Fedora downstream, but that doesn't necessarily mean it's the right
approach for Python upstream, especially if the LC_ALL override is
dropped. We could also opt for a model where Python 3.7 emits the
coercion warning, but Python 3.8 just does the coercion silently (that
rationale would then also apply to PEP 540 - we'd warn on stderr about
the change in default behaviour in 3.7, but take the new behaviour for
granted in 3.8).

The change to make the standard stream error handler setting depend
solely on the currently configured locale also helps here, since it
means it doesn't matter how a process reached the state of having the
locale set to "C.UTF-8". CPython will behave the same way regardless,
so it makes it less important to provide an explicit notice that
coercion took place.

> So I feel benefit / complexity ratio of locale coercion is less than
> UTF-8 mode.

It isn't an either/or though - we're entirely free to do both, one
based solely on the existing configuration options that have been
around since 3.1, and the other going beyond those to also adjust the
default behaviour of other interfaces (like "open()").
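
As a small illustration of the "open()" case: on the 3.x releases
current as of this thread, text mode "open()" without an explicit
encoding falls back to `locale.getpreferredencoding(False)`, so the
result depends on the locale. Passing the encoding explicitly is the
existing way to avoid that dependency (the temporary file here is
just scaffolding for the example):

```python
import locale
import os
import tempfile

# What open() would use for text files if no encoding were passed.
default_text_encoding = locale.getpreferredencoding(False)

text = "ℙƴ☂ℌøἤ"
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # An explicit encoding makes the result independent of the locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        read_back = f.read()
finally:
    os.remove(path)
```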

> But locale coercion works nice on Android.  And there are some Android-like
> Unix systems (container or small device) that C.UTF-8 is always proper locale.
>
> ## Rough spec
>
> * Make Android-style locale coercion (forced, no warning) is now
>   build option.  Some users who build Python for container or small device
>   may like it.

But do we *want* to support the legacy C locale in 3.7+? I don't think
we do, because it will never work properly for our purposes as long as
it assumes ASCII as the default text encoding.

Part of the motivation for making locale coercion the default is so we
can update PEP 11 to make it clear that running in the legacy C locale
is no longer an officially supported configuration.

> * Normal Python build doesn't change locale.  When python executable is
>   run in C locale, show locale warning.  locale warning can be disabled
>   as current PEP 538.

That still pushes the problem back on end users to fix, though, rather
than just automatically making things like GNU readline integration
work.

> * User can disable automatic UTF-8 mode by setting PYTHONUTF8=0
>   environment variables.  User can hide warning by setting
>   PYTHONUTF8=1 too.

I think I need to better explain in the PEP why PEP 540's UTF-8 mode
on its own won't be enough, as it doesn't necessarily handle
locale-aware extension modules like GNU readline (this came up in the
draft PR review, but I never added anything specifically to the PEP
about it), and also doesn't help at all with invocation of older 3.x
releases in a subprocess.

Here's an interactive session from a PEP 538 enabled CPython, where
each line after the first is executed by doing "up-arrow,
4xleft-arrow, delete, enter":

    $ LANG=C ./python
    Python detected LC_CTYPE=C: LC_ALL & LANG coerced to C.UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behavior).
    Python 3.7.0a0 (heads/pep538-coerce-c-locale:188e780, May  7 2017, 00:21:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Not exactly exciting, but this is what currently happens on an older
release if you only change the Python level stream encoding settings
without updating the locale settings:

    $ LANG=C PYTHONIOENCODING=utf-8:surrogateescape python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌ�")
     File "<stdin>", line 0

       ^
    SyntaxError: 'utf-8' codec can't decode bytes in position 20-21: invalid continuation byte

That particular misbehaviour is coming from GNU readline, *not*
CPython - because the editing wasn't UTF-8 aware, it corrupted the
history buffer and fed such nonsense to stdin that even the
surrogateescape error handler was bypassed. While PEP 540's UTF-8 mode
could technically be updated to also reconfigure readline, that's
*one* extension module, and only when it's running directly as part of
Python 3.7.
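
A simplified model of what goes wrong, assuming the editor moves and
deletes one *byte* at a time when it doesn't know the input is UTF-8
(this is only an approximation for illustration, not how readline is
actually implemented, and `delete_from_end` is a made-up helper):

```python
line = 'print("ℙƴ☂ℌøἤ")'


def delete_from_end(buffer, steps_left):
    """Move `steps_left` units back from the end, then delete that unit.

    Works on str (units are code points) or bytes (units are bytes).
    """
    pos = len(buffer) - steps_left
    return buffer[:pos] + buffer[pos + 1:]


# UTF-8 aware editing: four characters left lands on "ø" and removes it.
char_aware = delete_from_end(line, 4)

# Byte-wise editing: four bytes left lands in the middle of "ἤ" and
# removes one byte of its three-byte sequence.
byte_wise = delete_from_end(line.encode("utf-8"), 4)

try:
    byte_wise.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc
else:
    decode_error = None
```

In this model the character-aware edit gives the same line as the
second prompt in the PEP 538 session, while the byte-wise edit leaves
a truncated multi-byte sequence at byte offsets 20 and 21 that is no
longer valid UTF-8.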

By contrast, using a more appropriate locale setting already gets
readline to play nice, even when it's running inside Python 3.5:

    $ LANG=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Don't get me wrong, I'm definitely a fan of PEP 540, as it extends
much of what PEP 538 covers beyond the standard streams and also
applies it to other operating system interfaces without relying on the
underlying operating system to provide a UTF-8 based locale. However,
I also expect it to be plagued by extension module compatibility
issues if folks attempt to use it standalone, without locale coercion
to reconfigure the behaviour of extension modules appropriately.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia