[Python-Dev] PEP 538: Coercing the legacy C locale to a UTF-8 based locale

Sat Mar 11 17:36:41 EST 2017

This is a very bad idea.

It seems to be based on the assumption that the C locale is always some kind of 
pathology. Admittedly, it sometimes is a result of misconfiguration or a 
mistake. (But I don't see why it's the interpreter's job to correct such 
mistakes.) However, in some cases the C locale is a normal environment for 
system services, cron scripts, distro package builds and whatnot.

It's possible to write Python programs that are locale-agnostic.
It's also possible to write programs that are locale-dependent, but handle 
ASCII as locale encoding gracefully.
Or you might want to write a program that intentionally aborts with an 
explanatory error message when the locale encoding doesn't have sufficient 
Unicode coverage. ("Errors should never pass silently" anyone?)

With this proposal, it seems that none of the above can be implemented 
correctly in Python.
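
To make the third case concrete, here is a minimal sketch of such a program's 
startup check. The function names and the sample string are hypothetical, 
chosen only for illustration. It assumes locale.getpreferredencoding(False) 
reports the encoding the interpreter is actually using, which is exactly the 
thing that coercion would change behind the program's back.

```python
import locale
import sys


def locale_encoding_covers(sample, encoding=None):
    """Return True if `sample` is encodable in the given (or locale) encoding."""
    if encoding is None:
        # Ask for the locale encoding without calling setlocale() ourselves.
        encoding = locale.getpreferredencoding(False)
    try:
        sample.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def require_unicode_locale(sample="\u0142\u00f3\u20ac", encoding=None):
    """Abort with an explanatory message if the encoding cannot handle `sample`."""
    if not locale_encoding_covers(sample, encoding):
        name = encoding or locale.getpreferredencoding(False)
        sys.exit("error: locale encoding %s cannot represent %r; "
                 "please run this program under a UTF-8 locale" % (name, sample))
```

Under a genuine C locale the check fails loudly. Under coercion it would pass 
without the user ever having configured anything.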

* Nick Coghlan <ncoghlan at gmail.com>, 2017-03-05, 17:50:
>Another common failure case is developers specifying ``LANG=C`` in order to 
>see otherwise translated user interface messages in English, rather than the 
>more narrowly scoped ``LC_MESSAGES=C``.

Setting LANGUAGE=en might be better, because it doesn't affect locale encoding 
either, and it works even when LC_ALL is set.

>Three such locales will be tried:
>
>* ``C.UTF-8`` (available at least in Debian, Ubuntu, and Fedora 25+, and 
>expected to be available by default in a future version of glibc)
>* ``C.utf8`` (available at least in HP-UX)
>* ``UTF-8`` (available in at least some \*BSD variants)

Calling the C locale "legacy" is a bit unfair when there isn't even agreement 
on what the name of the successor is supposed to be...

NB, both "C.UTF-8" and "C.utf8" work on Fedora, thanks to glibc normalizing the 
encoding part. Only "C.UTF-8" works on Debian, though, for whatever reason.
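
If you want to check which of the three spellings a given system accepts, 
something like the following sketch should do. The helper name is made up for 
illustration. It only probes LC_CTYPE and restores the previous setting 
afterwards, and the result naturally depends on the libc and on which locales 
are installed.

```python
import locale

# The three candidate names listed in the PEP.
CANDIDATES = ("C.UTF-8", "C.utf8", "UTF-8")


def available_locales(names=CANDIDATES):
    """Return the subset of `names` that setlocale(LC_CTYPE, ...) accepts."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
    found = []
    try:
        for name in names:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this spelling is not supported here
            found.append(name)
    finally:
        # Put the process back the way we found it.
        locale.setlocale(locale.LC_CTYPE, saved)
    return found


if __name__ == "__main__":
    print(available_locales())
```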

>For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by actually 
>setting the ``LANG`` and ``LC_ALL`` environment variables to the candidate 
>locale name,

Sounds wrong. This will override all LC_*, even if they were originally set to 
something different than C.
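
The reason is the standard POSIX precedence: a non-empty LC_ALL wins over the 
per-category LC_* variable, which in turn wins over LANG. A small sketch of 
that lookup follows. The function name and the example environment (including 
the pl_PL.UTF-8 value) are hypothetical, and the final "C" fallback is an 
assumption, since POSIX leaves the default implementation-defined.

```python
def effective_locale(env, category):
    """Return the locale name in effect for `category` (e.g. "LC_COLLATE")."""
    # POSIX lookup order; unset and empty variables are both skipped.
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"  # assumed default when nothing is set


# Hypothetical user environment: C for LC_CTYPE, but Polish collation.
before = {"LANG": "C", "LC_COLLATE": "pl_PL.UTF-8"}
# What the quoted coercion would turn it into.
after = dict(before, LANG="C.UTF-8", LC_ALL="C.UTF-8")

print(effective_locale(before, "LC_COLLATE"))  # pl_PL.UTF-8
print(effective_locale(after, "LC_COLLATE"))   # C.UTF-8
```

So the user's explicit LC_COLLATE is silently lost, even though only LC_CTYPE 
was the problem.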

>Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another locale 
>or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

Comma splice.

s/set/was set/ would probably make it clearer.

>Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale or 
>PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

Ditto.

>The second sentence providing recommendations would be conditionally compiled 
>based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD 
>systems.

Note that at least OpenBSD supports both "C.UTF-8" and "UTF-8" locales.

>While this PEP ensures that developers that need to do so can still opt-in to 
>running their Python code in the legacy C locale,

Yeah, no, it doesn't.

It's impossible to disable coercion from Python code, because it happens too 
early. The best you can do is to write a wrapper script in a different language 
that sets PYTHONCOERCECLOCALE=0; but then you still get a spurious warning.

-- 
Jakub Wilk