[Python-ideas] RFC: PEP 540 version 3 (Add a new UTF-8 mode)

Thu Jan 12 12:10:35 EST 2017

2017-01-12 17:10 GMT+01:00 Oleg Broytman <phd at phdru.name>:
>> Does it work to use a locale with encoding A for LC_CTYPE and a locale
>> with encoding B for LC_MESSAGES (and others)? Is there a risk of
>
>    It does when B is a subset of A (ascii and koi8; ascii and utf8, e.g.)

My question is more about the case where encodings A and B are not compatible.
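
For reference, the compatible case mentioned above can be sketched as follows. This is only a minimal Python 3 illustration under my own assumptions: the sample string and the list of codecs are arbitrary and do not come from the thread. Pure ASCII bytes decode to the same text under ASCII, KOI8-R and UTF-8, which is why mixing such locales is harmless.

```python
# Sketch (Python 3): ASCII is a subset of both KOI8-R and UTF-8, so bytes
# produced under an ASCII locale decode identically under either of them.
# The sample text is an arbitrary, hypothetical example.
data = "Thu Jan 12".encode("ascii")

decoded = {codec: data.decode(codec) for codec in ("ascii", "koi8_r", "utf-8")}
for codec, text in decoded.items():
    print("%s: %r" % (codec, text))

# True when every codec produced the same text.
all_equal = len(set(decoded.values())) == 1
print(all_equal)
```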

Ah yes, date, thank you for the example. Here is my example using the
LC_TIME locale to format a date and the LC_CTYPE locale to decode the
resulting byte string:

date.py:
---
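# Python 2 only: time.strftime() returns a byte string there
import locale
import time

# Apply the locale settings from the environment (LANG, LC_*)
locale.setlocale(locale.LC_ALL, "")
# Abbreviated weekday name, encoded in the LC_TIME locale encoding
b = time.strftime("%a")
# Encoding of the LC_CTYPE locale
encoding = locale.getpreferredencoding()
try:
    u = b.decode(encoding)
except UnicodeError:
    u = '<failed to decode>'
else:
    u = repr(u)
print("bytes: %r, text: %s, encoding: %r" % (b, u, encoding))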
---

When all locales use the same encoding, it works fine; 목 (U+BAA9) is the
expected result:

$ LC_TIME=ko_KR.euckr LANG=ko_KR.euckr python2 date.py
bytes: '\xb8\xf1', text: u'\ubaa9', encoding: 'EUC-KR'

You get mojibake if LC_CTYPE uses the Latin1 encoding whereas LC_TIME
uses the EUC-KR encoding: the result is "¸ñ" (U+00B8, U+00F1).

$ LC_TIME=ko_KR.euckr LANG=fr_FR python2 date.py
bytes: '\xb8\xf1', text: u'\xb8\xf1', encoding: 'ISO-8859-1'

Decoding can also fail with UnicodeDecodeError, here when LC_CTYPE uses UTF-8:

$ LC_TIME=ko_KR.euckr LANG=fr_FR.UTF-8 python2 date.py
bytes: '\xb8\xf1', text: <failed to decode>, encoding: 'UTF-8'
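
The three runs above can be reproduced without installing any locale. The following is a minimal Python 3 sketch, not the script from this message: it hardcodes the two bytes reported above, and the helper name try_decode is hypothetical. It decodes the same bytes with each encoding in turn.

```python
# Sketch (Python 3): decode the bytes reported above with each encoding.
# The bytes are hardcoded so no ko_KR or fr_FR locale needs to be installed.
raw = b"\xb8\xf1"  # strftime("%a") output under LC_TIME=ko_KR.euckr


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


results = {enc: try_decode(raw, enc) for enc in ("euc_kr", "iso-8859-1", "utf-8")}
for enc, text in results.items():
    # ascii() keeps the output readable on terminals without Korean fonts
    print("%s: %s" % (enc, ascii(text)))
```

EUC-KR gives the expected U+BAA9, ISO-8859-1 silently gives the two-character mojibake, and UTF-8 rejects the bytes because 0xB8 cannot start a UTF-8 sequence.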

Well, since we are talking about the POSIX locale, which usually uses
ASCII, it shouldn't be an issue in practice for PEP 538. I was
just curious :-)

Victor