[Python-ideas] PEP 540: Add a new UTF-8 mode

Thu Jan 12 21:01:01 EST 2017

On Fri, Jan 13, 2017 at 12:12 AM, Victor Stinner
<victor.stinner at gmail.com> wrote:
> 2017-01-12 1:23 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:
>> I'm ±0 to surrogateescape by default.  I feel +1 for stdout and -1 for stdin.
>
> The use case is to be able to write a Python 3 program which works
> with UNIX pipes without failing with encoding errors:
> https://www.python.org/dev/peps/pep-0540/#producer-consumer-model-using-pipes
>
> If you want something stricter, there is the UTF-8 Strict mode which
> prevents mojibake everywhere. I'm not sure that the UTF-8 Strict mode
> is really useful. When I implemented it, I quickly understood that
> using strict *everywhere* is just a dead end: it would fail in too many
> places.
> https://www.python.org/dev/peps/pep-0540/#use-the-strict-error-handler-for-operating-system-data
>
> I'm not even sure yet that a Python 3 with stdin using strict is "usable".
>

I hope http://bugs.python.org/issue15216 gets merged in 3.7.
It lets an application select the error handler through a straightforward API.
So the question is: which handler should be the default?

* Programs like `ls` can opt in to surrogateescape.
* Programs that want to output valid UTF-8 can opt out of surrogateescape.

And I feel the former (strict by default, with opt-in) is better, in keeping
with the Zen of Python.
But it's not a strong opinion.
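
A rough sketch of the opt-in case, assuming the API is a
`reconfigure(errors=...)` method on text streams, as found on
`io.TextIOWrapper` in Python 3.7 and later. The in-memory stream stands in
for sys.stdout, and the helper name `encode_name` is made up for illustration:

```python
import io

def encode_name(name, errors=None):
    """Write `name` to an in-memory stand-in for stdout; return the bytes."""
    raw = io.BytesIO()
    # A new text stream uses the strict error handler unless told otherwise.
    stream = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
    if errors is not None:
        # Opt in to another handler on the already-created stream.
        stream.reconfigure(errors=errors)
    stream.write(name)
    stream.flush()
    return raw.getvalue()

# A filename with one undecodable byte (0xE9), decoded the way filenames are.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

try:
    encode_name(name)
    strict_failed = False
except UnicodeEncodeError:
    # Strict refuses to write the escaped byte.
    strict_failed = True

# An `ls`-like program opts in and gets the original bytes back.
opted_in = encode_name(name, errors="surrogateescape")
```

With strict as the default, the `ls`-like program needs one extra call, while
a program that expects clean UTF-8 needs none.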

>
>> In the output case, surrogateescape is weaker than strict, but it only allows
>> surrogateescaped binary.  If programs carefully use surrogateescaped decode,
>> surrogateescape on stdout is safe enough.
>
> What do you mean by "carefully use surrogateescaped decode"?
>
> The rationale for using surrogateescape on stdout is to support this use case:
> https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout

Applications which are intended to output surrogateescaped data (filenames)
should use surrogateescape, surely.
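
To illustrate what "only allows surrogateescaped binary" means on output,
here is a small sketch at the codec level. The helper name `can_write` is
hypothetical; the point is that the handler only encodes code points in
U+DC80..U+DCFF, which are the ones a surrogateescape decode produces:

```python
def can_write(text):
    """Return True if `text` encodes to UTF-8 with surrogateescape."""
    try:
        text.encode("utf-8", "surrogateescape")
        return True
    except UnicodeEncodeError:
        return False

# An undecodable byte becomes a lone low surrogate and round-trips.
escaped = b"\xff".decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")

# A surrogate that did not come from surrogateescape decoding is still
# rejected, so the handler is weaker than strict but not a free pass.
stray_ok = can_write("\ud800")
escaped_ok = can_write(escaped)
```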

But some applications are intended to live in a UTF-8 world.
For example, think about an application that reads UTF-8 CSV and inserts it
into a database.

When a CSV file accidentally encoded in Shift_JIS is passed to stdin,
an error is better than silently inserting it into the database.
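
A minimal sketch of that CSV case, decoding raw bytes directly instead of
going through stdin. The row content and the helper name `decode_row` are
invented for illustration:

```python
def decode_row(data, errors):
    """Decode one CSV row as UTF-8 with the given error handler."""
    return data.decode("utf-8", errors)

# One CSV row ("1," followed by the word for "Japanese") that was
# encoded as Shift_JIS by mistake.
sjis_row = "1,\u65e5\u672c\u8a9e\n".encode("shift_jis")

try:
    decode_row(sjis_row, "strict")
    strict_rejected = False
except UnicodeDecodeError:
    # Strict stops here, before anything reaches the database.
    strict_rejected = True

# surrogateescape accepts the same bytes and returns a str that
# contains lone surrogates in place of the undecodable bytes.
lenient = decode_row(sjis_row, "surrogateescape")
has_escapes = any("\udc80" <= ch <= "\udcff" for ch in lenient)
```

With surrogateescape on stdin the bad row only fails later, if at all, when
something tries to encode it strictly.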

>
>> On the other hand, surrogateescape is very weak for input.  It accepts
>> arbitrary bytes.
>> It should be used carefully.
>
> In my experience with the Python bug tracker, almost nobody
> understands Unicode and locales. For the "Producer-consumer model
> using pipes" use case, encoding issues of Python 3.6 can be a blocker
> issue. Some developers may prefer a different programming language
> which doesn't bother them with Unicode: basically, *all* other
> programming languages, no?
>

I agree.  Some developers prefer other languages (or Python 2) to Python 3,
because "Unicode by default doesn't fit POSIX".

Both "strict by default" and "weak by default" have downsides.

>
>> But I agree that using different error handlers for stdin and stdout is
>> not beautiful.
>> That's why I'm ±0.
>
> That's why there are two modes: UTF-8 and UTF-8 Strict. But I'm not
> 100% sure yet which encodings and error handlers should be used
> ;-) I started to play with my PEP 540 implementation. I already had to
> update PEP 540 and its implementation for Windows. On Windows,
> os.fsdecode/fsencode now use surrogatepass, not surrogateescape
> (Python 3.5 uses strict on Windows).
>
> Victor